Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add fsid implementation #698

Closed
wants to merge 3 commits into from

Conversation

jwerbin
Copy link

@jwerbin jwerbin commented Feb 4, 2023

Fixes issue #699
As of the latest fsspec release (2023.1.0) a new abstract property was added to the AbstractFileSystem base class.
s3fs should implement this property to prevent problems for users.

"""Persistent filesystem id that can be used to compare filesystems
across sessions.
"""
return "s3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jwerbin , glad to see that someone else has a use case for this 🙂 What do you use it for, btw?

Also, for s3, fsid should not be just "s3" (maybe you are looking for fs.protocol if this was enough?). It should instead take into account endpoint as well. The idea for fsid is to be able to identify the filesystem across different runs and s3fs is really determined by the endpoint used (at the very least). fsspec/filesystem_spec#773

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop
Sorry to say that I don't have specific use case for this but I like it as an idea.

I got tripped up because I have a use case for the S3Filesystem where I wanted to wrap all the methods of an instance.
When I do the object introspection I ended up triggering S3FileSystem.fsid().

Which threw. So I figured I would come implement it.

Happy to add the endpoint.
I am not 100% sure on how to add it correctly.

Tell me if this seems reasonable.

if S3FileSystem._s3 object exists we should be able to get the endpoint as follows
endpoint = repr(self._s3._client._endpoint)

If none then either return "" or "Session not set"

Then return something like "s3" + (f"::{endpoint}" if endpoint else ""

For your specified use in DVC is that enough to capture the specific location?

Or I should it just be forced to try to connect when trying to access the fsid?
Will this create some odd behavior as the object connects and disconnects over its' lifetime?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwerbin Not all methods should be implemented though. You will find multiple cases where filesystems don't implement some methods or properties, it is normal.

Looks like you don't need to implement fsid, but rather need to account for NotImplemented error in your code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop Ah this might be my confusion on conventions. I thought methods that raise NotImplemented errors in a base class (AbstractFileSystem) was a pythonic way to indicate that all derived classes should implement this method.

I am happy to proceed if you think the fsid approach that I proposed for the s3fs seems reasonable to you and is helpful for the community.
"s3" + (f"::{endpoint}" if endpoint else ""

I have already built a work around into my code in the near term.
So I can drop it if you don't think it will add value to s3fs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwerbin It should be a little bit more involved then just adding the endpoint (per spec discussed in the original ticket above). Since you don't really need it, let's just close this PR then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like:

from fsspec.utils import tokenize

return f"s3-{tokenize(self.endpoint_url)}"

should be enough.

My point was that if you don't need it - probably no reason to implement it right now, because it means that it will be dead code for some time. If you are keen on contributing anyways, feel free to reopen a PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For s3 specifically, I would suggest that whether an instance is anonymous or not, and maybe the value of profile, if given, are also contenders for being included.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant But that goes against the idea of fssid of being able to identify the fs regardless of creds. If the endpoint is the same, it shouldn't matter if you are alice or bob, fsid should be the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is based on anecdotal usage: users sometimes do have different instances around for s3 with different profiles (say one for reading from a restricted bucket and another for writing elsewhere) even with the same end-point. Similarly, some find it useful to have an anon instance around for checking how what they have written works for non-authenticated users. This latter case is common for kerchunk users, where the data archive being scanned is public, but the references that point at it need to be written to s3 and made public (or requester-pays).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martindurant I see your point about the use case. It seems like it is a little different from fsid though. To summarize

  1. current fs_token is about caching filesystem instances during the same session
  2. fsid is about identifying filesystems across sessions regardless of creds
  3. another id, that you are talking about, is about identifying not filesystems, but filesystem instances across sessions

Use cases for 2 actually need to be creds agnostic, because multiple people can be working on the same project that uses the same storage but each user has its own creds. I think we've touched on this during fsid discussions before (e.g. fsspec/filesystem_spec#773 (comment)). Maybe you are talking about something like session_id described in the last link?

@efiop efiop closed this Feb 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants