add fsid implementation #698

jwerbin · 2023-02-04T00:43:05Z

Fixes issue #699
As of the latest fsspec release (2023.1.0) a new abstract property was added to the AbstractFileSystem base class.
s3fs should implement this property to prevent problems for users.

efiop · 2023-02-04T01:06:07Z

s3fs/core.py

+        """Persistent filesystem id that can be used to compare filesystems
+        across sessions.
+        """
+        return "s3"


Hey @jwerbin , glad to see that someone else has a use case for this 🙂 What do you use it for, btw?

Also, for s3, fsid should not be just "s3" (maybe you are looking for fs.protocol if this was enough?). It should instead take into account endpoint as well. The idea for fsid is to be able to identify the filesystem across different runs and s3fs is really determined by the endpoint used (at the very least). fsspec/filesystem_spec#773

@efiop
Sorry to say that I don't have specific use case for this but I like it as an idea.

I got tripped up because I have a use case for the S3Filesystem where I wanted to wrap all the methods of an instance.
When I do the object introspection I ended up triggering S3FileSystem.fsid().

Which threw. So I figured I would come implement it.

Happy to add the endpoint.
I am not 100% sure on how to add it correctly.

Tell me if this seems reasonable.

if S3FileSystem._s3 object exists we should be able to get the endpoint as follows
endpoint = repr(self._s3._client._endpoint)

If none then either return "" or "Session not set"

Then return something like "s3" + (f"::{endpoint}" if endpoint else ""

For your specified use in DVC is that enough to capture the specific location?

Or I should it just be forced to try to connect when trying to access the fsid?
Will this create some odd behavior as the object connects and disconnects over its' lifetime?

@jwerbin Not all methods should be implemented though. You will find multiple cases where filesystems don't implement some methods or properties, it is normal.

Looks like you don't need to implement fsid, but rather need to account for NotImplemented error in your code.

@efiop Ah this might be my confusion on conventions. I thought methods that raise NotImplemented errors in a base class (AbstractFileSystem) was a pythonic way to indicate that all derived classes should implement this method.

I am happy to proceed if you think the fsid approach that I proposed for the s3fs seems reasonable to you and is helpful for the community.
"s3" + (f"::{endpoint}" if endpoint else ""

I have already built a work around into my code in the near term.
So I can drop it if you don't think it will add value to s3fs.

@jwerbin It should be a little bit more involved then just adding the endpoint (per spec discussed in the original ticket above). Since you don't really need it, let's just close this PR then.

I think something like:

from fsspec.utils import tokenize return f"s3-{tokenize(self.endpoint_url)}"

should be enough.

My point was that if you don't need it - probably no reason to implement it right now, because it means that it will be dead code for some time. If you are keen on contributing anyways, feel free to reopen a PR.

For s3 specifically, I would suggest that whether an instance is anonymous or not, and maybe the value of profile, if given, are also contenders for being included.

@martindurant But that goes against the idea of fssid of being able to identify the fs regardless of creds. If the endpoint is the same, it shouldn't matter if you are alice or bob, fsid should be the same.

This is based on anecdotal usage: users sometimes do have different instances around for s3 with different profiles (say one for reading from a restricted bucket and another for writing elsewhere) even with the same end-point. Similarly, some find it useful to have an anon instance around for checking how what they have written works for non-authenticated users. This latter case is common for kerchunk users, where the data archive being scanned is public, but the references that point at it need to be written to s3 and made public (or requester-pays).

@martindurant I see your point about the use case. It seems like it is a little different from fsid though. To summarize

current fs_token is about caching filesystem instances during the same session

fsid is about identifying filesystems across sessions regardless of creds

another id, that you are talking about, is about identifying not filesystems, but filesystem instances across sessions

Use cases for 2 actually need to be creds agnostic, because multiple people can be working on the same project that uses the same storage but each user has its own creds. I think we've touched on this during fsid discussions before (e.g. fsspec/filesystem_spec#773 (comment)). Maybe you are talking about something like session_id described in the last link?

jwerbin added 2 commits February 3, 2023 16:40

Adds failing test

1aba2e6

add fsid implementation

084c41a

efiop reviewed Feb 4, 2023

View reviewed changes

Needed property decorator

fffedcf

efiop closed this Feb 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add fsid implementation #698

add fsid implementation #698

jwerbin commented Feb 4, 2023 •

edited

Loading

efiop Feb 4, 2023

jwerbin Feb 4, 2023

efiop Feb 5, 2023

jwerbin Feb 5, 2023

efiop Feb 5, 2023

efiop Feb 6, 2023

martindurant Feb 6, 2023

efiop Feb 6, 2023

martindurant Feb 6, 2023

efiop Feb 6, 2023

add fsid implementation #698

add fsid implementation #698

Conversation

jwerbin commented Feb 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jwerbin commented Feb 4, 2023 •

edited

Loading