Description
This is a generic tracking issue for improvements to TensorBoard's Python filesystem abstraction. (This doesn't affect Rustboard, which today should be the read path for regular users, but some users still have to use the Python read path because they can't use Rustboard, and the write path for TensorBoard's summary APIs, which can write without TensorFlow, is still entirely in Python.)
Currently, TensorBoard's Python code relies on the TensorFlow filesystem API via `tf.io.gfile`, which has built-in support for a number of protocols besides local disk. When TensorFlow is not available, we fall back to a stub implementation of the `tf.io.gfile` API in `tensorboard/compat/tensorflow_stub/io/gfile.py`, which today only supports local disk and S3. The stub implementation achieves some reuse between filesystem implementations by having an intermediate set of methods that both implementations expose, but this abstraction A) isn't even defined other than by virtue of common methods (i.e. there's no abstract base class or spec), and B) has a number of issues.
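For concreteness, here's a minimal sketch (not taken from TensorBoard's actual sources) of the usage pattern in question: callers go through the `tensorboard.compat` shim, which resolves to real TensorFlow when it's installed and to the stub otherwise. The log directory and glob pattern below are made up.

```python
# Minimal illustration of the pattern at issue; the logdir and pattern below
# are hypothetical, but the tf.io.gfile calls are the shared surface that both
# real TF and the stub expose.
from tensorboard.compat import tf  # resolves to TensorFlow or the stub

logdir = "s3://my-bucket/runs"  # scheme support depends on which backend wins

if tf.io.gfile.exists(logdir):
    for event_file in tf.io.gfile.glob(logdir + "/*/events.out.tfevents.*"):
        # Both backends also expose a GFile class, but there is no abstract
        # base class or spec pinning down the exact semantics of any of this.
        with tf.io.gfile.GFile(event_file, "rb") as f:
            header = f.read(8)
```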
Here are a few of the issues with that abstraction:
- underspecified behavior in a number of places (e.g. what characters glob supports), which makes it harder to guarantee much in the way of consistency across implementations
- little clarity/documentation about how filenames are treated as bytes vs strings
- no "file" object at all, which is mostly ok for remote filesystems where everything is an RPC anyway, but is significantly worse for filesystems where there actually might be local state between, say, multiple reads or writes to a single file (e.g. local file descriptors, buffers, caches, etc.). Right now we ignore all of that, so it's conceivable that in cases we end up with quadratic behavior where each read at offset N is independent and must first seek past all the previous data.
- no `walk()` method for filesystems, just a function, as pointed out in #5248 (comment), so filesystems aren't able to take advantage of fast paths they might support
We already have a goal (bottom of #3666) to ultimately migrate away from the stub approach and instead properly inject a filesystem abstraction into the places that do I/O. As part of that, we should consider revisiting this abstraction and formalizing it properly. One possibility is to just adopt the existing `fsspec` abstraction, since it's a whole project built around this idea: https://filesystem-spec.readthedocs.io/en/latest/
If we did that, we'd likely still want to be able to use the TF filesystem APIs when available (so that existing TensorBoard users who also have TensorFlow, and thus benefit from its built-in support for cloud filesystems, don't experience a regression). One open question, then, is whether it's possible to shoehorn the TF filesystem support into this setup.
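One very rough sketch of that "shoehorning" (hypothetical, not a vetted design): wrap `tf.io.gfile` in an `fsspec.AbstractFileSystem` subclass so that TF-provided schemes remain reachable even if TensorBoard standardizes on the fsspec interface. The class name and the minimal method set here are assumptions for illustration only.

```python
# Hypothetical adapter sketch: expose TF's filesystem support through fsspec.
# Not an actual TensorBoard or TensorFlow API.
import fsspec
import tensorflow as tf


class TFGFileFileSystem(fsspec.AbstractFileSystem):
    """Delegates a minimal subset of fsspec calls to tf.io.gfile."""

    def ls(self, path, detail=True, **kwargs):
        names = [path.rstrip("/") + "/" + n for n in tf.io.gfile.listdir(path)]
        if not detail:
            return names
        return [
            {"name": n, "type": "directory" if tf.io.gfile.isdir(n) else "file"}
            for n in names
        ]

    def exists(self, path, **kwargs):
        return tf.io.gfile.exists(path)

    def isdir(self, path):
        return tf.io.gfile.isdir(path)

    def _open(self, path, mode="rb", **kwargs):
        # Return TF's file object directly; fsspec callers treat it as
        # a file-like handle.
        return tf.io.gfile.GFile(path, mode)


# Hypothetical registration for a scheme TF supports but the local fsspec
# installation does not:
# fsspec.register_implementation("gs", TFGFileFileSystem)
```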
Title changed from "Streamline and improve TensorBoard's Python filesystem abstraction" to "Improve and formalize TensorBoard's Python filesystem abstraction".

nfelt commented on Jan 7, 2022
Some updates: we've now added fsspec filesystem support via #5248, but it's still optional (we don't depend on fsspec, so users have to install it themselves), and it's also only accessible when `tensorflow` is not installed, since the current `tensorboard.compat.tf` logic ensures we always use TF's filesystem API when it's available.

Also, to clarify the nature of the abstraction discussed in this issue: it isn't really a filesystem per se, or at least not in the usual sense. We actually want it to represent a one-level-higher abstraction, a "filesystem proxy" that can resolve scheme-prefixed paths to multiple concrete filesystems. (That said, the ambiguity in terminology arises because it is sort of tempting to shoehorn this proxy object itself into a filesystem interface - like TB does with `DataProvider` and `DispatchingDataProvider`, more or less - but just like `DispatchingDataProvider`, this abstraction can get kind of leaky/unwieldy. And in practice with fsspec, it has already led to confusion around how things like globbing interact with schemes and prefixes.)
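To make the "filesystem proxy" idea concrete, here's a toy sketch (all names hypothetical) of an object that resolves scheme-prefixed paths to concrete filesystem implementations:

```python
# Purely illustrative: a proxy one level above a filesystem that maps
# scheme-prefixed paths to concrete filesystem objects.
import urllib.parse


class FilesystemProxy:
    def __init__(self, filesystems):
        # e.g. {"": local_fs, "s3": s3_fs}; keys are URL schemes.
        self._filesystems = filesystems

    def _resolve(self, path):
        scheme = urllib.parse.urlparse(path).scheme
        try:
            return self._filesystems[scheme]
        except KeyError:
            raise ValueError("No filesystem registered for scheme %r" % scheme)

    def exists(self, path):
        return self._resolve(path).exists(path)

    def glob(self, pattern):
        # Globbing is where this gets leaky: the proxy has to decide whether
        # the pattern and the results carry the scheme prefix, which is
        # exactly the kind of confusion mentioned above.
        return self._resolve(pattern).glob(pattern)
```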
Furthermore, we actually have multiple "filesystem proxy" implementations in TensorBoard today:

1. TF's `tf.io.gfile` (the `tf.io.gfile.get_registered_schemes()` API was added so it's possible to dynamically find out if a scheme is supported)
2. fsspec (we can call `fsspec.get_filesystem_class()` to find out if a specific scheme is supported; some schemes require additional imports/dependencies, including S3, which requires `s3fs`)
3. our built-in stub in `tensorboard/compat/tensorflow_stub/io/gfile.py` (local disk and S3 only)

Right now we have fallback from 2 to 3, but the choice of 1 vs 2&3 is mutually exclusive. We should fix this so TF users aren't artificially limited to only the TF filesystems and can also benefit from the fsspec support. The end state we want is for individual I/O calls to be dispatched across all possible proxies. This could be thought of as a 4th filesystem proxy in its own right, the "automatic" one, that supports the union of all the concrete filesystems/schemes supported by TF, fsspec, and our built-in support. If users don't actually want automatic dispatch for some reason, we can provide a flag to specify that only a single proxy should be used (e.g. just TF).
For the automatic dispatch we'd basically have a flow roughly like this pseudocode:
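(Illustrative reconstruction only: the helper name and exact ordering below are assumptions, built on the `tf.io.gfile.get_registered_schemes()` and `fsspec.get_filesystem_class()` checks mentioned above, not TensorBoard's actual code.)

```python
import importlib.util
import urllib.parse


def _scheme(path):
    # "" for plain local paths like "/tmp/logs", "s3" for "s3://bucket/x", etc.
    return urllib.parse.urlparse(path).scheme


def pick_filesystem_proxy(path):
    """Hypothetical dispatcher: decide which proxy should handle `path`."""
    scheme = _scheme(path)

    # 1. TF's filesystem API, if TensorFlow is installed and knows the scheme
    #    (local paths are assumed to always be supported).
    if importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf

        if not scheme or scheme in tf.io.gfile.get_registered_schemes():
            return "tf"

    # 2. fsspec, if installed and it can produce a filesystem class for the
    #    scheme (this may need extra packages, e.g. s3fs for "s3").
    if importlib.util.find_spec("fsspec") is not None:
        import fsspec

        try:
            fsspec.get_filesystem_class(scheme or "file")
            return "fsspec"
        except (ImportError, ValueError):
            pass

    # 3. The built-in stub, which only handles local disk and S3.
    if scheme in ("", "s3"):
        return "stub"

    raise ValueError("No available filesystem supports scheme %r" % scheme)
```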
vadimkantorov commented on May 11, 2025
Am I right that now that #5248 is merged, we can use `tensorboard --logdir ssh://...`, making use of this fsspec plugin?

I tried it out. Seems to be working!