Add TIFF Key-Value Store Implementation (take 2) #230

Open · wants to merge 53 commits into master

Conversation


@hsidky (Author) commented Apr 17, 2025

This PR adds a new (read-only) key-value store driver for accessing tiles and strips within TIFF image files. This is my second attempt, now that I've found the time to address the earlier feedback on #138. I drew heavily on the zip kvstore implementation and your prior feedback (thanks!), so I hope things are more faithful to TensorStore patterns.

The basic idea is that this kvstore sits on top of a base kvstore and maps IFDs/tiles/strips to the following key format: `tile/<id>/<row>/<col>`.
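
For concreteness, the mapping amounts to something like this (the helper name below is illustrative, not necessarily the identifier used in the PR):

```cpp
#include <cstdint>
#include <string>

#include "absl/strings/str_format.h"

// Illustrative only: builds the key for a tile/strip at (row, col) within a
// given IFD. For strips, col is always 0.
std::string TiffChunkKey(uint32_t ifd, uint32_t row, uint32_t col) {
  return absl::StrFormat("tile/%d/%d/%d", ifd, row, col);
}
// e.g. TiffChunkKey(0, 2, 3) == "tile/0/2/3"
```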

My plan is to get feedback on whether this is something suitable to be merged. If the response is positive, I'll work on any feedback I receive in parallel with building out the tiff driver that will pair with this kvstore.

Looking forward to your feedback on this implementation!

P.S. I couldn't for the life of me get subsystem logging to work (aka ABSL_LOG_IF). Would appreciate some advice on how to get that working.

@jbms (Collaborator) commented Apr 18, 2025

This is great --- I haven't had a chance to review your implementation in detail, but at a high level this is exactly the right implementation strategy.

Do you have a reason to use this tiff kvstore driver on its own? Otherwise I'd be inclined to eliminate the JSON registration for it, such that it can only be created internally (by the to-be-written Tiff TensorStore driver).

@jbms (Collaborator) commented Apr 18, 2025

Regarding logging: what have you tried? Are you testing with bazel? With bazel you can do --test_env=TENSORSTORE_VERBOSE_LOGGING=tiff
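
The in-code side of that follows roughly the same pattern as the existing drivers (e.g. the zip kvstore), with the flag name matching what you pass in TENSORSTORE_VERBOSE_LOGGING:

```cpp
#include "absl/log/absl_log.h"
#include "tensorstore/internal/log/verbose_flag.h"

namespace {
// Enabled when TENSORSTORE_VERBOSE_LOGGING contains "tiff" (or "all").
tensorstore::internal_log::VerboseFlag tiff_logging("tiff");
}  // namespace

// At the logging site:
// ABSL_LOG_IF(INFO, tiff_logging) << "Read IFD with " << num_entries << " entries";
```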

@hsidky (Author) commented Apr 19, 2025

> This is great --- I haven't had a chance to review your implementation in detail, but at a high level this is exactly the right implementation strategy.
>
> Do you have a reason to use this tiff kvstore driver on its own? Otherwise I'd be inclined to eliminate the JSON registration for it, such that it can only be created internally (by the to-be-written Tiff TensorStore driver).

Thanks for that feedback. No, I don't think there is any reason to use the tiff kvstore on its own. I can eliminate the JSON registration.

> Regarding logging: what have you tried? Are you testing with bazel? With bazel you can do --test_env=TENSORSTORE_VERBOSE_LOGGING=tiff

Yes, I'm using Bazel, though I'm not terribly familiar with it. That works! I had previously tried just setting the environment variable, since that worked for non-subsystem logging in tests, but it wasn't working for conditional logging. Thank you.

I'm going to start chipping away at the tiff driver as I await detailed feedback on this kvstore.

@hsidky (Author) commented May 5, 2025

Hello again. I've managed to develop a driver per your suggestions and I believe I have a pretty solid foundation with a reasonable amount of support. There are some outstanding items and polish remaining, but I think it makes sense to pause here for feedback, in case major changes are required, and to understand the likelihood of getting this merged.

A few notes on the driver implementation:

  • I changed the tiff kvstore to work off of a linear index rather than an x/y grid. That made it easier to handle SamplesPerPixel > 1 but offloaded more work to GetChunkStorageKey, so I did a bit of optimization there (see the sketch after this list).
  • I haven't worked on transaction support as I haven't fully wrapped my head around the nuts and bolts of how that works yet. Is this a requirement for a merge? I saw, for example, that the single image driver does not support transactions, so I assume not.
  • Same as above but for storage stats.
  • I did not commit the test file binaries (generated by the generate.py file). I'll wait until you give me clearance.
  • I waffled a bit on how much to build out the metadata property. I ended up putting options under a separate tiff property as they mainly correspond to the interpretation of the data. I can revisit if you have strong opinions.
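
For reference, the index math involved is roughly the following (names are simplified and purely illustrative):

```cpp
#include <cstdint>

// Roughly the kind of arithmetic involved in mapping between a linear tile
// index and its IFD/row/col position, assuming tiles are ordered row-major
// within each IFD and IFDs are concatenated. Names are illustrative only.
struct TilePosition {
  uint32_t ifd, row, col;
};

inline TilePosition TileFromLinearIndex(uint64_t linear,
                                        uint64_t tiles_per_row,
                                        uint64_t tiles_per_ifd) {
  TilePosition pos;
  pos.ifd = static_cast<uint32_t>(linear / tiles_per_ifd);
  const uint64_t within_ifd = linear % tiles_per_ifd;
  pos.row = static_cast<uint32_t>(within_ifd / tiles_per_row);
  pos.col = static_cast<uint32_t>(within_ifd % tiles_per_row);
  return pos;
}
```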

I think that's it for now. Looking forward to any feedback!

Optional. Must match the length of `dimensions`. Required if `dimensions`
has more than one entry. If only one dimension is specified, `ifd_count` can
be used instead. If both are specified, their product must match `ifd_count`.
ifd_count:

Collaborator:

Why does this need to be specified --- can't we determine the number of IFDs from the file?

Author:

I suppose it doesn't. I was initially thinking about allowing only a subset of IFDs to be used, but really ifd_count would be just one of many parameters a user would need to specify to support that properly. I will remove it.

Author:

Actually, I double-checked and I see now that IFDs may be ignored in OME-XML, so I will need to add full specification functionality rather than remove this. https://docs.openmicroscopy.org/ome-model/5.6.3/ome-tiff/specification.html

description: |
Optional. Specifies the total number of IFDs involved in the stack.
Required if `dimension_sizes` is not specified for a single stack dimension.
If specified along with `dimension_sizes`, their product must match `ifd_count`.

Collaborator:

I think dimensions, dimension_sizes and ifd_sequence_order can be removed as options. Instead, there should be a single unnamed dimension as the "stack" dimension. Then a separate reshape adapter TensorStore driver (not yet implemented, but on the todo list) can be responsible for doing the equivalent of those options.

Collaborator:

(Instead of an unlabeled dimension, the stack dimension could also be assigned some default label like `stack`.)

Author:

I think the issue with that, if I understand how TensorStore works well enough, is that this becomes a problem when we are pulling that metadata from the file itself (e.g. OME-TIFF), versus having the user specify that information (which they may not know, since it's embedded in the image) via a reshape adapter.

The reason I added these here was to build out the necessary functionality and options so that I can later just read the XML data from the appropriate TIFF tag and assign it to existing parameters. I haven't added that yet because I wanted to get the base, non-OME-specific TIFF reader cleared first. Then adding the XML parsing and populating these fields becomes a trivial exercise.

If you feel differently, or would still prefer that I just assign a single stacking dimension, I can do that. But (a) I think that means a longer runway for OME-TIFF support, and (b) I don't know how, in that case, the XML metadata would work its way up to the reshape adapter.

Please advise.

the last dimension varying fastest.
required:
- dimensions
sample_dimension_label:

Collaborator:

I think this is unnecessary because the dimension labels can instead be overridden via the index transform.
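
e.g. something along these lines (sketch; `store` is a previously opened TensorStore and the labels are arbitrary):

```cpp
#include "tensorstore/index_space/dim_expression.h"

// Sketch: relabel dimensions of an already-open store via a dim expression,
// instead of a driver-specific sample_dimension_label option.
auto labeled = store | tensorstore::Dims(0, 1, 2).Label("z", "y", "x");
```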

Author:

The reason I added this is that OME-XML has dimension labels in its metadata (XYZCT). I referenced the n5 driver a lot in my development and saw that it has an axes member for dimension labels in N5Metadata. As with my previous comment, right now they are manually populated, but once the OME XML is parsed I will populate this from it.

I realize I am in some ways optimizing for OME XML while trying to make a generic solution for TIFF images. The way I had envisioned it was that these are generic options for anyone to use to craft some interpretation of their TIFF file, though I can see how it can look a bit odd. Also happy to follow your guidance on this. Just explaining my rationale.

allOf:
- type: object
properties:
dtype:

Collaborator:

This can be excluded since the data type can be specified generically via the dtype and schema properties common to all drivers.

Author:

For the metadata property (which is just dtype and shape below), I think I ultimately got confused. I was trying to square three things: (a) the concept of metadata constraints that I saw in n5, (b) coming up with a generic TIFF solution, and (c) preempting support for OME-XML. The properties that I put inside the tiff object - if we set aside OME XML - are technically not constraints. They specify different ways of interpreting the same underlying data (e.g. whether IFDs are stacked as a dimension or treated as separate images is in the mind of the user, not something the TIFF file is stipulating). It's only with OME-XML that it becomes true metadata.

I used the metadata object for properties that are not subject to interpretation, like dtype and (at least partially) shape. I realize this is quite messy. I could perhaps move the tiff object into metadata, and then it would function the same way it does in n5: all members are optional, and they are populated if found (e.g. from OME XML). But it would also serve one additional function: specifying the interpretation of the underlying data when no OME metadata is present. This is probably cleaner from a schema perspective, but it does mix up two distinct responsibilities.

Let me know what you prefer.

$ref: DataType
title: Data type constraint.
description: Constrains the expected data type of the TIFF dataset.
shape:

Collaborator:

This can likewise be excluded since it can be specified via the schema.
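
i.e. both can come in through the generic spec properties, e.g. (sketch; the "tiff" driver name and the values are placeholders):

```cpp
#include "tensorstore/spec.h"

// Sketch: dtype and shape constrained via the driver-independent spec
// properties rather than tiff-specific metadata members.
auto spec = tensorstore::Spec::FromJson({
    {"driver", "tiff"},
    {"kvstore", {{"driver", "file"}, {"path", "/tmp/example.tif"}}},
    {"dtype", "uint16"},
    {"schema", {{"domain", {{"shape", {1, 512, 512}}}}}},
});
```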

// This registry maps string IDs (like "lzw", "deflate") to factories/binders
// capable of creating JsonSpecifiedCompressor instances.
internal::JsonSpecifiedCompressor::Registry& GetTiffCompressorRegistry();

Collaborator:

It seems unlikely that we need to register tiff compressors to start with; also, the tiff tag value could map directly to a compressor instance (via an enum, or similar).
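
For illustration, something like this would do (the enum member names below are placeholders, not the ones in this PR):

```cpp
#include <cstdint>
#include <optional>

// Sketch: resolve the raw Compression tag value directly to the internal
// enum (and from there to a fixed decoder), with no string-keyed registry.
// Tag values are from the TIFF 6.0 spec; member names are placeholders.
inline std::optional<CompressionType> CompressionFromTag(uint16_t tag_value) {
  switch (tag_value) {
    case 1:     return CompressionType::kNone;      // uncompressed
    case 5:     return CompressionType::kLZW;
    case 8:     return CompressionType::kDeflate;   // Adobe deflate
    case 32773: return CompressionType::kPackBits;
    default:    return std::nullopt;                 // unsupported
  }
}
```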

};

// Common compression types
enum class CompressionType : uint16_t {

Collaborator:

I'd probably find the hex values easier to track back to the spec.
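
e.g. (values as listed in the TIFF 6.0 spec; the names are placeholders):

```cpp
// Sketch: hex literals so each entry can be matched against the spec tables.
enum class CompressionType : uint16_t {
  kNone = 0x0001,
  kLZW = 0x0005,
  kJPEG = 0x0007,
  kDeflate = 0x0008,     // Adobe deflate
  kPackBits = 0x8005,    // 32773
};
```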


namespace tensorstore {
namespace internal_tiff_kvstore {

Collaborator:

Add a reference to the spec sections here as comments.
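
e.g. something along the lines of:

```cpp
// References (TIFF 6.0 specification, Adobe, 1992):
//   Section 2:  TIFF Structure (header, IFDs, directory entries).
//   Section 8:  Baseline Field Reference Guide (tag definitions).
//   Section 15: Tiled Images (TileWidth, TileLength, TileOffsets, TileByteCounts).
```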
