-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow reading and writing compressed tiles #3992
Conversation
Also required update to TiffCompression API to get the underlying Codec instance.
This doesn't work quite yet, but gives a general idea of how the API currently works.
Needs some additional compatibility checks to make sure input data's compression type matches output compression type.
This updates the DICOM writer to enforce the same tile size for the whole dataset at the {set,get}TileSize{X,Y} level. setId and save*Bytes are not currently set up to allow different tile sizes for different images (though we could change that). If bfconvert encounters an input image whose tile size does not match the output tile size, then precompressed tiles are not used.
…recompressed-tiles
…recompressed-tiles
Seems to work in initial test with bfconvert on a JPEG-2000-compresed SVS, but may require additional changes following further testing.
…recompressed-tiles
Sometimes the assumption that compressed tiles are valid is not reliable - some SVS files seem to have zero length frames for some tiles (which is illegal TIFF IMHO, but it happens) and this causes DICOM readers to get out of sync, so the zero-length JPEG bitstream for that tile needs to be replaced with a genuine "blank" JPEG bitstream (e.g., a totally white tile synthesized with the right frame size). If not corrected this results in tearing of the image as one zooms into higher resolution. See this example: ([email protected] should have access to that bucket) PS. I have also occasionally seen in SVS files some invalid JPEG bitstreams that don't start with an SOI marker segment, but this doesn't seem to bother your converter, at least when the output is viewed in QuPath. See for example: https://storage.cloud.google.com/idc-source-icdc-glioma/GLIOMA01-i_8A0A.svs Not sure how you are handling the insertion of missing JPEG tables for those bitstreams for such frames. Apart from those issues I have not encountered other scenarios where the compressed bitstreams in the TIFF SVS) tiles do not match expectations. |
Thanks for looking at this, @dclunie.
I don't have access there, or to https://storage.cloud.google.com/idc-source-icdc-glioma/GLIOMA01-i_8A0A.svs. As with previous datasets, you will need to add individual email addresses (we can't actually log in as |
Oops ... try [email protected] now ... |
That's great, thank you - I am able to see both files now. |
Conflicting PR. Removed from build BIOFORMATS-push#69. See the console output for more details.
--conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carried out some functional testing using the standard CMU-1.svs
sample file from OpenSlide as the input and the command-line bfconvert
utility primarily.
Confirmed that using any other format than DICOM as the output e.g. TIFF, OME-TIFF fails with a meaningful exception when passing -precompressed
. The same behavior happens when selecting any other input format which does not support precompressed tile reading.
The following table reports the conversion time and DICOM filesets size when converting CMU-1.svs
into a WSI DICOM with various options
Execution time | Size | |
---|---|---|
-precompressed -noflat -compression JPEG | 21.32s | 181M |
-precompressed -noflat -compression JPEG-2000 | 16.601s | 182M |
-noflat -compression JPEG -tilex 256 -tiley 256 | 160.937s | 122M |
-noflat -compression JPEG-2000 -tilex 256 -tiley 256 | 566.435s | 549M |
A few issues noted while running and validating the conversion workflows above:
- omitting
-tilex
and-tiley
when converted without the-precompressed
fails withjava.lang.ArithmeticException: / by zero
as the tile size is reported asTile size = 0 x 0
- the DICOM file created with JPEG-2000 and pre-compression fails to open with the
showinf
utility. The utility seems to be stuck at theCalculating image offsets
step
I will run additional validation and comparative tests tomorrow using the last released version of Bio-Formats to identify whether these are regressions.
byte[] buf = getTile(reader, writer.getResolution(), index, | ||
xCoordinate, yCoordinate, width, height); | ||
|
||
// if we asked for precompressed tiles, but that wasn't possible, | ||
// then log that decompression/recompression happened | ||
// TODO: decide if an exception is better here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least in my testing of the CMU-1.svs
sample file from OpenSlide, there was only plane where this was an issue. I don't think throwing an exception is a better alternative since it might significantly restrict the usage of this feature and might happen quite late in the conversion as well which is frustrating for the end-user.
Additionally, there is still some significant gain associated to using precompressed tiles for the highest resolutions even if the lowest resolutions are decompressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the risk of adding too much complexity, I could see having an option (flag or -option
) to also provide whichever of warning or error is not the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now having it logged is a better option than throwing the exception
// TODO: decide if an exception is better here? | ||
if (precompressed && !tryPrecompressed) { | ||
LOGGER.warn("Decompressed tile: series={}, resolution={}, x={}, y={}", | ||
writer.getSeries(), writer.getResolution(), x, y); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning message is currently written for every single tile of the resolution but it applies to the entire resolution. Would it make sense to compute and display it outside the loop?
Otherwise we might want to align the display with the rest of the utility as this reads a bit confusing if you don't know what is happening
Tile size = 256 x 256
Decompressed tile: series=0, resolution=3, x=0, y=0
Decompressed tile: series=0, resolution=3, x=1, y=0
Decompressed tile: series=0, resolution=3, x=2, y=0
Decompressed tile: series=0, resolution=3, x=3, y=0
Decompressed tile: series=0, resolution=3, x=0, y=1
Decompressed tile: series=0, resolution=3, x=1, y=1
Decompressed tile: series=0, resolution=3, x=2, y=1
Decompressed tile: series=0, resolution=3, x=3, y=1
Decompressed tile: series=0, resolution=3, x=0, y=2
Decompressed tile: series=0, resolution=3, x=1, y=2
Decompressed tile: series=0, resolution=3, x=2, y=2
Decompressed tile: series=0, resolution=3, x=3, y=2
Series 0: converted 1/1 planes (100%)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the whole resolution needs to be decompressed/recompressed, this should now only warn once. If some of the tiles in the resolution work with -precompressed
and some do not, it will still warn once for each tile that doesn't work. Mostly I'd expect that to happen with right and bottom edge tiles, which could be smaller and need to be padded to the correct size.
boolean sameTileWidth = reader.getOptimalTileWidth() == writer.getTileSizeX(); | ||
boolean sameTileHeight = reader.getOptimalTileHeight() == writer.getTileSizeY(); | ||
|
||
// TODO: this needs to check for compatible compression options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would these be additional arguments to the signature? Or is there some mechanism to introspect the compression of a given resolution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
30f1c3d should address this, by allowing the Codec
(not string compression type) to be retrieved from the writer. If the reader and writer Codec
are not instances of the same class, then using precompressed tiles is not allowed. This should solve the issues with -precompressed -compression JPEG-2000
with CMU-1.svs
and -precompressed -compression JPEG
with JPEG-2000 data.
Conflicting PR. Removed from build BIOFORMATS-push#689. See the console output for more details.
--conflicts |
Retested conversion of the same CMU-1.svs file with the latest released Bio-Formats (7.0.1). Omitting
This error comes from the work introduced in #4097. However passing explicitly |
@@ -384,6 +388,10 @@ private void printUsage() { | |||
" -no-sequential: do not assume that planes are written in sequential order", | |||
" -swap: override the default input dimension order; argument is f.i. XYCTZ", | |||
" -fill: byte value to use for undefined pixels (0-255)", | |||
" -precompressed: transfer compressed bytes from input dataset directly to output.", | |||
" Most input and output formats do not support this option.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two quick brainstormy thoughts:
- I could see an option to
formatlist
that would print out a table of which of these interfaces is supported by each format. - It would be interesting to see (OME-)TIFF as another supported writer format for this to have an N>2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree these would be nice targets for future work, but are well outside the scope of this PR and not realistic for 7.1.0. OK to open new issues for each, so that we can schedule for future releases?
Conflicting PR. Removed from build BIOFORMATS-push#690. See the console output for more details.
--conflicts |
Conflicting PR. Removed from build BIOFORMATS-push#691. See the console output for more details.
--conflicts |
Conflicting PR. Removed from build BIOFORMATS-push#692. See the console output for more details.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through a more comprehensive testing using the following sample files
- the CC0
CMU-1.svs
sample mentioned above where all original IFDs use JPEG compression except for the macro image which is LZW-compressed - the CC-BY 77917.svs sample file where all IFDs associated with the whole slide image use JPEG-2000 compression, the macro image uses LZW compression and the label image uses JPEG compression
Conversion was performed: - using Bio-Formats 7.0.1 with JPEG and JPEG-2000 compression (
-noflat -compression JPEG -tilex 256 -tiley 256
and-noflat -compression JPEG-2000 -tilex 256 -tiley 256
) - using Bio-Formats 7.1.0-SNAPSHOT with this PR included with JPEG and JPEG-2000 compression (
-noflat -compression JPEG -tilex 256 -tiley 256
and-noflat -compression JPEG-2000 -tilex 256 -tiley 256
) - using Bio-Formats 7.1.0-SNAPSHOT with this PR included with JPEG and JPEG-2000 compression using pre-compressed tiles (
-precompressed -noflat -compression JPEG
and-precompressed -noflat -compression JPEG-2000
)
The result of the conversion is output below
Source | Bio-Formats | Compression | Pre-compressed | Execution time | Size | setId time |
---|---|---|---|---|---|---|
CMU-1.svs | 7.0.1 | JPEG | no | 237.062 | 109M | 1.613s |
CMU-1.svs | 7.0.1 | JPEG-2000 | no | 735.462s | 539M | 4.749s |
77917.svs | 7.0.1 | JPEG | no | 1279.184s | 815M | 11.341s |
77917.svs | 7.0.1 | JPEG-2000 | no | 3248.119s | 5.0G | 39.501s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG | no | 238.611s | 109M | 1.622s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG-2000 | no | 683.246s | 539M | 4.637s |
77917.svs | 7.1.0-SNAPSHOT | JPEG | no | 1232.993s | 815M | 7.805s |
77917.svs | 7.1.0-SNAPSHOT | JPEG-2000 | no | 3084.838s | 5.0G | 36.061s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG | yes | 85.187s | 176M | 2.352s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG-2000 | yes | 87.122s | 178M | N/A |
77917.svs | 7.1.0-SNAPSHOT | JPEG | yes | 264.604s | 673M | N/A |
77917.svs | 7.1.0-SNAPSHOT | JPEG-2000 | yes | 257.726s | 674M | 4.733s |
All datasets include valid TIFF tag as per the work of #3911 The DICOM datasets listed with N/A in the initialization time above hang at the Calculating image offsets
stage. Running tiffinfo
also displays a warning
$ tiffinfo 7.1.0-SNAPSHOT/jpeg_precompressed/77917/77917_0_0.dcm
JPEGFixupTagsSubsampling: Warning, Unable to auto-correct subsampling values, likely corrupt JPEG compressed data in first strip/tile; auto-correcting skipped.
TIFF Directory at offset 0x2424ab62 (606382946)
Image Width: 96999 Image Length: 45667
Tile Width: 256 Tile Length: 256
Resolution: 39525.7, 39525.7 pixels/cm
Bits/Sample: 8
Sample Format: unsigned integer
Compression Scheme: JPEG
Photometric Interpretation: YCbCr
Samples/Pixel: 3
Planar Configuration: single image plane
Software: OME Bio-Formats 7.1.0-SNAPSHOT
All valid datasets were imported into OMERO alongside the originals for visual validation
Overall, a large portion of the logic works as expected and the precompressed transfer results in a noticeable speed-up of the conversion time in some scenarios.
From an API perspective, I think the API additions make sense. Essentially, the openBytes/saveBytes
APIs of the readers/writers are supplemented by openCompressedBytes/saveCompressedBytes
equivalent as well as a few utility functions through two new top-level interfaces. The usage of Java 8+ default implementation means the changes should remain backwards compatible.
The primary issue that came up from the functional testing happens when using the -precompressed
flag and mismatching compression schemes between the source and the target. Ideally this scenario should be detected and fall back to decompressing/recompressing the tiles as done for mismatching compressions e.g. LZW -> JPEG.
In addition to the above:
- the tile size seems to be reported as
0x0
in the command-line utility - the tile size needs to be specified as an input but this seems to have been the case before this PR
- the file size of the DICOM datasets created with pre-compressed tiles is consistent with the original files but differs from the files created with decompression/recompression. I assume this is mostly a reflection of the difference in different codec implementations?
Adding one final usability/RFE thought which might be outside the scope of the original scoping of this work. The conversion of 77917.svs
provides an interesting example as the original file (SVS) uses several compression schemes with a subset of them being available in the target format (DICOM). For an optimal conversion, one could almost imagine a more adaptive scenario where each series would be copied using either the original compression if the target format supports it or fallback to the compression specified by -compression xxx
if the target format does not support it.
Thanks for the detailed write-up, @sbesson.
As noted above, 30f1c3d should fix this.
I think that's fixed with 2b1d661 and 3b63973, but let me know if you still see a problem when re-testing.
Codec implementation and quality settings are going to be a factor; in this case, channel interleaving (or lack thereof) may be the biggest difference.
Definitely not comfortable with trying to do that for 7.1.0, but an issue to look at that in a future release would be completely fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the last commit, -tilex 256 -tiley 256
can now be ommitted from the conversion commands (both with and without -precompressed
) with this PR included and the conversion numbers look as follows:
Source | Bio-Formats | Compression | Pre-compressed | Execution time | Size | setId time |
---|---|---|---|---|---|---|
CMU-1.svs | 7.0.1 | JPEG | no | 238.378s | 109M | 1.613s |
CMU-1.svs | 7.0.1 | JPEG-2000 | no | 665.698s | 539M | 4.749s |
77917.svs | 7.0.1 | JPEG | no | 1252.47s | 815M | 11.341s |
77917.svs | 7.0.1 | JPEG-2000 | no | 3097.387s | 5.0G | 39.501s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG | no | 239.225s | 109M | 1.516s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG-2000 | no | 709.188s | 539M | 5.152s |
77917.svs | 7.1.0-SNAPSHOT | JPEG | no | 1279.364s | 815M | 9.395s |
77917.svs | 7.1.0-SNAPSHOT | JPEG-2000 | no | 3191.433s | 5.0G | 43.525s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG | yes | 85.224s | 176M | 2.422s |
CMU-1.svs | 7.1.0-SNAPSHOT | JPEG-2000 | yes | 684.453s | 539M | 5.892s |
77917.svs | 7.1.0-SNAPSHOT | JPEG | yes | 1247.511s | 815M | 11.32s |
77917.svs | 7.1.0-SNAPSHOT | JPEG-2000 | yes | 259.626s | 674M | 10.589s |
All generated DICOM datasets can now successfully be loaded into OMERO and visually validated.
The conversion log only displays a single warning per resolution
[Aperio SVS] -> 7.1.0-SNAPSHOT/jpeg_precompressed/77917/77917.dcm [DICOM]
More than 4GB of pixel data, compression will need to be used
Tile size = 256 x 256
Decompressing resolution: series=0, resolution=0
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Decompressing resolution: series=0, resolution=1
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Decompressing resolution: series=0, resolution=2
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Decompressing resolution: series=0, resolution=3
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Decompressing resolution: series=0, resolution=4
Series 0: converted 1/1 planes (100%)
Decompressed tile: series=1, resolution=0, x=0, y=0
Series 1: converted 1/1 planes (100%)
Decompressed tile: series=2, resolution=0, x=0, y=0
Series 2: converted 1/1 planes (100%)
[done]
1247.511s elapsed (43.714287+177604.58ms per plane, 719ms overhead)
...
[Aperio SVS] -> 7.1.0-SNAPSHOT/j2k_precompressed/77917/77917.dcm [DICOM]
More than 4GB of pixel data, compression will need to be used
Tile size = 256 x 256
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Series 0: converted 1/1 planes (100%)
Tile size = 256 x 256
Decompressing resolution: series=0, resolution=4
Series 0: converted 1/1 planes (100%)
Decompressed tile: series=1, resolution=0, x=0, y=0
Series 1: converted 1/1 planes (100%)
Decompressed tile: series=2, resolution=0, x=0, y=0
Series 2: converted 1/1 planes (100%)
[done]
259.626s elapsed (29.428572+36507.715ms per plane, 613ms overhead)
Also confirmed that combining any of the incompatible bfconvert
options throw an informative error as expected.
Also successfully executed the two example under formats-gpl/utils
usign CMU-1.svs
as the input file.
Barring the minor documentation issue found in bfconvert
, these changes look ready to merge based on the functional review above.
Following up on the conversation in #3992 (comment), I would propose the following progression:
-
assuming approval from @dgault and @joshmoore, include these changes i.e. new APIs + reader implementation + writer implementation in preparation of the upcoming 7.1.0 release
-
review the Bio-Formats developer documentation to also include these APIs when developers new readers/writers
-
capture the various RFEs that appear as part of the testing as issues and discuss their prioritisation in upcoming patch/minor releases at upcoming Formats meeting:
- discovery of the format with compressed tile reading/writing support
- extension to other reader/writers. Immediate candidates from my side might be:
OMETIFF{Reader,Writer}
and possiblyFakeReader
to facilitate the testing - depending on the above, possibly migration of the examples of the API to https://github.com/ome/bio-formats-examples where they are regularly executed
- add support for writing series with different compressions and use the source codec when possible
-
also include these new compressed reading/writing API as part of the investigation in Chunk API: Add new API and functionality for reading and writing chunks #3794
👍 to the roadmap outlined in #3992 (review) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im happy enough with all of the new API additions here. The reader and writer implementations also look fine from my side.
I did some manual sanity testing similar to Seb's more comprehensive tests, using our public sample SVS files (https://downloads.openmicroscopy.org/images/SVS/) along with the same OpenSlide files as Seb had used (https://openslide.cs.cmu.edu/download/openslide-testdata/Aperio/). Using bfconvert with the new option and the execution times with the new precompressed option were much faster as expected. The resulting files were also valid and open and displayed correctly.
The PrecompressedConversionExample The new preCompressedExample also looks good and runs as expected with a suitable SVS input file. The compressed tile times look to faster and after decompression the lengths match that of the rawTile. These may be better suited in the examples repo but that can happen at a later date.
Agreed with the conclusion of #3992 (review) that the documentation will need reviewed and updated to cover this functionality and the options for ImageConverter. But overall this looks good for inclusion with 7.1.0
byte[] buf = getTile(reader, writer.getResolution(), index, | ||
xCoordinate, yCoordinate, width, height); | ||
|
||
// if we asked for precompressed tiles, but that wasn't possible, | ||
// then log that decompression/recompression happened | ||
// TODO: decide if an exception is better here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now having it logged is a better option than throwing the exception
Follow-up issues for future discussion: |
Includes changes in #3911, for easier testing. https://github.com/melissalinkert/bioformats/compare/dicom-dual-personality...melissalinkert:bioformats:precompressed-tiles?expand=1 shows difference between what is here and what is in #3911.
This adds interfaces (
ICompressedTileReader
,ICompressedTileWriter
) and limited implementation (SVSReader
,DicomWriter
) so that compressed data stored in an input dataset can be directly transferred to an output dataset with decompressing and recompressing. The-precompressed
option inbfconvert
turns this feature on, when supported by the input and output formats.components/formats-gpl/utils/PrecompressedExample.java
andcomponents/formats-gpl/utils/PrecompressedConversionExample.java
are also simpler examples of how the API is meant to work.Most of the reader wrappers explicitly prohibit use of precompressed tiles, as most perform operations on the pixel data.
Right now, the only supported input format is SVS. I have only tested with JPEG input data so far, JPEG-2000 is next. The only supported output format is DICOM, with sequential writing (
TILED_FULL
) and JPEG compression. Attempting to convert any other combination of formats using precompressed tiles should throw an exception.I would expect both of these commands to produce a readable DICOM dataset:
bfconvert -noflat -precompressed -compression JPEG CMU-1.svs test.dcm
javac PrecompressedConversionExample.java && java -mx1g PrecompressedConversionExample CMU-1.svs test-example.dcm
More testing with a wider range of SVS datasets is required, and on my to-do list along with JPEG-2000 support as noted above.
There are a number of rough edges and places for improvement, for which opinions or suggestions are welcome:
FormatTools.canUsePrecompressedTiles(...)
andDicomWriter.saveCompressedBytes(...)
.ICompressedTileReader.openCompressedBytes(...)
takes XY indexes in terms of the tile column and row index - not pixel XY coordinates. This was meant to indicate that the reader's native tile layout has to be respected, and an arbitrary region of the image can't be requested. This is inconsistent with all other tile handling methods inIFormatReader
andIFormatWriter
, though.bfconvert
and the examples should switch between precompressed and "normal" tile conversion as needed (provided the reader/writer allow for precompressed tiles in the first place), but there is definitely room for improvement.DicomWriter.saveCompressedBytes(...)
has a note about this, as label/macro/other extra images aren't meant to be saved tile-wise.DicomWriter.setTileSizeX(int)
also has a note about this, as the tile size can only be set once for the whole dataset right now. We probably want to be able to pick different tile sizes for each pyramid resolution.bfconvert -precompressed ...
should warn if a particular image or tile was converted using normal decompression/recompressionsaveCompressedBytes(...)
match what the writer expects, both in terms of compression type and dimensions. I don't know of a great way to defend against storing tiles that are the wrong size, JPEG-2000 when they should be JPEG, etc. There is also a potential issue of the last row and column being a bit weird - if the reader handles tile padding differently from how the writer expects it, that could result in converted images that look wrong.As the draft status suggests, this is not ready for merging. I expect to add more commits here based on the notes above and any comments.
/cc @fedorov, @dclunie for feedback