Add option to return Reader for PixelData element #224

ematvey · 2021-12-02T16:30:38Z

We currently are using DCMTK to

read certain meta tags;
write anonymised image by replacing PatientBirthDate and PatientName with non-identifiable value.

Current issue is 1+ Gb DICOMs which encode a large sequence of frames. It causes an unreasonable amount of CPU and RAM to read PixelDatas from such DICOMs, simply to write them to an anonymised file without any processing.

This library could be a great fit for anonymisation workflow if, for example, it would allow to It would be great if this library supported returning an io.Reader for the whole PixelData element, so that non-image elements could be interpreted but image elements could be streamed to output file without change.

The text was updated successfully, but these errors were encountered:

suyashkumar · 2021-12-08T23:51:35Z

Interesting use case! I'm not sure if we could do this safely without leaving the DICOM file/input io.Reader open. E.g. you'd call parse, and under this option parse would finish, but one of the elemnts would have an io.Reader to read in the whole pixeldata? Maybe this could be done on demand (e.g. with a file offset stored in that element, so when you read from the reader, we reopen the file and start fetching the reader the PixelData bytes only--this gets complex when input isn't a file, but a generic io.Reader, which is what the current API is)?

Or maybe with this option, the input io.Reader is open till the user exhausts the PixelData io.Reader?

I think it could be workable, but a bit complex with some weird edge cases I'll need to think about.

One far out there option could be (under this special hypothetical option) to allow the user to register a special processing function for PixelData (or maybe any tag). Maybe something like

RegisterProcessingOption(tag.PixelData, func(r dicomio.Reader) error {
    // You could have a closure over your output file handle here I suppose...but would still be a little weird for your usecase. 
})

Not sure if I love this though, and I don't think we could allow it for just any tag, as some elements depend on prior elements to parse correctly.

Another option that would save you the CPU (but not the memory) is an option that just copies the PixelData bytes into memory without any processing at all, and you could grab it from that special tag during your workflow.

Also, note that this library does support an element by element API with ParseNext() (and a streaming API for PixelData Frames), but still maintains a history of all elements read so far in memory (unfortunately, this is necessary in some cases to "lookback" at previous elements, namely for PixelData and I think some Sequence parsing). It's possible this could be revisited carefully in a "memory light" option / case, but I'll need to think more about what that would look like.

Finally, and I'm sure you know this already, but if you're doing a true anonymization workflow, you probably would want to look at the images, because sometimes identifiable patient information can be in the images.

suyashkumar · 2021-12-09T00:05:17Z

Feel free to suggest an API if you had something specific in mind. Sounds like you were looking for something like I initially mentioned--some API where the returned element from Parser.Next() has an io.Reader to the PixelData you can use to dump the data to your output file. As mentioned, closing the input correctly in a system like this would be something to think about. In situations where the element-by-element Parser API is used directly, I think we could just leave that up to the user. There are some edge cases though, like what if some other thread advances the input reader the PixelData reader was supposed to read from, etc so I'm not sure how safe this is.

Probably easiest thing to do for your case in short term is:

at least save some CPU via an option to not parse the PixelData and just read the whole thing into an element buffer
explore a memory-light option where we don't hold the full dataset in memory, and only hold the tags that we need to possibly refer back to later for parsing downstream elements. If the user wanted, after calling Parse.Next() they could throw away the element, and move onto the next. (This is something I was thinking about anyway in the past as an optimization).
and as mentioned above, if you're doing a sensitive anonymization workflow, you should probably also consider looking at the images as some do have identifiable information "burned in" to the images.

suyashkumar · 2021-12-17T18:01:53Z

I saw there was a PR I think that may be related to this, and put some more thoughts there as well: #223 (review)

suyashkumar · 2023-02-20T17:55:39Z

We did introduce a SkipPixelData option, which skips the PixelData entirely (doesn't even load it into memory).

However, for a use case like this where some tags are modified and others are to be written back identically, we can maybe introduce another option like dicom.NoParseValue(tag1, tag2, tag3) which will still read in the data into the Dataset, but will try to do as little parsing/processing of the value as possible to save CPU. In some cases it may be trickier to not do any parsing at all (e.g. in nested sequences etc), but at least for Native PixelData there would be a speed up.

We can start by implementing support for PixelData first, and then go from there to possibly supporting other tags...

WDYT?

suyashkumar · 2023-02-20T19:25:51Z

Hi there! I have a draft PR #256 that addresses this to some degree--it skips the processing of NativePixelData in favor of just blindly loading the data into memory (so that it can be roundtripped out later).

Ultimately I think the right long term solution might be to instead have an option that indicates only certain elements should be processed, or a lazy load type of situation. However, for now, it seems like this is really most important for PixelData so I figure it is good to start there.

…rving roundtrip read/write (#256) This addresses _some_ of the discussion in #224. The SkipProcessingPixelDataValue option causes the PixelData bytes to be read into the element, but not processed any further. This option provides an option to save the CPU cycles processing NativePixel data but still gives a roundtrip-able Dataset. If you want to save both CPU and Memory, you can instead use the SkipPixelData option, which doesn't load the PixelData bytes into memory at all. In the future, we should consider an option to lazy load/process all Element values if possible.

suyashkumar mentioned this issue Dec 17, 2021

Consider Optimization to hold onto only essential elements in Parser #227

Open

suyashkumar mentioned this issue Feb 20, 2023

Add option to skip Element Value processing of PixelData, while preserving roundtrip read/write #256

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to return Reader for PixelData element #224

Add option to return Reader for PixelData element #224

ematvey commented Dec 2, 2021

suyashkumar commented Dec 8, 2021 •

edited

Loading

suyashkumar commented Dec 9, 2021 •

edited

Loading

suyashkumar commented Dec 17, 2021

suyashkumar commented Feb 20, 2023 •

edited

Loading

suyashkumar commented Feb 20, 2023

Add option to return Reader for PixelData element #224

Add option to return Reader for PixelData element #224

Comments

ematvey commented Dec 2, 2021

suyashkumar commented Dec 8, 2021 • edited Loading

suyashkumar commented Dec 9, 2021 • edited Loading

suyashkumar commented Dec 17, 2021

suyashkumar commented Feb 20, 2023 • edited Loading

suyashkumar commented Feb 20, 2023

suyashkumar commented Dec 8, 2021 •

edited

Loading

suyashkumar commented Dec 9, 2021 •

edited

Loading

suyashkumar commented Feb 20, 2023 •

edited

Loading