Add .seq format for DE 16 and Celeritas Camera #11

Open
wants to merge 114 commits into
base: main

CSSFrancis
Member

@CSSFrancis CSSFrancis commented Aug 15, 2022

Description of the change

This adds support for reading data from the DE 16 and Celeritas cameras.

Some notes about the file format:
DE 16:

  • The DE 16 camera writes multiple files: a metadata file, a dark reference, a gain reference, and a .seq file. These files all share the same naming scheme, so the reader either collects all files in the same folder that follow that scheme or accepts the files passed directly.
  • The data is in a binary format, with each frame at some offset and a time stamp following the frame.
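
A layout like the one described above (frame data followed by a time stamp) can be memory-mapped with a NumPy structured dtype. The following is a minimal sketch on a synthetic file: `HEADER_BYTES`, `FRAME_SHAPE`, the 32-bit time stamp, and the helper names are assumptions made for illustration and are not taken from the real .seq specification.

```python
import os
import tempfile

import numpy as np

# Hypothetical layout: the real header size, frame shape, and time stamp
# encoding must come from the actual .seq header and metadata file.
HEADER_BYTES = 64
FRAME_SHAPE = (4, 4)

frame_dtype = np.dtype(
    [
        ("Array", np.uint16, FRAME_SHAPE),  # pixel data of one frame
        ("timestamp", np.uint32),  # time stamp stored right after the frame
    ]
)


def write_fake_seq(path, n_frames):
    """Write a small synthetic file that follows the assumed layout."""
    records = np.zeros(n_frames, dtype=frame_dtype)
    for i in range(n_frames):
        records["Array"][i] = i
        records["timestamp"][i] = 1000 + i
    with open(path, "wb") as f:
        f.write(b"\x00" * HEADER_BYTES)
        f.write(records.tobytes())


def read_frames(path, n_frames):
    """Memory-map frames and time stamps without loading them eagerly."""
    mapped = np.memmap(
        path, mode="r", dtype=frame_dtype, offset=HEADER_BYTES, shape=(n_frames,)
    )
    return mapped["Array"], mapped["timestamp"]


path = os.path.join(tempfile.mkdtemp(), "fake.seq")
write_fake_seq(path, n_frames=3)
frames, stamps = read_frames(path, n_frames=3)
```

Indexing the structured memory map by field name separates the image stack from the time stamps without copying the file into memory.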

Celeritas:

  • Due to the speed at which this camera reads out data, the sensor is split in two: a "top" and a "bottom" frame are read concurrently.
  • These frames are also saved in a buffer, with multiple images stored together as one long image.
    • This makes memory mapping this dataset a little harder, as there isn't a constant stream of data. I would like to add support for the distributed scheduler, but that might have to wait.
    • This buffer is saved in the XML file alongside the data. There may be a way to guess this buffer if given the XML file and the FPS of the camera.
    • The time stamp is only recorded once per buffer.
  • etc.
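
The buffering described above can be illustrated with plain NumPy: one tall buffer image is split into half-frames, and each top half is flipped and stacked on the matching bottom half (the same flip and concatenation used in the reader code reviewed further down). This is a sketch on synthetic arrays; `split_buffer`, `stitch`, and the tiny shapes are hypothetical, not part of the PR.

```python
import numpy as np


def split_buffer(buffer_image, frames_per_buffer):
    """Split one tall buffer image into its individual half-frames."""
    rows, cols = buffer_image.shape
    if rows % frames_per_buffer:
        raise ValueError("buffer height is not a multiple of the frame count")
    return buffer_image.reshape(frames_per_buffer, rows // frames_per_buffer, cols)


def stitch(top_frames, bottom_frames):
    """Flip each top half vertically and stack it above the bottom half."""
    return np.concatenate([np.flip(top_frames, axis=1), bottom_frames], axis=1)


# Synthetic buffers: 3 half-frames of 2 x 4 pixels stacked into a 6 x 4 image.
top_buffer = np.arange(24, dtype=np.uint16).reshape(6, 4)
bottom_buffer = top_buffer + 100
full_frames = stitch(split_buffer(top_buffer, 3), split_buffer(bottom_buffer, 3))
```

Because the half-frames are stored row after row, the split is a pure reshape and does not copy data.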

Progress of the PR

  • Added DE 16 support for loading
    • Add DE 16 support for saving (potentially?)
  • Added Celeritas support
  • Add support for DE 16 using the distributed scheduler
  • Add support for Celeritas using the distributed scheduler
  • update docstring (if appropriate),
  • update user guide (if appropriate),
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check formatting changelog entry in the readthedocs doc build of this PR (link in github checks)
  • add tests for basic loading
  • ready for review.

Minimal example of the bug fix or the new feature

from rsciio.de import api
api.file_reader("test.seq") # read regular .seq

api.file_reader("test_Top_.seq", celeritas=True) # read celeritas .seq

@codecov

codecov bot commented Aug 15, 2022

Codecov Report

Patch coverage: 90.16% and project coverage change: +0.20 🎉

Comparison is base (b045157) 84.95% compared to head (a340725) 85.15%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #11      +/-   ##
==========================================
+ Coverage   84.95%   85.15%   +0.20%     
==========================================
  Files          73       75       +2     
  Lines        8894     9250     +356     
  Branches     1955     2022      +67     
==========================================
+ Hits         7556     7877     +321     
- Misses        876      895      +19     
- Partials      462      478      +16     
Impacted Files Coverage Δ
rsciio/de/_api.py 89.77% <89.77%> (ø)
rsciio/utils/tools.py 80.26% <90.90%> (+6.93%) ⬆️
rsciio/de/__init__.py 100.00% <100.00%> (ø)


@CSSFrancis
Member Author

@sk1p I know we have talked about adding support for the DE Celeritas camera to LiberTEM and HyperSpy. If you have the chance, can you look over this PR? The hardest thing is dealing with the segment prebuffer for the Celeritas camera.

I wanted to add support for distributed scheduling using the scheme proposed by @uellue here, but due to the nature of the prebuffer, the data isn't evenly spaced in the binary file. This makes implementing it in a general way fairly difficult.

@sk1p
Contributor

sk1p commented Aug 15, 2022

I know we have talked about adding support for the DE Celeritas camera to LiberTEM and HyperSpy. If you have the chance, can you look over this PR? The hardest thing is dealing with the segment prebuffer for the Celeritas camera.

I can have a look - I'd also like to try this with real data, did you manage to upload some to the drop link I gave you some time ago?

In general, what is this project's stance on testing with real input data? It could be possible to publish a set of (small-ish) reference data sets on e.g. Zenodo and download those in CI runs.

I wanted to add support for distributed scheduling using the scheme proposed by @uellue here, but due to the nature of the prebuffer, the data isn't evenly spaced in the binary file. This makes implementing it in a general way fairly difficult.

Yeah - in case of uneven spacing, it's probably required to do a sparse search pass over the data, for example by reading the image headers at N positions in the whole data set, and mapping out where it can be split - if I understood you correctly. Or is the coarse structure evenly spaced, i.e. it's possible to calculate offsets to images just from their index?

Anyway, instead of just a straight mmap, there would need to be a function that decodes whatever is in the file to a numpy array. That's also something needed for quite a few other formats, e.g. FRMS6, binary MIB, ...

@CSSFrancis
Member Author

CSSFrancis commented Aug 15, 2022

I can have a look - I'd also like to try this with real data, did you manage to upload some to the drop link I gave you some time ago?

Right now the data is all hosted in the tests/de_data/celeritas_data folder. There are smallish (1-20 MB) datasets collected using a couple of different camera modes. These are probably the best data to use for testing.

In general, what is this project's stance on testing with real input data? It could be possible to publish a set of (small-ish) reference data sets on e.g. Zenodo and download those in CI runs.

We try to test with real input data as often as we can. That being said, the data is included with the package, and it might be better to host it somewhere else eventually. I was meaning to create an issue about this.

Yeah - in case of uneven spacing, it's probably required to do a sparse search pass over the data, for example by reading the image headers at N positions in the whole data set, and mapping out where it can be split - if I understood you correctly. Or is the coarse structure evenly spaced, i.e. it's possible to calculate offsets to images just from their index?

So the data is structured like this:
[Image: Seq Scheme]
So it's not quite uneven, but the images are saved in chunks. You can calculate the image offset if you know the number of images in a buffer.
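
The offset calculation mentioned here can be sketched as follows. The parameter names and the assumption that each buffer is followed by a fixed-size trailer (for example the once-per-buffer time stamp) are illustrative guesses, not the confirmed Celeritas layout.

```python
def frame_offset(index, frames_per_buffer, frame_bytes, trailer_bytes, header_bytes):
    """Return the byte offset of frame `index` in a buffered file.

    Assumed layout: a file header, then repeated buffers, each holding
    `frames_per_buffer` contiguous frames followed by `trailer_bytes`.
    """
    if index < 0:
        raise ValueError("frame index must be non-negative")
    buffer_index, frame_in_buffer = divmod(index, frames_per_buffer)
    buffer_bytes = frames_per_buffer * frame_bytes + trailer_bytes
    return header_bytes + buffer_index * buffer_bytes + frame_in_buffer * frame_bytes


# Example with made-up sizes: 4 frames per buffer, 100-byte frames,
# an 8-byte trailer per buffer, and a 64-byte file header.
offset_of_frame_5 = frame_offset(5, 4, 100, 8, 64)
```

With these made-up sizes, frame 5 sits in the second buffer at position 1, so its offset is 64 + 408 + 100 = 572 bytes.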

Anyway, instead of just a straight mmap, there would need to be a function that decodes whatever is in the file to a numpy array. That's also something needed for quite a few other formats, e.g. FRMS6, binary MIB, ...

Any examples of how you do this? Can you just create a function that maps a frame to an offset in the data and then just apply it?

@jlaehne
Contributor

jlaehne commented Aug 16, 2022

Right now the data is all hosted in the tests/de_data/celeritas_data folder. There are smallish (1-20 MB) datasets collected using a couple of different camera modes. These are probably the best data to use for testing.

The long-term idea is to host the files in the repo but exclude them from the installation, so that they would just be downloaded on demand. I don't remember the name of the package that can do this, @ericpre. But it would be a good idea to create an issue to put it on the to-do list.

rsciio/de/api.py Outdated
Comment on lines 69 to 111
def parse_xml(file):
    try:
        tree = ET.parse(file)
        xml_dict = {}
        for i in tree.iter():
            xml_dict[i.tag] = i.attrib
        # clean_xml
        for k1 in xml_dict:
            for k2 in xml_dict[k1]:
                if k2 == "Value":
                    try:
                        xml_dict[k1] = float(xml_dict[k1][k2])
                    except ValueError:
                        xml_dict[k1] = xml_dict[k1][k2]
    except FileNotFoundError:
        _logger.warning(
            msg="File " + file + " not found. Please"
            " move it to the same directory to read"
            " the metadata "
        )
        return None
    return xml_dict
Contributor

Does it make sense to have this as a generic utility function? The flattening, cleaning and conversion performed here seems to be specific to the DE metadata XML format. Any reason not to re-use convert_xml_to_dict instead?

Contributor

I see that convert_xml_to_dict doesn't cope well with human-readable XML, which can have both a .text and child nodes, so that may need fixes before it can be used.
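
For illustration, a nested conversion that tolerates elements carrying both text and child nodes could look like the sketch below. It is not the existing `convert_xml_to_dict` and not the code in this PR; the function names, the `"text"` key, and the sample tag names are made up, and the handling of `Value` attributes mirrors the conversion in `parse_xml` above only as an assumption.

```python
import xml.etree.ElementTree as ET


def _convert(value):
    """Return the value as a float when possible, else as the original string."""
    try:
        return float(value)
    except ValueError:
        return value


def element_to_dict(element):
    """Recursively convert an element, keeping attributes, text, and children."""
    # Assumed convention: a leaf with a "Value" attribute collapses to that value.
    if "Value" in element.attrib and len(element) == 0:
        return _convert(element.attrib["Value"])
    node = dict(element.attrib)
    text = (element.text or "").strip()
    if text:
        node["text"] = text  # whitespace-only text from pretty-printing is dropped
    for child in element:
        # Note: repeated sibling tags overwrite each other in this simple sketch.
        node[child.tag] = element_to_dict(child)
    return node


# Hypothetical, human-readable (indented) metadata; tag names are invented.
SAMPLE = """
<Metadata>
  <FileInfo>
    <ImageSizeX Value="256"/>
    <BufferSize Value="64"/>
    <CameraName Value="Celeritas"/>
  </FileInfo>
</Metadata>
"""
metadata = element_to_dict(ET.fromstring(SAMPLE))
```

Unlike the flattened dictionary, this keeps the hierarchy, so two sections that reuse a tag name do not collide.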

rsciio/de/api.py Outdated
ImageBitDepth: int
    The bit depth of the image. This should be 16 in most cases
TrueImageSize: int
    The size of each frame buffersin bytes. This includes the time stamp and
Contributor

Suggested change
- The size of each frame buffersin bytes. This includes the time stamp and
+ The size of each frame buffers in bytes. This includes the time stamp and

rsciio/de/api.py Outdated
Comment on lines 549 to 772
top_mapped = np.memmap(top, offset=offset, dtype=dtypes, shape=total_buffer_frames)
bottom_mapped = np.memmap(
    bottom, offset=offset, dtype=dtypes, shape=total_buffer_frames
)

if lazy:
    top_mapped = da.from_array(top_mapped)
    bottom_mapped = da.from_array(bottom_mapped)

array = np.concatenate(
    [
        np.flip(
            top_mapped["Array"].reshape(-1, *top_mapped["Array"].shape[2:]), axis=1
        ),
        bottom_mapped["Array"].reshape(-1, *bottom_mapped["Array"].shape[2:]),
    ],
    1,
)
Contributor

Any examples of how you do this? Can you just create a function that maps a frame to an offset in the data and then just apply it?

I think this was the main point you were asking about. It's more of a mapping from a chunk slice to an array. Each array chunk is created from both the top and bottom memory map, which are only created inside of the delayed function. To structure this according to the dask docs on memory mapping, it could look like this (sketch, untested):

def mmap_load_chunk(top, bottom, shape, dtype, offset, sl):
    top_map = np.memmap(top, mode='r', shape=shape, dtype=dtype, offset=offset)["Array"]
    top_flat = top_map.reshape(-1, *top_map.shape[2:])
    top_sliced = top_flat[sl]
    bottom_map = np.memmap(bottom, mode='r', shape=shape, dtype=dtype, offset=offset)["Array"]
    bottom_flat = bottom_map.reshape(-1, *bottom_map.shape[2:])
    bottom_sliced = bottom_flat[sl]
    return np.concatenate([
        np.flip(top_sliced, axis=1),
        bottom_sliced,
    ], 1)

(with mmap_dask_array adjusted accordingly)

Does this make sense?
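
To make the chunk-slice idea concrete, here is a NumPy-only sketch that generates the slices and checks that chunk-wise loading reproduces the full result. In-memory arrays stand in for the two memory maps, and `chunk_slices` and `load_chunk` are hypothetical helpers; with dask, each `load_chunk` call would presumably be wrapped in `dask.delayed` and turned into an array with `dask.array.from_delayed`, which is left out here so the example needs only NumPy.

```python
import numpy as np


def chunk_slices(n_frames, chunk_size):
    """Return slices that cover range(n_frames) in blocks of chunk_size."""
    return [
        slice(start, min(start + chunk_size, n_frames))
        for start in range(0, n_frames, chunk_size)
    ]


def load_chunk(top_flat, bottom_flat, sl):
    """Build one chunk of full frames from the flattened top and bottom stacks."""
    return np.concatenate([np.flip(top_flat[sl], axis=1), bottom_flat[sl]], axis=1)


# In-memory stand-ins for the two memory maps (7 half-frames of 2 x 3 pixels).
top = np.arange(42).reshape(7, 2, 3)
bottom = top + 1000

slices = chunk_slices(n_frames=7, chunk_size=3)
chunks = [load_chunk(top, bottom, sl) for sl in slices]
stitched = np.concatenate(chunks, axis=0)
```

Each chunk depends only on its own slice of both inputs, which is what allows the work to be distributed.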

Contributor

As the np.concatenate+np.flip operation does touch a sizable chunk of data, it may be efficient to replace it with a numba function that also inlines the application of dark_img/gain_img in addition to flipping/concatenation for cache efficiency.
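
As a reference for what such a fused kernel would compute, the sketch below writes the flipped top half and the bottom half straight into a preallocated output and applies the corrections in place. It uses plain NumPy rather than numba, and the correction formula `(raw - dark_img) * gain_img` is an assumption for illustration; the PR may apply dark and gain references differently.

```python
import numpy as np


def assemble_corrected(top, bottom, dark_img, gain_img):
    """Flip, stack, and correct half-frames without an intermediate concatenation.

    Assumes corrected = (raw - dark_img) * gain_img, applied per full frame.
    """
    n_frames, half, width = top.shape
    out = np.empty((n_frames, 2 * half, width), dtype=np.float32)
    out[:, :half] = top[:, ::-1]  # vertical flip of the top half
    out[:, half:] = bottom
    out -= dark_img  # broadcast over the frame axis
    out *= gain_img
    return out


# Synthetic data: 2 frames, half-frames of 2 x 3 pixels.
top = np.arange(12, dtype=np.uint16).reshape(2, 2, 3)
bottom = top + 50
dark_img = np.full((4, 3), 1.0, dtype=np.float32)
gain_img = np.full((4, 3), 2.0, dtype=np.float32)
corrected = assemble_corrected(top, bottom, dark_img, gain_img)
```

A jitted loop could do the same arithmetic in a single pass over the data; this version only shows the expected result and avoids the temporary concatenated array.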

…irectly point to files instead of using glob.
# Conflicts:
#	docs/supported_formats/de.rst
#	rsciio/de/specifications.yaml
#	rsciio/tests/test_de.py
#	rsciio/utils/tools.py

3 participants