Crashes reading a large file #71

Closed
jonwright opened this issue Nov 9, 2021 · 6 comments

@jonwright

I am assuming this is the project behind the wonderful thing I found yesterday that lets me browse hdf5 files in jupyterlab? It looks fantastic. I wish I could figure out how to select the x and y axes for a plot; I always see data versus point number. The rest of this message is a bug report on how I seem to have broken something already (sorry!):

Describe the bug

jupyterlab crashes when reading a large dataset, perhaps an out of memory error?

To Reproduce

1 - Log into jupyter-slurm.esrf.fr with one single core and the lab interface
2 - Navigate to open : /data/id11/nanoscope/blc12407/id11/CeO2_38keV/CeO2_38keV_CeO2_rotation/CeO2_38keV_CeO2_rotation.h5
3 - open dataset /1.1/measurement/eiger : it displays
4 - open dataset /1.1/measurement/fpico6 : it displays
5 - go back to /1.1/measurement/eiger : jupyterlab stops running
6 - all the other tabs and kernels appear to exit when jupyterlab fails

Expected behaviour

In the worst case, a plugin would crash without taking down all of the other kernels. Ideally it would not crash.

Is there a way to use hdf5 slice operations (maybe combined with fast histograms) so you only hold in memory what is going to be displayed on the screen (e.g. maximum data is a 2D image)? Then libhdf5 should manage the memory cache in some sensible way.

Context

  • OS: ubuntu20.04
  • Browser: Chrome
  • Version: 94.0.4606.81
  • JupyterLab version: 2.3.1 (ESRF slurm installation)
Extension lists: this is based on a bit of guesswork as to what is actually running when I use jupyter-slurm:
jupyter-slurm:~ % /scisoft/users/jupyter/jupy38ubuntu/bin/jupyter labextension list
JupyterLab v2.3.1
Known labextensions:
   app dir: /home/esrf/jupyter/jupy38ubuntu/share/jupyter/lab
        @jupyter-widgets/jupyterlab-manager v2.0.0  enabled  OK
        jupyter-matplotlib v0.7.4  enabled  OK
        jupyter-threejs v2.2.0  enabled  OK
        jupyterlab-datawidgets v6.3.0  enabled  OK
        jupyterlab-h5web v0.0.10  enabled  OK
        k3d v2.9.3  enabled  OK
jupyter-slurm:~ % /scisoft/users/jupyter/jupy38ubuntu/bin/jupyter serverextension list
config dir: /home/esrf/jupyter/jupy38ubuntu/etc/jupyter
    jupyterlab_h5web  enabled 
    - Validating...
      jupyterlab_h5web  OK
    jupyterlab  enabled 
    - Validating...
      jupyterlab 2.3.1 OK
    jupyterlab_hdf  enabled 
    - Validating...
      jupyterlab_hdf 0.5.1 OK
    jupyter_nbextensions_configurator  enabled 
    - Validating...
      jupyter_nbextensions_configurator 0.4.1 OK
@loichuder
Member

loichuder commented Nov 9, 2021

Hello Jon, thanks for trying the extension and for the feedback!

Axis selection

I wish I could figure out how to select x and y axes for a plot?

Well, h5web is a "dumb" viewer: it only displays visualizations that correspond to the content of the file. It is not meant to be a visualization tool.
The only way to select x and y axes for a plot is to use an NXData group with an axes attribute, as h5web supports the NeXus standard.
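As a rough illustration of what such a group looks like on disk, here is a minimal h5py sketch that writes y versus x following the NeXus NXdata convention (NX_class, signal and axes attributes). The helper name write_nxdata, the group names entry and plot, and the dataset names x and y are assumptions made for this example, not names taken from the file in this issue.

```python
import os
import tempfile

import h5py
import numpy as np


def write_nxdata(h5file, x, y, entry_name="entry", data_name="plot"):
    """Write y versus x as a NeXus NXdata group (hypothetical helper)."""
    # "default" attributes tell a NeXus reader which group to plot first.
    h5file.attrs["default"] = entry_name
    entry = h5file.create_group(entry_name)
    entry.attrs["NX_class"] = "NXentry"
    entry.attrs["default"] = data_name

    nxdata = entry.create_group(data_name)
    nxdata.attrs["NX_class"] = "NXdata"
    nxdata.attrs["signal"] = "y"  # dataset plotted on the ordinate
    # One axis name per signal dimension, stored as variable-length strings.
    nxdata.attrs["axes"] = np.array(["x"], dtype=h5py.string_dtype())
    nxdata.create_dataset("x", data=x)
    nxdata.create_dataset("y", data=y)
    return nxdata


# Example data, written to a temporary directory.
x = np.linspace(0.0, 10.0, 101)
path = os.path.join(tempfile.mkdtemp(), "nxdata_example.h5")
with h5py.File(path, "w") as f:
    write_nxdata(f, x, np.sin(x))
```

A NeXus-aware viewer opening this file should then be able to plot y against x rather than against the point number.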

Reasons of the crash when reading a large dataset

This is due to a limitation in the Line visualisation: we have a feature (auto-scale off) where the axis limits are set to the limits of the full dataset. As a consequence, when using the Line, h5web fetches the full dataset. In this case, I believe this is around 256 GB (:scream:) making the whole Jupyter server crash. I still need to investigate the exact reason.
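To make the size argument concrete, the sketch below estimates how much memory a dataset would need if it were loaded in full, using only its shape and dtype (no pixel data is read). The helper names, the 64 MiB budget and the demo shape are assumptions for illustration; they are not the shape of the eiger dataset from this issue, and this is not how h5web itself decides what to fetch.

```python
import h5py
import numpy as np


def dataset_nbytes(dataset):
    """Uncompressed size in bytes of an h5py dataset, from metadata only."""
    n_elements = int(np.prod(dataset.shape, dtype=np.int64))
    return n_elements * dataset.dtype.itemsize


def fits_in_budget(dataset, budget_bytes):
    """True if loading the whole dataset stays within budget_bytes."""
    return dataset_nbytes(dataset) <= budget_bytes


# In-memory demo file: the chunked dataset is declared but never filled,
# so no storage is allocated for its frames.
with h5py.File("size_demo.h5", "w", driver="core", backing_store=False) as f:
    stack = f.create_dataset(
        "stack", shape=(100, 512, 512), dtype="uint32", chunks=(1, 512, 512)
    )
    total = dataset_nbytes(stack)
    one_frame = total // stack.shape[0]
    ok = fits_in_budget(stack, budget_bytes=64 * 1024**2)
    print(f"full: {total / 1024**2:.0f} MiB, frame: {one_frame / 1024**2:.0f} MiB, load all: {ok}")
```

A check of this kind, done before any read, is one way a server could refuse or downscale a request instead of running out of memory.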

Note that the Heatmap does not suffer from this limitation: it only fetches the slice. This is why the first display of /1.1/measurement/eiger works. It is the switch to the 1D dataset /1.1/measurement/fpico6 that makes h5web switch to the Line visualisation when coming back to /1.1/measurement/eiger.

What is next, then?

Is there a way to use hdf5 slice operations (maybe combined with fast histograms) so you only hold in memory what is going to be displayed on the screen (e.g. maximum data is a 2D image)?

It would indeed make sense to fetch only the slice even for a Line visualization. The auto-scale feature is a serious limitation for large datasets and we need to work around it somehow.
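For illustration only, here is one way the two needs could be reconciled with h5py: read just the 1D selection that is displayed, and compute the full-dataset limits by streaming one frame at a time so that memory use stays bounded. The function names and the demo data are assumptions for this sketch, and it is not the fix tracked in silx-kit/h5web#616. Note that the streamed pass still reads every frame from disk once; only the peak memory is reduced.

```python
import h5py
import numpy as np


def read_line(dataset, frame, row):
    """Read one row of one frame; only this 1D selection is loaded."""
    return dataset[frame, row, :]


def streamed_limits(dataset):
    """Min and max of a 3D dataset, holding a single frame in memory."""
    lo, hi = np.inf, -np.inf
    for index in range(dataset.shape[0]):
        frame = dataset[index]  # reads one 2D frame from the file
        lo = min(lo, float(frame.min()))
        hi = max(hi, float(frame.max()))
    return lo, hi


# Small random stack in an in-memory file, one chunk per frame.
rng = np.random.default_rng(0)
values = rng.integers(0, 1000, size=(8, 16, 16)).astype("uint32")
with h5py.File("limits_demo.h5", "w", driver="core", backing_store=False) as f:
    stack = f.create_dataset("stack", data=values, chunks=(1, 16, 16))
    line = read_line(stack, frame=3, row=5)
    limits = streamed_limits(stack)
```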

We have an issue in h5web where we track our ideas and improvements to fetch large datasets: silx-kit/h5web#616. The discussion about the auto-scale will surely continue there and any implementation fixing the crash will be mentioned there.

In the meantime, use the Heatmap? 😅

@andygotz

@jonwright thanks for the +ve feedback.

@loichuder thanks for the explanations. It seems like we are missing a tool for flexible viewing of NeXus files, i.e. selecting what to display against what. Am I right to say that users have to build their own tool with a mixture of h5py and matplotlib for now? Does Braggy address this?

@axelboc
Contributor

axelboc commented Nov 10, 2021

This is outside of the scope of Braggy, for sure. It's always possible to make a new GUI, but note that a solution to this problem is to generate a NeXus-compliant HDF5 file with external links to the relevant datasets, and then open this file in H5Web. Obviously not as practical as a GUI, but we could easily provide Python utilities to make generating this sort of file a breeze (perhaps these utilities already exist, even).
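A minimal sketch of that idea with h5py, assuming the raw data live in a separate file: the small "view" file contains only an NXdata group whose signal and axis are external links, so nothing is copied. The helper name link_as_nxdata, the file names and the dataset paths /scan/counts and /scan/motor are all hypothetical and stand in for whatever the real acquisition file contains.

```python
import os
import tempfile

import h5py
import numpy as np


def link_as_nxdata(nexus_path, source_path, signal_path, axis_path):
    """Create a NeXus file whose NXdata points at datasets in another file."""
    with h5py.File(nexus_path, "w") as f:
        f.attrs["default"] = "entry"
        entry = f.create_group("entry")
        entry.attrs["NX_class"] = "NXentry"
        entry.attrs["default"] = "plot"

        nxdata = entry.create_group("plot")
        nxdata.attrs["NX_class"] = "NXdata"
        nxdata.attrs["signal"] = "y"
        nxdata.attrs["axes"] = np.array(["x"], dtype=h5py.string_dtype())
        # External links: the data stay in source_path and are not copied.
        nxdata["y"] = h5py.ExternalLink(source_path, signal_path)
        nxdata["x"] = h5py.ExternalLink(source_path, axis_path)


# Stand-in for an acquisition file (hypothetical layout).
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "raw_scan.h5")
nexus_path = os.path.join(workdir, "view.h5")
motor = np.linspace(0.0, 1.0, 11)
counts = motor**2
with h5py.File(source_path, "w") as f:
    f.create_dataset("scan/motor", data=motor)
    f.create_dataset("scan/counts", data=counts)

link_as_nxdata(nexus_path, source_path, "/scan/counts", "/scan/motor")
```

Absolute paths are used here to keep the example unambiguous; with relative link targets, the two files have to be moved together.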

@t20100
Member

t20100 commented Nov 10, 2021

There are already some helpers to save NXData: nexusformat or silx.io.nxdata.save_NXdata.

Otherwise since this runs in a notebook, using matplotlib or any other plot library is probably best suited for tailored plots if not saved as NXData.

BTW, in silx view, there is a feature to create "virtual NXData" by dragging and dropping datasets as signal and axes, but to me it is a bit complex since one needs to know about NeXus to use it.

@loichuder
Member

Following on the crash issue, we have something in the works to solve it: silx-kit/h5web#616 (comment)

I will close this once this is shipped in a jupyterlab-h5web release.

@loichuder
Member

silx-kit/h5web#616 (comment) was integrated in v0.1.0, which is now deployed in jupyter-slurm.
