Add topostats file helper class #945

Draft
wants to merge 11 commits into
base: main

Conversation

SylviaWhittle
Collaborator

This PR would add a small helper class in topostats.io to assist users that want to explore & retrieve data contained in .topostats files, as we have had feedback from the experimentalists that navigating the .hdf5 file structure is prohibitively complex / difficult to do manually.

Previously, to load the file in a notebook, one had to:

from pathlib import Path

import h5py

from topostats.io import hdf5_to_dict


file = Path("./path/to/file.topostats")
with h5py.File(file, "r") as f:
    data_dict = hdf5_to_dict(f, "/")

# Then try to manually navigate the dictionary to find the specific item wanted
data = data_dict["ordered_trace_heights"]["0"]
# get the keys wrong
>>> KeyError
# manually print keys at each level, akin to doing lots of ls, cd
print(data_dict.keys())
data = data_dict["grain_trace_data"]
print(data.keys())
.
.
.

The TopoFileHelper class adds some methods to help with this:

  • pretty_print_structure() will print the entire structure (but not messy dictionaries of arrays!):
[./tests/resources/file.topostats]
├ filename
│   └ minicircle
├ grain_masks
│   └ above
│       └ Numpy array, shape: (1024, 1024), dtype: int64
├ grain_trace_data
│   └ above
│       ├ cropped_images
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_trace_cumulative_distances
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_trace_heights
│       │   └ 21 keys with numpy arrays as values
│       ├ ordered_traces
│       │   └ 21 keys with numpy arrays as values
│       └ splined_traces
│           └ 21 keys with numpy arrays as values
├ image
│   └ Numpy array, shape: (1024, 1024), dtype: float64
├ image_original
│   └ Numpy array, shape: (1024, 1024), dtype: float64
├ img_path
│   └ /Users/sylvi/Documents/TopoStats/tests/resources/minicircle
├ pixel_to_nm_scaling
│   └ 0.4940029296875
└ topostats_file_version
    └ 0.2
  • find_data() will perform a strict search for the keys (given in a list) and, if no match is found, a partial search to find possible matches that the user intended. E.g.:
topofilehelper.find_data(["ordered_trace_heights", "0"])
 [ Searching for ['ordered_trace_heights', '0'] in ./tests/resources/file.topostats ]
 | [search] No direct match found.
 | [search] Searching for partial matches.
 | [search] !! [ 1 Partial matches found] !!
 | [search] └ grain_trace_data/above/ordered_trace_heights/0
 └ [End of search]
  • get_data() simply retrieves data when provided with a key string separated by "/"s:
ordered_trace_heights = topofilehelper.get_data("grain_trace_data/above/ordered_trace_heights/0")
  • data_info() prints a little information about the value at a specific key:
topofilehelper.data_info("grain_trace_data/above/ordered_trace_heights/0")
topofilehelper.data_info("grain_trace_data/above/ordered_trace_heights")
Data at grain_trace_data/above/ordered_trace_heights/0 is a numpy array with shape: (95,), dtype: float64
Data at grain_trace_data/above/ordered_trace_heights is a dictionary with 21 keys of types {<class 'str'>} and values of types {<class 'numpy.ndarray'>}
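
To give a feel for what path-based retrieval and partial-match search involve, here is a minimal sketch over a plain nested dictionary. It is not the implementation in this PR: the function names `get_by_path`, `iter_paths` and `find_partial_matches` and the small `example` dictionary (with made-up values) are assumptions for illustration only.

```python
from typing import Any, Iterator, List


def get_by_path(data: dict, path: str) -> Any:
    """Return the value found by following a "/"-separated key path."""
    current: Any = data
    for key in path.strip("/").split("/"):
        current = current[key]  # raises KeyError if a key is missing
    return current


def iter_paths(data: dict, prefix: str = "") -> Iterator[str]:
    """Yield the "/"-separated path of every key in a nested dictionary."""
    for key, value in data.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        yield path
        if isinstance(value, dict):
            yield from iter_paths(value, path)


def find_partial_matches(data: dict, keys: List[str]) -> List[str]:
    """Return every path that ends with the given sequence of keys."""
    suffix = "/".join(keys)
    return [
        path
        for path in iter_paths(data)
        if path == suffix or path.endswith("/" + suffix)
    ]


# Hypothetical data that only mimics part of a .topostats layout.
example = {
    "grain_trace_data": {
        "above": {
            "ordered_trace_heights": {"0": [1.0, 2.0], "1": [3.0]},
        }
    },
    "pixel_to_nm_scaling": 0.4940029296875,
}

matches = find_partial_matches(example, ["ordered_trace_heights", "0"])
print(matches)
print(get_by_path(example, matches[0]))
```

Matching on the tail of the path is one simple way to recover the full location when the user only remembers the last few keys.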

No tests yet, would want feedback first

@ns-rse
Collaborator

ns-rse commented Oct 15, 2024

Not ignoring this, have had a scan through and it looks good but focusing on the various outstanding issues with the better tracing merger.

@SylviaWhittle
Collaborator Author

> Not ignoring this, have had a scan through and it looks good but focusing on the various outstanding issues with the better tracing merger.

All good, not wanting this to take time away from more important stuff, it can wait and I want user opinions first too :)

@MaxGamill-Sheffield
Collaborator

Can we work into this a notebook on opening and extracting data from the file too, to address the comments from the workshop day?

@SylviaWhittle
Collaborator Author

> Can we work into this a notebook on opening and extracting data from the file too, to address the comments from the workshop day?

ye

@SylviaWhittle
Collaborator Author

Added a notebook showing how to use the class in ./notebooks/

    Examples
    --------
    Creating a helper object.
    ```python
Collaborator

Not sure how this renders when Sphinx's AutoAPI parses it to generate the API docs for the website, since docstrings are reStructuredText. Might be worth using the .. code-block:: approach, see @MaxGamill-Sheffield's solution in commit b953634.
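
As an illustration of the `.. code-block::` approach, a docstring example might be written along these lines. The class body and the constructor argument are placeholders and not the code in this PR; only the docstring layout matters here.

```python
class TopoFileHelper:
    """
    Helper for exploring the contents of a .topostats file (placeholder sketch).

    Examples
    --------
    Creating a helper object.

    .. code-block:: python

        from topostats.io import TopoFileHelper

        helper = TopoFileHelper("./path/to/file.topostats")
    """

    def __init__(self, topofile: str) -> None:
        # Assumed constructor signature, for illustration only.
        self.topofile = topofile


print(TopoFileHelper.__doc__)
```

The directive needs a blank line after it and the code indented beneath it, otherwise reStructuredText treats the lines as ordinary text.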

Collaborator Author

Very good point thank you, I'll do that

@ns-rse
Collaborator

ns-rse commented Oct 17, 2024

I think the tests under Python 3.9 are failing because under that version we need the following import...

from __future__ import annotations
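
A small sketch of why that import can matter, assuming the failures come from newer annotation syntax such as `str | None` (the thread does not say which annotations are involved). Under Python 3.9 that expression is evaluated when the function is defined and raises `TypeError`; with the import, annotations are stored as strings and are not evaluated at definition time. The module source is held in a string here only so that the `__future__` import is the first statement of its own module, and the function `load` is hypothetical.

```python
SOURCE = '''
from __future__ import annotations


def load(path: str | None = None) -> dict[str, float] | None:
    """Hypothetical function using the newer union syntax in its annotations."""
    return None
'''

namespace: dict = {}
exec(compile(SOURCE, "<example>", "exec"), namespace)

# With the __future__ import the annotations are kept as plain strings,
# so "str | None" is never evaluated when the function is defined.
annotations = namespace["load"].__annotations__
print(annotations)
```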

Collaborator

I think the Notebook would be easier to read if the comments were moved to Markdown sections to delineate the code examples. Otherwise we may as well just include a code block in docs/advanced/topostats_file_helper.md, which would be a simpler solution to documentation: it's a web page people can go to, and they wouldn't need to activate a virtual environment and then start Jupyter. Code chunks could be copied and pasted (I'd have to work out how to enable a button to support that though).

@ns-rse
Collaborator

ns-rse commented Oct 30, 2024

Just came across h5glance after the Skan developer mentioned it on Mastodon. I wonder if using this would be a simpler solution, it sounds as though it might work within Jupyter Notebooks too.

@SylviaWhittle
Collaborator Author

SylviaWhittle commented Nov 24, 2024

> Just came across h5glance after the Skan developer mentioned it on Mastodon. I wonder if using this would be a simpler solution, it sounds as though it might work within Jupyter Notebooks too.

Good find!

It does work wonderfully in notebooks and it's even interactive! You can click on the items to expand / hide sub-fields.
[Screenshot: interactive h5glance view of the file in a notebook]

However this is just for looking at hdf5 files and not retrieving data from them (which would still require the with h5py.File('testfile.hdf5', 'r') as f: <load stuff manually>)

I propose that I replace the code I wrote to display the contents of the file with h5glance but keep the methods I wrote to do data retrieval?

@ns-rse
Collaborator

ns-rse commented Nov 25, 2024

Sounds like a plan.

From memory these are thin wrappers around accessing dictionary items directly, and I feel that it's substituting learning how to work with dictionaries directly with learning how to use the wrappers. I'm of the opinion that the more general skill (working directly with dictionaries) has broader benefits to users in the long term.

⚖️

@SylviaWhittle
Collaborator Author

There is an issue with h5glance: it must be called from within the notebook. It cannot be called from a wrapper in a standard .py file. If it is, this is the result:

[Screenshot of the resulting output]

I tried this:

    def pretty_print_structure(self) -> None:
        """
        Print the structure of the data in the data dictionary.

        The structure is printed with the keys indented to show the hierarchy of the data.
        """
        LOGGER.info(f"running h5glance")
        H5Glance(self.topofile)

and this:

    def pretty_print_structure(self) -> None:
        """
        Print the structure of the data in the data dictionary.

        The structure is printed with the keys indented to show the hierarchy of the data.
        """
        LOGGER.info(f"running h5glance")
        result = H5Glance(self.topofile)
        print(result)

and neither produces the nested output needed.

If users want the interactive h5glance notebook UI I screenshotted earlier, they must call it explicitly in a notebook, h5glance.H5Glance("./file.h5").

So either they must remember h5glance as a separate tool to TopoFileHelper or we could provide my worse implementation as a built-in alternative in case they don't remember to use h5glance?
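
One possible explanation, offered as an assumption rather than something established in this thread, is that `H5Glance` relies on Jupyter's rich display protocol: a notebook renders an object through its `_repr_html_()` method only when the object is the last expression in a cell or is passed to `IPython.display.display()`. Inside a method the return value is simply discarded, and `print()` falls back to the plain-text representation. The stand-in class `FakeGlance` below is hypothetical and only mimics that behaviour.

```python
class FakeGlance:
    """Hypothetical stand-in for an object with a rich HTML representation."""

    def __init__(self, filename: str) -> None:
        self.filename = filename

    def _repr_html_(self) -> str:
        # Jupyter calls this when the object is the value of a cell
        # or is passed to IPython.display.display().
        return f"<details><summary>{self.filename}</summary></details>"


viewer = FakeGlance("./file.h5")

# print() never calls _repr_html_(); it uses the default plain-text repr.
plain = str(viewer)
print(plain)

# This is the HTML a notebook front end would render instead.
html = viewer._repr_html_()
print(html)
```

If the assumption holds, passing the object to `IPython.display.display()` from inside the method might show the tree in a notebook, but that has not been tried here.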

@ns-rse
Collaborator

ns-rse commented Dec 3, 2024

I guess it depends how people are using the .topostats files? Are Notebooks widely used within the group and outside of it to explore .topostats files?

I'm not a great fan of re-inventing the 🛞 and if a tool exists I'll tend to advocate for its use over making something new which adds to our codebase and the overhead of maintenance.

As a general rule though, and perhaps I'm missing something, hdf5 files once loaded are essentially dictionaries. A wrapper to make exploring dictionaries easier is perhaps useful, but it still requires learning how to use the wrapper. I'd personally advocate for helping people learn how to explore dictionaries, as it gives them a transferable skill that can be used in lots of other scenarios.

I think I recall writing something like this when I wrote the original Notebooks (see for example the line after from topostats.io import read_yaml in this notebook). We could point people to tutorials on how to use .keys() and .values() and iteration over dictionaries.
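
A short sketch of that kind of exploration, using a small hand-written dictionary that only mimics part of the layout shown earlier in this thread (the values are made up and `walk` is a hypothetical helper):

```python
data_dict = {
    "filename": "minicircle",
    "grain_trace_data": {
        "above": {
            "ordered_trace_heights": {"0": [1.0, 2.0, 3.0], "1": [4.0, 5.0]},
        }
    },
    "pixel_to_nm_scaling": 0.4940029296875,
}

# Top-level keys, the equivalent of "ls" at the root.
top_level = list(data_dict.keys())
print(top_level)

# Step down one level at a time, checking the keys before indexing.
heights = data_dict["grain_trace_data"]["above"]["ordered_trace_heights"]
print(list(heights.keys()))

# Iterate over keys and values together.
lengths = {grain: len(trace) for grain, trace in heights.items()}
print(lengths)


def walk(node: dict, depth: int = 0) -> list:
    """Collect an indented line for every key in a nested dictionary."""
    lines = []
    for key, value in node.items():
        lines.append("  " * depth + str(key))
        if isinstance(value, dict):
            lines.extend(walk(value, depth + 1))
    return lines


print("\n".join(walk(data_dict)))
```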

The h5glance README points to using the h5py package when wanting to work with such files within code but then that is what you are using here.

Having re-read the Examples you've written, it looks like they demonstrate how to use the helper to view files in a Notebook, which people could do with h5glance, and it's one less thing to maintain within TopoStats ⚖️

@MaxGamill-Sheffield
Collaborator

Plan from the TopoStats code clean 08/01/25 is:

  • add as a module / in a module.
  • remove key finder function in favour of h5glance
  • take notebook from here and:
    • add docs and a section on h5glance
    • add docs and a section on the func that pulls the values
    • remove the helical periodicity stuff

@ns-rse
Collaborator

ns-rse commented Jan 8, 2025

Sorry to miss code clean, was embroiled in some bioinformatics work and didn't notice the time.

With regard to the Notebook, this might be an ideal opportunity to migrate to the newer marimo, which among other things has the major advantage of updating all dependent cells when an earlier one is re-run.

For more on the problems marimo solves see the faq.

I doubt there will be much of an overhead in migrating since it's still a notebook running the cells, so both Markdown and code cells could be copied over. There is a section on migrating from Jupyter.

3 participants