update docs
Karl5766 committed Aug 16, 2024
1 parent ea1ed55 commit b846450
Showing 10 changed files with 445 additions and 66 deletions.
16 changes: 16 additions & 0 deletions docs/API/imfs.rst
@@ -0,0 +1,16 @@
.. _imfs:

cvpl_tools/im/fs.py
============================

View source at `fs.py <https://github.com/khanlab/cvpl_tools/blob/main/src/cvpl_tools/im/fs.py>`_.

.. rubric:: APIs

.. autofunction:: cvpl_tools.im.fs.save
.. autofunction:: cvpl_tools.im.fs.load
.. autofunction:: cvpl_tools.im.fs.display
.. autoclass:: cvpl_tools.im.fs.CachePath
:members:
.. autoclass:: cvpl_tools.im.fs.CacheDirectory
:members:
13 changes: 13 additions & 0 deletions docs/API/ndblock.rst
@@ -0,0 +1,13 @@
.. _ndblock:

cvpl_tools/im/ndblock.py
============================

View source at `ndblock.py <https://github.com/khanlab/cvpl_tools/blob/main/src/cvpl_tools/im/ndblock.py>`_.

.. rubric:: APIs

.. autoclass:: cvpl_tools.im.ndblock.NDBlock
:members:
.. autoclass:: cvpl_tools.im.ndblock.ReprFormat
:members:
47 changes: 47 additions & 0 deletions docs/API/seg_process.rst
@@ -0,0 +1,47 @@
.. _seg_process:

cvpl_tools/im/seg_process.py
============================

View source at `seg_process.py <https://github.com/khanlab/cvpl_tools/blob/main/src/cvpl_tools/im/seg_process.py>`_.

Q: Why are there two base classes, SegProcess and BlockToBlockProcess? When I define my own pipeline, which
class should I subclass?

A: BlockToBlockProcess is a wrapper around SegProcess for processing where each output block has the same
size as its input block. For more general processing, such as steps whose output is a list of centroids, or
where the output shape of a block differs from its input shape, subclass SegProcess directly.
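
As an illustration of the distinction, a step whose output has a different shape than its input (here, one
count per block) would subclass SegProcess directly. This is only a sketch following the subclassing pattern
shown in the Getting Started pages; the class name and the thresholding logic are made up for this example.

.. code-block:: Python

    import dask.array as da
    from cvpl_tools.im.seg_process import SegProcess

    class BlockForegroundCount(SegProcess):
        """Output shape differs from input shape (one number per block), so SegProcess is the base."""

        def forward(self, im: da.Array, cid: str = None, viewer=None) -> da.Array:
            # count foreground voxels in each block; each output block is a single value,
            # so this is not a block-to-block (same-shape) computation
            return im.map_blocks(
                lambda b: (b > 0.4).sum(keepdims=True),
                chunks=(1,) * im.ndim,
                dtype='int64',
            )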

.. rubric:: APIs

.. autoclass:: cvpl_tools.im.seg_process.SegProcess
:members:
.. autoclass:: cvpl_tools.im.seg_process.BlockToBlockProcess
:members:
.. autofunction:: cvpl_tools.im.seg_process.lc_interpretable_napari

.. rubric:: Built-in Subclasses for SegProcess

.. autoclass:: cvpl_tools.im.seg_process.GaussianBlur
:members:
.. autoclass:: cvpl_tools.im.seg_process.BSPredictor
:members:
.. autoclass:: cvpl_tools.im.seg_process.SimpleThreshold
:members:
.. autoclass:: cvpl_tools.im.seg_process.BlobDog
:members:
.. autoclass:: cvpl_tools.im.seg_process.ScaledSumIntensity
:members:
.. autoclass:: cvpl_tools.im.seg_process.DirectBSToOS
:members:
.. autoclass:: cvpl_tools.im.seg_process.Watershed3SizesBSToOS
:members:
.. autoclass:: cvpl_tools.im.seg_process.BinaryAndCentroidListToInstance
:members:
.. autoclass:: cvpl_tools.im.seg_process.DirectOSToLC
:members:
.. autoclass:: cvpl_tools.im.seg_process.CountLCEdgePenalized
:members:
.. autoclass:: cvpl_tools.im.seg_process.CountOSBySize
:members:

42 changes: 42 additions & 0 deletions docs/GettingStarted/boilerplate.rst
@@ -0,0 +1,42 @@
.. _boilerplate:

Boilerplate Code
################

Each time we write a new script, we need to include some setup code that configures the dask library and the
other utilities our image processing pipeline relies on. Below is a brief description of how to set up these
utilities and how they are used when writing image processing pipelines with the
cvpl_tools.im.seg_process.SegProcess class.

Dask Cluster and temporary directory
************************************

Dask is a library for multithreaded and distributed computing, and not all intermediate results can be kept
in memory. When intermediate results do not fit in memory, they are written to the temporary directory set by
the dask config's temporary_directory variable. When working on an HPC system on Compute Canada, make sure
this path points to a /scratch directory whose file-count quota is large enough.

Setting up the client is described on the `Dask quickstart <https://distributed.dask.org/en/stable/quickstart.html>`_
page. We use a local laptop as the example here.

.. code-block:: Python

    import dask
    import dask.config
    import dask.array as da
    from dask.distributed import Client

    TMP_PATH = "path/to/temporary/directory"
    with dask.config.set({'temporary_directory': TMP_PATH}):
        client = Client(threads_per_worker=6, n_workers=1)
        print((da.zeros((3, 3)) + 1).compute().sum().item())  # prints 9.0

CacheDirectory
**************

Different from Dask's temporary directory, cvpl_tools' CacheDirectory class provides APIs for caching
intermediate results. A multi-step segmentation pipeline may produce many intermediate results: some of them
are computed once and discarded, while others (like the final output) we may want to cache on disk so they
can be accessed later without redoing the computation. In order to cache a result, we need a fixed path that
does not change across executions of the program. The **CacheDirectory** class manages and assigns paths for
these intermediate results, based on their cache ID (cid) and the parent CacheDirectory they belong to.
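
Below is a minimal sketch of how a CacheDirectory might be opened and queried, based on the calls shown on
the segmentation pipeline page; the cid string and the paths here are placeholders.

.. code-block:: Python

    import cvpl_tools.im.fs as imfs

    with imfs.CacheDirectory('path/to/CacheDirectory',
                             remove_when_done=False,  # keep cached results after the run finishes
                             read_if_exists=True) as temp_directory:
        # cache() assigns a fixed path under the directory for the given cid; cache_exists
        # tells us whether a previous run has already written a result there
        cache_exists, cache_path = temp_directory.cache(is_dir=True, cid='my_step')
        if not cache_exists:
            pass  # compute the result and write it into cache_path.path
        # load the result back from cache_path.path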

8 changes: 4 additions & 4 deletions docs/GettingStarted/ome_zarr.rst
@@ -79,15 +79,15 @@ Above + denotes collapsed folder and - denotes expanded folder. A few things to
is not a standard ZARR directory and contains no **.zarray** meta file. Loading an OME ZARR
image as ZARR will crash, if you forget to specify **0/** subfolder as the path to load
- When saved as a zip file instead of a directory, the directory structure is the same except that
the root is zipped. When loading a zipped OME ZARR, cvpl_tools uses ZipStore's features to read
individual chunks directly without having to unpack the entire zip file. However, writing to a ZipStore
is not supported, due to lack of support in both Python's zarr and the ome-zarr library.
- An HPC system like Compute Canada may work better with one large file than with many small files,
thus the result should be zipped. This can be done by first writing the folder somewhere
that allows creating many small files and then zipping the result into a single archive in the target
directory (a small sketch follows this list)
- As of the time of writing (2024.8.14), the ome-zarr library's Writer class has a `double computation
issue <https://github.com/ome/ome-zarr-py/issues/392>`_. To temporarily patch this for our
use case, I've added a **write_ome_zarr_image** function to write a dask array as an OME ZARR
file. This function also adds support for reading images stored as a **.zip** file.
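
One way to carry out the zipping step described in the list above is with the standard library; the paths
below are placeholders.

.. code-block:: Python

    import shutil

    # write the OME ZARR folder to scratch storage first (many small files), then pack it
    # into a single zip in the target directory; this creates /project/target/im_ome.zarr.zip
    shutil.make_archive('/project/target/im_ome.zarr', 'zip', root_dir='/scratch/tmp/im_ome.zarr')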

188 changes: 188 additions & 0 deletions docs/GettingStarted/segmentation_pipeline.rst
@@ -0,0 +1,188 @@
.. _segmentation_pipeline:

Segmentation Pipeline
#####################

Motivation: Microscope, Cell Counting, Atlas map and Object Segmentation
************************************************************************

In our use case, lightsheet microscopy of mouse brain produces several hundred GBs of
data, represented as a 3-dimensional array. This array is stored as an OME
ZARR file in our case. Distributed processing of the data is necessary to keep the analysis time
tractable, and we chose Dask as the distributed computing library for our project.

As part of our research, we need automated methods to count different types of visible objects in
the image. After counting cells, we use a map from pixel locations in
the image to brain regions (the **atlas map**) to obtain the density of cells in each region of
the mouse brain. We need to test several methods and find the one that gives the most
accurate counts, and for new incoming data volumes, we want to be able to quickly find a set of parameters
that works on the new data.

On the algorithm side, object counting consists of performing a sequence of steps that turn an input image
into densities over regions of the brain, and it is relatively simple to understand and implement on a Numpy
array for a small dataset. On larger datasets, we need long-running distributed computations
that are hard to debug and require tens of minutes or hours whenever we rerun the computation
(often just to display the results again).

The SegProcess Class
********************

The **SegProcess** class in module **cvpl_tools.im.seg_process** provides a convenient way to define
a step in a multi-step image processing pipeline for distributed, interpretable and cached image data analysis.

Consider a function that counts the number of cells in a 3d block of a brightness map:

.. code-block:: Python

    import dask.array as da

    def cell_count(block3d: da.Array):
        """Count the number of cells in a dask 3d brightness image"""
        mask = threshold(block3d, 0.4)  # use simple thresholding to mark pixels where cells are as 1s
        inst = instance_segmentation(mask)  # segment the mask into contours
        cell_cnt = count_inst(inst)  # count each contour as a cell
        return cell_cnt

There are a few issues:

1. Lack of interpretability. When we first run this function, some bug will often show up, for example
the output cell count may be unexpectedly large. Debugging this becomes a problem since we don't know
whether one of the three steps in the cell_count function did not work as expected, or whether the algorithm
simply does not work well on the input data. In either case, to find the root cause of the problem we very
often end up adding code and rerunning the pipeline to display the output of each step and check whether it
matches our expectations.

2. Result is not cached. Ideally the pipeline is run once and we get the result, but more often than
not the result may need to be used in different places (visualization, analysis etc.). Caching these
results makes sure computation is done only once, which is necessary when we work with costly algorithms
on hundreds of GBs of data.

SegProcess is designed to address these issues. The basic idea is to integrate visualization into the
cell_count function, and to cache the result of each step in a file within a **CacheDirectory**.

The class supports the following use cases:

1. dask support. Inputs are expected to be numpy arrays, dask arrays, or cvpl_tools.im.ndblock.NDBlock
objects. In particular, dask.Array and NDBlock are suitable for parallel or distributed image processing
workflows.

2. integration with Napari. The **forward()** function of a SegProcess object has a viewer parameter that
defaults to None. By passing a Napari viewer to this parameter, the forward pass will add intermediate
images or centroids to the Napari viewer for easier debugging.

3. intermediate result caching. The **CacheDirectory** class provides a hierarchical caching directory,
where each **forward()** call will either create a new cache directory or load from an existing one,
based on the **cid** parameter passed to the function.

Now we discuss how to define such a pipeline.

Extending the Pipeline
**********************

The first step of building a pipeline is to break a segmentation algorithm down to steps that process the
image in different formats. As an example, we may implement a pipeline as IN -> BS -> OS -> CC, where:

- IN - Input Image (np.float32) between min=0 and max=1, this is the brightness dask image as input
- BS - Binary Segmentation (3d, np.uint8), this is the binary mask single class segmentation
- OS - Ordinal Segmentation (3d, np.int32), a mask whose values range over 0-N, where each of 1-N denotes one object and 0 is background; also single class
- CC - Cell Count Map (3d, np.float32), a cell count number (estimate, can be float) for each block

Mapping from IN to BS comes with two choices. One is to simply take pixels above some threshold as cells and
the rest as background. The other is to use a trained machine learning model to do binary segmentation.
Mapping from BS to OS also comes with two choices: either directly treat each connected volume as a separate
cell, or use watershed to get a finer segmentation mask. Finally, we can count cells in the instance
segmentation mask by, for example, counting how many separate contours we have found.
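
As a concrete single-block illustration of the simplest choice at each stage (thresholding for IN -> BS,
connected components for BS -> OS, and a plain count for OS -> CC), a NumPy/SciPy sketch could look like the
following; the function names are made up here, and the built-in SegProcess subclasses listed in the API
reference are the distributed counterparts.

.. code-block:: Python

    import numpy as np
    import scipy.ndimage as ndi

    def threshold_bs(im: np.ndarray, thres: float = 0.4) -> np.ndarray:
        """IN -> BS: mark voxels above a brightness threshold as foreground"""
        return (im > thres).astype(np.uint8)

    def direct_os(bs: np.ndarray) -> np.ndarray:
        """BS -> OS: treat each connected volume as one object (labels 1-N, 0 is background)"""
        lbl, _ = ndi.label(bs)
        return lbl.astype(np.int32)

    def cell_cnt(os_im: np.ndarray) -> float:
        """OS -> CC: count the number of labeled objects in the block"""
        return float(os_im.max())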

In some cases this breakdown is not necessary if we already know which algorithm works best, but abstracting
the intermediate results into the four types IN, BS, OS and CC has helped us identify which parts of the
pipeline can be reused and which parts may have variations in the algorithm used.

We can then plan the processing steps we need to define as follows:

1. thresholding (IN -> BS)
2. model_prediction (IN -> BS)
3. direct_inst_segmentation (BS -> OS)
4. watershed_inst_segmentation (BS -> OS)
5. cell_cnt_from_inst (OS -> CC)

How do we go from this plan to actual code for these steps? Subclassing **SegProcess** is the recommended way.
This means creating a subclass that defines the **forward()** method, which takes arbitrary inputs
and two optional parameters: cid and viewer.

- cid specifies the subdirectory under the cache directory (set by the **set_tmpdir** method of the base
class) to save intermediate files. If not provided (cid=None),
then the cache will be saved in a temporary directory that will be removed when the CacheDirectory is
closed. If provided, this cache file will persist. Within the forward() method, you should use
self.tmpdir.cache() and self.tmpdir.cache_im() to create cache files:

.. code-block:: Python

    class ExampleSegProcess(SegProcess):
        def forward(self, im, cid: str = None, viewer: napari.Viewer = None):
            cache_exists, cache_path = self.tmpdir.cache(is_dir=True, cid=cid)
            # if the cache does not exist yet, cache_path.path is an empty path we can create a folder in:
            if not cache_exists:
                os.makedirs(cache_path.path)
                result = compute_result(im)
                save(cache_path.path, result)
            result = load(cache_path.path)
            # ...

- The viewer parameter specifies the napari viewer to display the intermediate results. If not provided
(viewer=None), then no computation will be done to visualize the image. Within the forward() method, you
should use viewer.add_labels(), lc_interpretable_napari() or temp_directory.cache_im() while passing in
the viewer_args argument to display your results:

.. code-block:: Python

    class ExampleSegProcess(SegProcess):
        def forward(self, im, cid: str = None, viewer: napari.Viewer = None):
            result = compute_result(im)
            result = self.tmpdir.cache_im(lambda: result, cid=cid, viewer_args=dict(
                viewer=viewer,  # the napari viewer; visualization will be skipped if viewer is None
                is_label=True,  # if True, viewer.add_labels() will be called; if False, viewer.add_image() will be called
                preferred_chunksize=(1, 4096, 4096),  # the image is converted to this chunk size when saved, and converted back when loaded
                multiscale=4 if viewer else 0,  # maximum downsampling level of the OME ZARR file, necessary for very large images
            ))
            return result

viewer_args is a parameter that allows us to visualize the saved results as part of the caching function. The
reason we need it is that displaying the saved result often requires a different (flatter) chunk size for fast
loading of cross-sectional images, and also requires downsampling for zooming in and out of larger images.

Running the Pipeline
********************

Now we have defined a ExampleSegProcess class, the next step is to write our script that uses the pipeline to
segment an input dataset. Note we need a dask cluster and a temporary directory setup before running the
forward() method.

.. code-block:: Python

    if __name__ == '__main__':  # only the main thread runs this; worker threads will not
        import dask
        from dask.distributed import Client
        import napari
        import cvpl_tools.im.fs as imfs

        TMP_PATH = "path/to/temporary/directory"
        with (dask.config.set({'temporary_directory': TMP_PATH}),
              imfs.CacheDirectory(
                  f'{TMP_PATH}/CacheDirectory',
                  remove_when_done=False,
                  read_if_exists=True) as temp_directory):
            client = Client(threads_per_worker=12, n_workers=1)
            im = load_im(path)  # this is our input dask.Array object to be segmented
            process = ExampleSegProcess()
            viewer = napari.Viewer()
            process.forward(im, cid='cell_count_cache', viewer=viewer)
            viewer.show(block=True)
            client.close()

See :ref:`Boilerplate Code <boilerplate>` for an explanation of the boilerplate code.

To learn more, see the API pages for cvpl_tools.im.seg_process, cvpl_tools.im.fs and
cvpl_tools.im.ndblock modules.
6 changes: 6 additions & 0 deletions docs/index.rst
@@ -31,11 +31,17 @@ or on cloud.

Introduction <self>
Viewing and IO of OME Zarr <GettingStarted/ome_zarr>
Boilerplate Code <GettingStarted/boilerplate>
Defining Segmentation Pipeline <GettingStarted/segmentation_pipeline>

.. toctree::
:maxdepth: 2
:caption: API Reference

cvpl_tools.napari.zarr.py <API/napari_zarr>
cvpl_tools.ome_zarr.io.py <API/ome_zarr_io>
cvpl_tools.im.fs <API/imfs>
cvpl_tools.im.ndblock <API/ndblock>
cvpl_tools.im.seg_process <API/seg_process>

