update docs about caching results
Karl5766 committed Sep 27, 2024
1 parent 8d229bd commit d84ae81
Showing 8 changed files with 192 additions and 14 deletions.
57 changes: 54 additions & 3 deletions docs/API/seg_process.rst
Original file line number Diff line number Diff line change
Expand Up @@ -35,10 +35,61 @@ output shape of that block, use :code:`BlockToBlockProcess`.
:members:
.. autoclass:: cvpl_tools.im.seg_process.BinaryAndCentroidListToInstance
:members:
.. autoclass:: cvpl_tools.im.seg_process.DirectOSToLC

bs_to_os
********
binary segmentation to ordinal segmentation

This section contains algorithms whose input is a binary (0-1) segmentation mask and whose output is an instance
segmentation (0-N) integer mask. The output ndarray has the same shape as the input.
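
To make the binary-to-ordinal conversion concrete, here is a minimal pure-Python sketch that labels 4-connected foreground regions of a small 2D mask. It is an illustration only and not the implementation of :code:`DirectBSToOS` or :code:`Watershed3SizesBSToOS`; the function name :code:`label_components` and the choice of 4-connectivity are assumptions made for this example.

```python
from collections import deque


def label_components(mask):
    """Label 4-connected foreground regions of a 2D 0/1 mask with 1..N.

    Hypothetical helper for illustration; not part of cvpl_tools.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or labels[r][c] != 0:
                continue  # background, or already assigned to an object
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels, next_label


mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
labels, n_objects = label_components(mask)
```

The output has the same shape as the input, with 0 kept for background and each object given its own integer from 1 to N.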

.. autoclass:: cvpl_tools.im.process.bs_to_os.DirectBSToOS
:members:
.. autoclass:: cvpl_tools.im.process.bs_to_os.Watershed3SizesBSToOS
:members:

lc_to_cc
********
list of centroids to cell counts

This section contains algorithms whose input is a 2d array, or one 2d array per block, describing the centroid
locations and meta information about the objects associated with the centroids in each block. The output is a single
number summarizing statistics for each block.

.. autoclass:: cvpl_tools.im.process.lc_to_cc.CountLCBySize
:members:
.. autoclass:: cvpl_tools.im.process.lc_to_cc.CountLCEdgePenalized
:members:

os_to_cc
********
ordinal segmentation to cell counts

This section contains algorithms whose input is an instance segmentation (0-N) integer mask and whose output is a single
number summarizing statistics for each block.

.. autoclass:: cvpl_tools.im.process.os_to_cc.CountOSBySize
:members:

os_to_lc
********
ordinal segmentation to list of centroids

This section contains algorithms whose input is an instance segmentation (0-N) integer mask and whose output is a list
of centroids with meta information.
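
As a rough illustration of this conversion, the sketch below computes one centroid per label from a small 2D instance mask and keeps the pixel count as meta information. This is an assumption-laden example, not the code behind :code:`DirectOSToLC`; the function name :code:`centroids_from_labels` and the output layout (row, column, pixel count) are made up for the sketch.

```python
def centroids_from_labels(labels):
    """Return {label: (row_centroid, col_centroid, pixel_count)} for labels 1..N.

    Hypothetical helper for illustration; not part of cvpl_tools.
    """
    sums = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab == 0:
                continue  # skip background
            sum_r, sum_c, count = sums.get(lab, (0, 0, 0))
            sums[lab] = (sum_r + r, sum_c + c, count + 1)
    # Divide the coordinate sums by the pixel count to get the mean position
    return {lab: (sr / n, sc / n, n) for lab, (sr, sc, n) in sums.items()}


labels = [
    [1, 1, 0, 0],
    [0, 0, 0, 2],
    [3, 0, 0, 2],
]
centroids = centroids_from_labels(labels)
```

Each entry pairs a centroid location with a size, which is the kind of per-object meta information a later counting step could use.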

.. autoclass:: cvpl_tools.im.process.os_to_lc.DirectOSToLC
:members:
.. autoclass:: cvpl_tools.im.seg_process.CountLCEdgePenalized

any_to_any
**********
other

This section contains image processing steps whose inputs and outputs may adapt to different types of data or are not
adequately described by the current classifications.

.. autoclass:: cvpl_tools.im.process.any_to_any.DownsamplingByIntFactor
:members:
.. autoclass:: cvpl_tools.im.seg_process.CountOSBySize
.. autoclass:: cvpl_tools.im.process.any_to_any.UpsamplingByIntFactor
:members:

131 changes: 131 additions & 0 deletions docs/GettingStarted/result_caching.rst
@@ -0,0 +1,131 @@
.. _result_caching:

Result Caching
##############

Overview
********
In many cases it's useful to cache some of the intermediate results instead of discarding them all once the
computation finishes. Consider the following situations, which you may have encountered when writing a long-running
image processing workflow:

1. The cell density for each region in the scan is computed but the numbers do not match what's expected,
so you want to display a heatmap of cell density in a graphical viewer. The only result you have is text
output in the console, so you must redo the computation to display it.

2. An error occurs and you need to find out why a step in the computation causes it, but it's
difficult to understand what went wrong without displaying some intermediate results to aid debugging.

3. Graphically showing how the algorithm works step-by-step would be very helpful in identifying the causes of
issues, but requires saving all the results to disk, chunked in a viewer-friendly format.

In all the cases above, caching the intermediate results helps reduce headaches and the risk of unknown errors that
come from the difficulty of debugging in an image processing and distributed computing environment. The basic strategy
we use is to cache all the results inside a directory tree. Each step saves all its
intermediate and final results onto a node in the tree. The node's children are directories saved by its
sub-steps.

Here, the outputs of a processing step (function) may contain intermediate images (such as .ome.zarr), log files
(.txt) and graphs generated by plotting libraries.
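
The directory-tree strategy can be sketched with the standard library alone. The example below is a minimal illustration, not the cvpl_tools implementation: the helper :code:`cached_step`, the :code:`result.json` file name and the :code:`pipeline/count_cells` layout are all assumptions chosen for this sketch.

```python
import json
import tempfile
from pathlib import Path


def cached_step(node: Path, compute):
    """Run compute() once; later calls read the saved result from the node.

    Hypothetical helper for illustration; not part of cvpl_tools.
    """
    result_file = node / "result.json"
    if result_file.exists():
        return json.loads(result_file.read_text())
    node.mkdir(parents=True, exist_ok=True)
    result = compute()
    result_file.write_text(json.dumps(result))
    return result


calls = []


def count_cells():
    calls.append("count_cells")  # record that the expensive step really ran
    return {"cell_count": 42}


with tempfile.TemporaryDirectory() as root:
    # Each step owns one node; sub-steps would use child directories of it
    node = Path(root) / "pipeline" / "count_cells"
    first = cached_step(node, count_cells)
    second = cached_step(node, count_cells)  # served from disk, not recomputed
```

Because the result lives on disk under the step's node, it can be inspected or displayed later without rerunning the step.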

We describe the CacheDirectory interface in detail below.

CacheRootDirectory
******************
Every cache directory tree starts with a CacheRootDirectory node at its root, which is the only node of that class in
the tree. In order to create a cache directory tree you need to create a CacheRootDirectory node, as follows:

.. code-block:: Python

    import cvpl_tools.im.fs as imfs  # assumed alias for the cvpl_tools.im.fs module

    with imfs.CacheRootDirectory(
            'path/to/root',
            remove_when_done=False,
            read_if_exists=True) as temp_directory:
        cache_dir = temp_directory.cache_subdir(cid='test')

This creates two directories, 'path/to/root' and 'path/to/root/dir_cache_test', on the first run.
The name of the subfolder indicates that it is a directory (:code:`dir`) and a
persistent cache (:code:`cache`) rather than a temporary folder in that location.
The next time the program is run, it will not create new folders but will read directly from the existing ones.

When :code:`remove_when_done=True` and :code:`read_if_exists=False`, we get a pure temporary cache directory that
will be deleted when the program finishes. The next time the program is run we always create a new one.
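
The effect of the two flags can be mimicked with a small context manager. This is a sketch under stated assumptions, not the library's code: :code:`cache_root` is a hypothetical stand-in, and in particular the choice to delete an existing directory when :code:`read_if_exists=False` is an assumption about how a fresh start is obtained.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def cache_root(path, remove_when_done, read_if_exists):
    """Hypothetical stand-in that mimics the two flags described above."""
    root = Path(path)
    if root.exists() and not read_if_exists:
        shutil.rmtree(root)  # assumption: stale results are discarded
    root.mkdir(parents=True, exist_ok=True)
    try:
        yield root
    finally:
        if remove_when_done:
            shutil.rmtree(root, ignore_errors=True)


base = Path(tempfile.mkdtemp())
target = base / "root"

# Persistent cache: the directory and its contents survive the with-block
with cache_root(target, remove_when_done=False, read_if_exists=True) as root:
    (root / "dir_cache_test").mkdir(exist_ok=True)
persistent_kept = (target / "dir_cache_test").is_dir()

# Pure temporary cache: starts empty and is deleted when the block exits
with cache_root(target, remove_when_done=True, read_if_exists=False) as root:
    fresh_start = not (root / "dir_cache_test").exists()
temporary_removed = not target.exists()

shutil.rmtree(base, ignore_errors=True)  # clean up the scratch location
```

The first block behaves like the persistent setup shown earlier, and the second like the pure temporary cache directory.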

CacheDirectory
**************
A CacheDirectory makes up a node in the cache directory tree that can contain zero or
more CacheDirectory and CachePath instances as its children. CacheRootDirectory is a
subclass of CacheDirectory.

When we create a CacheDirectory object, the directory is created if it does not exist; otherwise the
cache is read from disk. To know whether the directory was created anew,
use the attribute :code:`cache_dir.exists`. To create a sub-directory,
use the following format:

.. code-block:: Python

    sub_cache_path = cache_dir.cache_subpath(cid='subpath1')  # leaf node
    sub_cache_dir = cache_dir.cache_subdir(cid='subdir1')  # non-leaf node

Similarly, use :code:`sub_cache_path.exists` to determine whether the path exists. Note that even
though the CachePath class is named path rather than directory, it is a location representing a leaf node
that most often points to a directory rather than a file in the file system.

CachePointer
************
CachePointer is a struct containing two attributes: a parent directory and a cid indicating where
under this directory the pointer points. Both CachePath and CachePointer reference a location
where a file or directory may or may not exist, but CachePointer is designed to be flexible so that
you can decide whether to create a CacheDirectory node or a non-CacheDirectory (leaf) node. The following
are equivalent ways to create cache files and folders:

.. code-block:: Python

    sub_cache_path = cache_dir.cache_subpath(cid='subpath2')
    # Equivalently
    cptr = cache_dir.cache(cid='subpath2')
    sub_cache_path = cptr.subpath()

    sub_cache_dir = cache_dir.cache_subdir(cid='subdir2')
    # Equivalently
    cptr = cache_dir.cache(cid='subdir2')
    sub_cache_dir = cptr.subdir()  # subdir() yields a directory node, so name it accordingly

It may seem unnecessary to create a CachePointer instance just to defer the decision of whether to create
a CachePath or a CacheDirectory child, but it comes in handy when you want to design the interface for a
function where the caller does not need to care whether you want a leaf node or a non-leaf node.

.. code-block:: Python

    # implementation 1
    def compute(im, cptr):
        result = (im + 1) * 3
        cache_path = cptr.subpath()
        if not cache_path.exists:
            result.save(cache_path.abs_path)
        return load(cache_path.abs_path)

    # implementation 2 (functionally equivalent but creates two sub-directories)
    def compute(im, cptr):
        cache_dir = cptr.subdir()
        im2 = plus_one(im=im, cptr=cache_dir.cache('plus_one'))
        im3 = times_three(im=im2, cptr=cache_dir.cache('times_three'))
        return im3

    # cptr is passed by keyword: a positional argument cannot follow a keyword argument
    result = compute(im=input_im, cptr=temp_directory.cache(cid='compute'))
    # DISPLAY RESULT...

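
To show why deferring the leaf or non-leaf decision is convenient, here is a self-contained sketch that uses only the standard library. :code:`PointerSketch` is a hypothetical stand-in for CachePointer, not the cvpl_tools class; the :code:`leaf_` prefix is invented for this example, while the :code:`dir_cache_` prefix follows the subfolder naming mentioned earlier.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PointerSketch:
    """Hypothetical stand-in for CachePointer: a parent directory plus a cid."""
    parent: Path
    cid: str

    def subpath(self) -> Path:
        # Leaf node: return the location without creating anything, so the
        # callee can check whether a result was already written there.
        return self.parent / f"leaf_{self.cid}"

    def subdir(self) -> Path:
        # Non-leaf node: create the directory so sub-steps can nest under it.
        path = self.parent / f"dir_cache_{self.cid}"
        path.mkdir(parents=True, exist_ok=True)
        return path


def plus_one_times_three(values, cptr):
    """The callee, not the caller, decides that a single leaf is enough."""
    out = cptr.subpath()
    if not out.exists():
        out.write_text(",".join(str((v + 1) * 3) for v in values))
    return [int(v) for v in out.read_text().split(",")]


with tempfile.TemporaryDirectory() as root:
    top = PointerSketch(Path(root), "compute").subdir()
    top_name = top.name
    result = plus_one_times_three([1, 2, 3], PointerSketch(top, "result"))
```

The caller only hands over a pointer; whether the function stores one leaf or a whole sub-tree stays an implementation detail.
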
Tips
****
- When writing a compute function that caches to a single location, receive a CachePointer object instead of
a CachePath or CacheDirectory object. This brings flexibility, as it's up to the callee to decide whether
a sub-path or a sub-directory is needed, and the callee may even decide
not to create the directory at all if no cache is needed. This separates the function's implementation
from its interface.
- Dask may repeat some computations because it does not support on-disk caching directly. Using cache
files in each step can avoid this issue and help speed up computation.
- Cache the images in a viewer-readable format. For OME-ZARR, a flat image chunking scheme is
suitable for 2D viewers like Napari. Rechunking when loading back into memory may be slower but is usually
not a big issue.
9 changes: 4 additions & 5 deletions docs/GettingStarted/segmentation_pipeline.rst
Expand Up @@ -201,12 +201,11 @@ At this point you may have a better understanding of how these pipeline steps wo
it.

- For parameters that change how the images are processed, cvpl_tools' preference is to pass them
through the :code:`__init__` method of the :code:`SegProcess` subclass.

through the :code:`__init__` method of the :code:`SegProcess` subclass.
- For parameters that change how the viewer displays the image, or how the image is cached (caching is
often related to display e.g. storing chunks as flat images will allow faster cross section display in
Napari), these parameters are provided through the :code:`viewer_args` argument of the :code:`forward()`
function.
often related to display e.g. storing chunks as flat images will allow faster cross section display in
Napari), these parameters are provided through the :code:`viewer_args` argument of the :code:`forward()`
function.

To learn more, see the API pages for :code:`cvpl_tools.im.seg_process`, :code:`cvpl_tools.im.fs` and
:code:`cvpl_tools.im.ndblock` modules.
2 changes: 1 addition & 1 deletion docs/GettingStarted/setting_up_the_script.rst
Expand Up @@ -148,7 +148,7 @@ which there are intermediate files. To create a cache directory, we write
# Use case #1. Create a data directory for caching computation results
cache_path = temp_directory.cache_subpath(cid='some_cache_path')
if not cache_path.exists():
if not cache_path.exists:
    os.makedirs(cache_path.abs_path, exist_ok=True)
# PUT CODE HERE: Now write your data into cache_path.abs_path and load it back later
2 changes: 1 addition & 1 deletion docs/conf.py
Expand Up @@ -9,7 +9,7 @@
project = 'cvpl_tools'
copyright = '2024, KarlHanUW'
author = 'KarlHanUW'
release = '0.4.0'
release = '0.6.1'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
1 change: 1 addition & 0 deletions docs/index.rst
Expand Up @@ -33,6 +33,7 @@ or on cloud.
Viewing and IO of OME Zarr <GettingStarted/ome_zarr>
Setting Up the Script <GettingStarted/setting_up_the_script>
Defining Segmentation Pipeline <GettingStarted/segmentation_pipeline>
Result Caching <GettingStarted/result_caching>

.. toctree::
:maxdepth: 2
Empty file removed src/cvpl_tools/im/fs/__init__.py
Empty file.
4 changes: 0 additions & 4 deletions src/cvpl_tools/im/fs/imio.py

This file was deleted.
