update docs about caching results

khanlab · Sep 27, 2024 · d84ae81 · d84ae81
1 parent 8d229bd
commit d84ae81

Showing 8 changed files with 192 additions and 14 deletions.
diff --git a/docs/API/seg_process.rst b/docs/API/seg_process.rst
@@ -35,10 +35,61 @@ output shape of that block, use :code:`BlockToBlockProcess`.
     :members:
 .. autoclass:: cvpl_tools.im.seg_process.BinaryAndCentroidListToInstance
     :members:
-.. autoclass:: cvpl_tools.im.seg_process.DirectOSToLC
+
+bs_to_os
+********
+binary segmentation to ordinal segmentation
+
+This section contains algorithms whose input is a binary (0-1) segmentation mask and whose output is an instance
+segmentation (0-N) integer mask stored in an ndarray of the same shape as the input.
+
+.. autoclass:: cvpl_tools.im.process.bs_to_os.DirectBSToOS
+    :members:
+.. autoclass:: cvpl_tools.im.process.bs_to_os.Watershed3SizesBSToOS
+    :members:
+
+lc_to_cc
+********
+list of centroids to cell counts
+
+This section contains algorithms whose input is a 2d array, or one 2d array per block, describing the centroid
+locations and meta information about the objects associated with the centroids in each block. The output is a single
+number summarizing statistics for each block.
+
+.. autoclass:: cvpl_tools.im.process.lc_to_cc.CountLCBySize
+    :members:
+.. autoclass:: cvpl_tools.im.process.lc_to_cc.CountLCEdgePenalized
+    :members:
+
+os_to_cc
+********
+ordinal segmentation to cell counts
+
+This section contains algorithms whose input is an instance segmentation (0-N) integer mask and whose output is a
+single number summarizing statistics for each block.
+
+.. autoclass:: cvpl_tools.im.process.os_to_cc.CountOSBySize
+    :members:
+
+os_to_lc
+********
+ordinal segmentation to list of centroids
+
+This section contains algorithms whose input is an instance segmentation (0-N) integer mask and whose output is a
+list of centroids with meta information.
+
+.. autoclass:: cvpl_tools.im.process.os_to_lc.DirectOSToLC
     :members:
-.. autoclass:: cvpl_tools.im.seg_process.CountLCEdgePenalized
+
+any_to_any
+**********
+other
+
+This section contains image processing steps whose inputs and outputs may adapt to different types of data or are not
+adequately described by the current classifications.
+
+.. autoclass:: cvpl_tools.im.process.any_to_any.DownsamplingByIntFactor
     :members:
-.. autoclass:: cvpl_tools.im.seg_process.CountOSBySize
+.. autoclass:: cvpl_tools.im.process.any_to_any.UpsamplingByIntFactor
     :members:
 
diff --git a/docs/GettingStarted/result_caching.rst b/docs/GettingStarted/result_caching.rst
@@ -0,0 +1,131 @@
+.. _result_caching:
+
+Result Caching
+##############
+
+Overview
+********
+In many cases it's useful to cache some of the intermediate results instead of discarding them all once the
+computation finishes. Consider the following situations, which you may have encountered when writing a long-running
+image processing workflow:
+
+1. The cell density for each region in the scan is computed but the numbers do not match what's expected,
+   so you want to display a heatmap of cell density in a graphical viewer. The final result you got is text
+   output in the console, so displaying it requires redoing the computation.
+
+2. Some error occurs and you need to find out why a step in the computation causes the issue, but it's rather
+   difficult to understand what went wrong without displaying some intermediate results to aid debugging.
+
+3. Graphically showing how the algorithm works step by step would be very helpful in identifying causes of
+   issues, but requires saving all the results to disk, chunked in a viewer-friendly format.
+
+In all the cases above, caching the intermediate results helps reduce headaches and the risk of unknown errors that
+come from the difficulty of debugging in an image processing and distributed computing environment. The basic strategy
+we use to overcome these is to cache all the results inside a directory tree. Each step saves all its
+intermediate and final results onto a node in the tree. The node's children are directories saved by its
+sub-steps.
+
+Here, the outputs of a processing step (function) may contain intermediate images (such as .ome.zarr), log files
+(.txt) and graphs generated by plotting libraries.
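The directory-tree idea described above can be sketched with the standard library alone. This is a minimal illustration, not cvpl_tools code: the helper `save_step` and the step names `segment`, `threshold` and `watershed` are hypothetical, and a plain `log.txt` stands in for whatever images, logs or graphs a real step would save.

```python
import tempfile
from pathlib import Path


def save_step(node: Path, log: str) -> Path:
    """Create the directory for one step and write a small log file into it.

    Hypothetical helper for illustration; not part of cvpl_tools.
    """
    node.mkdir(parents=True, exist_ok=True)
    (node / 'log.txt').write_text(log)
    return node


root = Path(tempfile.mkdtemp())

# A top-level step owns one node; its sub-steps become child directories.
segment = save_step(root / 'segment', 'ran segmentation')
save_step(segment / 'threshold', 'ran threshold')
save_step(segment / 'watershed', 'ran watershed')

# List every cached file relative to the root, in a stable order.
layout = sorted(p.relative_to(root).as_posix() for p in root.rglob('*.txt'))
for entry in layout:
    print(entry)
```

Each step writes only inside its own node, so the output of any sub-step can be inspected or deleted without touching its siblings.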
+
+We describe the CacheDirectory interface in detail below.
+
+CacheRootDirectory
+******************
+Every cache directory tree starts with a CacheRootDirectory node at its root, which is the only node of that class in
+the tree. In order to create a cache directory tree you need to create a CacheRootDirectory node, as follows:
+
+.. code-block:: Python
+
+    import cvpl_tools.im.fs as imfs  # assumed import path for the imfs alias
+
+    with imfs.CacheRootDirectory(
+            'path/to/root',
+            remove_when_done=False,
+            read_if_exists=True) as temp_directory:
+        cache_dir = temp_directory.cache_subdir(cid='test')
+
+On the first run this creates two directories, 'path/to/root' and 'path/to/root/dir_cache_test'.
+The subfolder's name indicates that it is a directory (:code:`dir`) and a persistent cache
+(:code:`cache`) rather than a temporary folder in that location.
+The next time the program is run, it does not create new folders but reads directly from the existing ones.
+
+When :code:`remove_when_done=True` and :code:`read_if_exists=False`, we get a pure temporary cache directory that
+will be deleted when the program finishes. The next time the program is run we always create a new one.
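The difference between a persistent and a purely temporary root can be imitated with a small context manager. The sketch below is a hypothetical stand-in built on the standard library, not the cvpl_tools implementation: the function `cache_root` is invented for illustration, and deleting a pre-existing directory when `read_if_exists=False` is an assumption about the intended behaviour.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def cache_root(path, remove_when_done=False, read_if_exists=True):
    """Hypothetical stand-in for a cache root; not the cvpl_tools class."""
    root = Path(path)
    if root.exists() and not read_if_exists:
        # Assumption: a pure temporary cache never reuses an earlier run's results.
        shutil.rmtree(root)
    root.mkdir(parents=True, exist_ok=True)
    try:
        yield root
    finally:
        if remove_when_done:
            shutil.rmtree(root, ignore_errors=True)


base = Path(tempfile.mkdtemp())

# Persistent cache: the directory and its contents survive the with-block.
with cache_root(base / 'persistent') as root:
    (root / 'result.txt').write_text('42')
persistent_kept = (base / 'persistent' / 'result.txt').exists()

# Temporary cache: the directory is deleted when the with-block exits.
with cache_root(base / 'temporary', remove_when_done=True, read_if_exists=False) as root:
    (root / 'result.txt').write_text('42')
temporary_kept = (base / 'temporary').exists()

print(persistent_kept, temporary_kept)
shutil.rmtree(base)  # clean up the scratch area used by this sketch
```

Doing the cleanup in a `finally` clause means the temporary directory is removed even if the body of the `with` block raises.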
+
+CacheDirectory
+**************
+A CacheDirectory makes up a node in the cache directory tree that can contain zero or
+more CacheDirectory and CachePath instances as its children. CacheRootDirectory is a
+subclass of CacheDirectory.
+
+When we create a CacheDirectory object, the directory is created if it does not already exist; otherwise the
+cache is read from disk. To know whether the directory was created anew,
+check the attribute :code:`cache_dir.exists`. To create child nodes,
+use the following format:
+
+.. code-block:: Python
+
+    sub_cache_path = cache_dir.cache_subpath(cid='subpath1')  # leaf node
+    sub_cache_dir = cache_dir.cache_subdir(cid='subdir1')  # non-leaf node
+
+Similarly, use :code:`sub_cache_path.exists` to determine whether the path exists. Note that even
+though the CachePath class is named "path" rather than "directory", it represents a leaf node in the tree,
+and that leaf most often points to a directory rather than a file in the file system.
+
+CachePointer
+************
+CachePointer is a struct containing two attributes: a parent directory and a cid indicating where
+under this directory the pointer points. Both CachePath and CachePointer reference a location
+where a file or directory may or may not exist, but CachePointer is designed to be flexible so that
+you can decide later whether to create a CacheDirectory node or a non-CacheDirectory (leaf) node. The
+following shows equivalent ways to create cache files and folders:
+
+.. code-block:: Python
+
+    sub_cache_path = cache_dir.cache_subpath(cid='subpath2')
+    # Equivalently
+    cptr = cache_dir.cache(cid='subpath2')
+    sub_cache_path = cptr.subpath()
+
+    sub_cache_dir = cache_dir.cache_subdir(cid='subdir2')
+    # Equivalently
+    cptr = cache_dir.cache(cid='subdir2')
+    sub_cache_dir = cptr.subdir()
+
+It may seem unnecessary to create a CachePointer instance just to defer the decision of whether to create
+a CachePath or a CacheDirectory child, but it comes in handy when you want to design the interface for a
+function where the caller does not need to care whether you want a leaf node or a non-leaf node.
+
+.. code-block:: Python
+
+    # implementation 1
+    def compute(im, cptr):
+        cache_path = cptr.subpath()
+        if not cache_path.exists:
+            # compute and save only when no cached result is found
+            result = (im + 1) * 3
+            result.save(cache_path.abs_path)  # save() stands in for your own write routine
+        return load(cache_path.abs_path)  # load() stands in for your own read routine
+
+    # implementation 2 (functionally equivalent but creates two sub-directories)
+    def compute(im, cptr):
+        cache_dir = cptr.subdir()
+        im2 = plus_one(im=im, cptr=cache_dir.cache(cid='plus_one'))
+        im3 = times_three(im=im2, cptr=cache_dir.cache(cid='times_three'))
+        return im3
+
+    result = compute(im=input_im, cptr=temp_directory.cache(cid='compute'))
+
+    # DISPLAY RESULT...
+
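The deferred-decision pattern above can be tried out without cvpl_tools. The following is a runnable sketch under stated assumptions: `Pointer` is a hypothetical stand-in for CachePointer built on `pathlib`, results are stored as JSON instead of images, and the `calls` list exists only to show that the second call is served from the cache.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Pointer:
    """Hypothetical stand-in for a cache pointer: a parent directory plus a cid."""
    parent: Path
    cid: str

    def subpath(self) -> Path:
        # Leaf node: return the location only; the callee decides what to write there.
        return self.parent / self.cid

    def subdir(self) -> Path:
        # Non-leaf node: make sure the directory exists so children can be added.
        path = self.parent / self.cid
        path.mkdir(parents=True, exist_ok=True)
        return path


calls = []  # records every time the expensive branch actually runs


def compute(values, ptr):
    """Return (v + 1) * 3 for each value, reusing the cached file when present."""
    path = ptr.subpath()  # this callee chooses a leaf; the caller never needs to know
    if not path.exists():
        calls.append(ptr.cid)
        path.write_text(json.dumps([(v + 1) * 3 for v in values]))
    return json.loads(path.read_text())


root = Path(tempfile.mkdtemp())
first = compute([1, 2, 3], Pointer(root, 'compute'))
second = compute([1, 2, 3], Pointer(root, 'compute'))
print(first, second, calls)
```

Because the caller only hands over a pointer, `compute` could later switch to `ptr.subdir()` and split its work into sub-steps without any change at the call site.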
+
+Tips
+****
+- When writing a compute function that caches to a single location, accept a CachePointer object instead of
+  a CachePath or CacheDirectory object. This brings flexibility, as it's up to the callee to decide whether
+  a sub-path or a sub-directory is needed, and it may even decide not to create the directory at all if no
+  cache is needed. This separates the function's implementation from its interface.
+- Dask may run some computations more than once because it does not support on-disk caching directly. Using cache
+  files in each step avoids this issue and helps speed up computation.
+- Cache the images in a viewer-readable format. For OME-ZARR, a flat image chunking scheme is
+  suitable for 2D viewers like Napari. Rechunking when loading back into memory may be slower but is usually
+  not a big issue.
diff --git a/docs/GettingStarted/segmentation_pipeline.rst b/docs/GettingStarted/segmentation_pipeline.rst
@@ -201,12 +201,11 @@ At this point you may have a better understanding of how these pipeline steps wo
 it.
 
 - For parameters that change how the images are processed, cvpl_tools' preference is to pass them
-through the :code:`__init__` method of the :code:`SegProcess` subclass.
-
+  through the :code:`__init__` method of the :code:`SegProcess` subclass.
 - For parameters that change how the viewer displays the image, or how the image is cached (caching is
-often related to display e.g. storing chunks as flat images will allow faster cross section display in
-Napari), these parameters are provided through the :code:`viewer_args` argument of the :code:`forward()`
-function.
+  often related to display e.g. storing chunks as flat images will allow faster cross section display in
+  Napari), these parameters are provided through the :code:`viewer_args` argument of the :code:`forward()`
+  function.
 
 To learn more, see the API pages for :code:`cvpl_tools.im.seg_process`, :code:`cvpl_tools.im.fs` and
 :code:`cvpl_tools.im.ndblock` modules.
diff --git a/docs/GettingStarted/setting_up_the_script.rst b/docs/GettingStarted/setting_up_the_script.rst
@@ -148,7 +148,7 @@ which there are intermediate files. To create a cache directory, we write
 
             # Use case #1. Create a data directory for caching computation results
             cache_path = temp_directory.cache_subpath(cid='some_cache_path')
-            if not cache_path.exists():
+            if not cache_path.exists:
                 os.makedirs(cache_path.abs_path, exist_ok=True)
                 # PUT CODE HERE: Now write your data into cache_path.abs_path and load it back later
 

diff --git a/docs/conf.py b/docs/conf.py
@@ -9,7 +9,7 @@
 project = 'cvpl_tools'
 copyright = '2024, KarlHanUW'
 author = 'KarlHanUW'
-release = '0.4.0'
+release = '0.6.1'
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

diff --git a/docs/index.rst b/docs/index.rst
@@ -33,6 +33,7 @@ or on cloud.
    Viewing and IO of OME Zarr <GettingStarted/ome_zarr>
    Setting Up the Script <GettingStarted/setting_up_the_script>
    Defining Segmentation Pipeline <GettingStarted/segmentation_pipeline>
+   Result Caching <GettingStarted/result_caching>
 
 .. toctree::
    :maxdepth: 2

diff --git a/src/cvpl_tools/im/fs/__init__.py b/src/cvpl_tools/im/fs/__init__.py
diff --git a/src/cvpl_tools/im/fs/imio.py b/src/cvpl_tools/im/fs/imio.py