Refactors caching examples to be in a single place
Updates links and adds README.
skrawcz committed Feb 20, 2024
1 parent c9ef352 commit 2cfe00c
Showing 14 changed files with 62 additions and 20 deletions.
2 changes: 1 addition & 1 deletion docs/how-tos/cache-nodes.rst
Original file line number Diff line number Diff line change
@@ -6,4 +6,4 @@ Sometimes it is convenient to cache intermediate nodes. This is especially usefu

For example, if a particular node takes a long time to calculate (perhaps it extracts data from an outside source or performs some heavy computation), you can annotate it with a "cache" tag. The first time the DAG is executed, that node will be cached to disk. If you then do some development on any of the downstream nodes, subsequent executions will load the cached node instead of repeating the computation.

See the examples `here <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes>`_.
20 changes: 5 additions & 15 deletions examples/caching_nodes/README.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,8 @@
# Caching Graph Adapter
Here you'll find two adapters that allow you to cache the results of your functions.

The first one is the `DiskCacheAdapter`, which uses the `diskcache` library to store the results on disk.

The second one is the `CachingGraphAdapter`, which requires you to tag the functions to cache, along with the serialization format to use.

Both have their sweet spots and trade-offs. We invite you to play with them and provide feedback on which one you prefer.
1 change: 0 additions & 1 deletion examples/caching_nodes/business_logic.py

This file was deleted.

18 changes: 18 additions & 0 deletions examples/caching_nodes/caching_graph_adapter/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Caching Graph Adapter

You can use `CachingGraphAdapter` to cache certain nodes.

This is great for:

1. Iterating during development, where you don't want to recompute certain expensive function calls.
2. Providing some lightweight means to control recomputation in production, by controlling whether a "cached file" exists or not.

For iterating during development, the general process would be:

1. Write your functions.
2. Mark them with `@tag(cache="SERIALIZATION_FORMAT")`.
3. Use the `CachingGraphAdapter` and pass it to the Driver to turn on caching for these functions.
   a. If at any point in your development you need to re-run a cached node, pass its name
   to the adapter via the `force_compute` argument. Then this node and its downstream
   nodes will be computed instead of loaded from cache.
4. When caching is no longer required, simply omit step (3) and the nodes will be computed as usual.
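The workflow above can be sketched in plain Python. This is an illustrative stand-in, not Hamilton's actual implementation: the `cached` decorator, its cache layout, and the example function are all made up here to mirror the steps described (results keyed by function name, serialized in a chosen format, with a `force_compute` escape hatch).

```python
import functools
import json
import pickle
from pathlib import Path


def cached(fmt: str, cache_dir: str = ".cache", force_compute: frozenset = frozenset()):
    """Cache a function's result on disk in the given serialization format.

    Illustrative sketch only -- it mirrors the tag-and-serialize idea,
    not Hamilton's CachingGraphAdapter internals. Results are keyed by
    function name, analogous to nodes in a DAG.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            Path(cache_dir).mkdir(parents=True, exist_ok=True)
            path = Path(cache_dir) / f"{fn.__name__}.{fmt}"
            # Load from cache unless the file is missing or recomputation is forced.
            if path.exists() and fn.__name__ not in force_compute:
                if fmt == "json":
                    return json.loads(path.read_text())
                return pickle.loads(path.read_bytes())
            result = fn(*args, **kwargs)
            if fmt == "json":
                path.write_text(json.dumps(result))
            else:
                path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


@cached("json")  # analogous to marking a function with tag(cache="json")
def expensive_sum(numbers: list) -> int:
    """Stands in for a slow node; a second call loads the cached value."""
    return sum(numbers)
```

Passing a node's name via `force_compute` recomputes it even when a cached file exists, matching step (3a); deleting the cached file by hand has the same effect.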
35 changes: 35 additions & 0 deletions examples/caching_nodes/caching_graph_adapter/business_logic.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
"""
Copied from the hello world example.
"""

import pandas as pd


def avg_3wk_spend(spend: pd.Series) -> pd.Series:
"""Rolling 3 week average spend."""
return spend.rolling(3).mean()


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
"""The cost per signup in relation to spend."""
return spend / signups


def spend_mean(spend: pd.Series) -> float:
"""Shows function creating a scalar. In this case it computes the mean of the entire column."""
return spend.mean()


def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
"""Shows function that takes a scalar. In this case to zero mean spend."""
return spend - spend_mean


def spend_std_dev(spend: pd.Series) -> float:
"""Function that computes the standard deviation of the spend column."""
return spend.std()


def spend_zero_mean_unit_variance(spend_zero_mean: pd.Series, spend_std_dev: float) -> pd.Series:
"""Function showing one way to make spend have zero mean and unit variance."""
return spend_zero_mean / spend_std_dev
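Since these are plain Python functions, you can exercise the same transformations directly with pandas (no Driver needed) to see what each computes. The spend and signup numbers below are made up for illustration; the expressions mirror the function bodies above:

```python
import pandas as pd

# Made-up inputs, chosen so the results are easy to verify by hand.
spend = pd.Series([10.0, 20.0, 30.0])
signups = pd.Series([1.0, 2.0, 3.0])

avg_3wk = spend.rolling(3).mean()   # NaN until 3 periods of data exist
cost_per_signup = spend / signups   # elementwise division

mean = spend.mean()                 # a scalar computed from a column
zero_mean = spend - mean            # the scalar fed back into a Series op

std = spend.std()                   # sample standard deviation
unit_variance = zero_mean / std     # zero mean, unit variance
```

Hamilton would wire these same steps together by matching parameter names to function names, but running them by hand like this is a quick way to sanity-check each node.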
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# DiskCache Adapter
This adapter uses [diskcache](https://grantjenks.com/docs/diskcache/tutorial.html) to cache node execution on disk. The cache key is a tuple of the function's
`(source code, input a, ..., input n)`. This means a function will only be executed once for a given set of inputs
and source-code hash. The cache is stored in a directory of your choice, and it can be shared across different runs of your
code. That way, as you develop, if the inputs and the code haven't changed, the function will not be executed again and the cached result will be loaded instead.
@@ -16,7 +16,7 @@ Disk cache has great features to:
> cache (both keys and values). Learn more about [caveats](https://grantjenks.com/docs/diskcache/tutorial.html#caveats).
> ❓ To store artifacts robustly, please use Hamilton materializers or the
> [CachingGraphAdapter](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes/caching_graph_adapter) instead.
> The `CachingGraphAdapter` stores tagged nodes directly on the file system using common formats (JSON, CSV, Parquet, etc.).
> However, it isn't aware of your function version and requires you to manually manage your disk space.
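The `(source code, inputs)` key scheme described above can be sketched with the standard library. This hypothetical helper is not diskcache's actual key computation: it hashes the function's compiled bytecode (a stand-in for its source) together with its arguments, so either editing the function or changing an input produces a new key:

```python
import hashlib


def cache_key(fn, *args) -> str:
    """Return a hex digest keyed by the function's code and its inputs.

    A sketch of the idea, not diskcache's real scheme: the compiled
    bytecode stands in for the source-code hash, so editing the function
    body (or passing different inputs) invalidates old entries.
    Assumes every input has a stable repr().
    """
    code = fn.__code__
    payload = repr((code.co_code, code.co_consts, args))
    return hashlib.sha256(payload.encode()).hexdigest()


def square(x: int) -> int:
    return x * x
```

`cache_key(square, 3)` is stable across calls, but changes if you edit `square` or call it with a different argument, which is exactly the invalidation behavior the key tuple above is designed to give you.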
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
