Refactors caching examples to be in a single place
Updates links and adds README.
skrawcz committed Feb 20, 2024
1 parent c9ef352 commit 2cfe00c
Showing 14 changed files with 62 additions and 20 deletions.
2 changes: 1 addition & 1 deletion docs/how-tos/cache-nodes.rst
Original file line number Diff line number Diff line change
@@ -6,4 +6,4 @@ Sometimes it is convenient to cache intermediate nodes. This is especially usefu

For example, if a particular node takes a long time to calculate (perhaps it extracts data from an outside source or performs some heavy computation), you can annotate it with a "cache" tag. The first time the DAG is executed, that node will be cached to disk. If you then do some development on any of the downstream nodes, subsequent executions will load the cached node instead of repeating the computation.

See the examples `here <https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes>`_.
20 changes: 5 additions & 15 deletions examples/caching_nodes/README.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,8 @@
# Caching Graph Adapter
Here you'll find two adapters that allow you to cache the results of your functions.

The first one is the `DiskCacheAdapter`, which uses the `diskcache` library to store the results on disk.

The second one is the `CachingGraphAdapter`, which requires you to tag the functions to cache, along with the serialization format to use.

Both have their sweet spots and trade-offs. We invite you to play with them and provide feedback on which one you prefer.
1 change: 0 additions & 1 deletion examples/caching_nodes/business_logic.py

This file was deleted.

18 changes: 18 additions & 0 deletions examples/caching_nodes/caching_graph_adapter/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Caching Graph Adapter

You can use `CachingGraphAdapter` to cache certain nodes.

This is great for:

1. Iterating during development, where you don't want to recompute certain expensive function calls.
2. Providing some lightweight means to control recomputation in production, by controlling whether a "cached file" exists or not.

For iterating during development, the general process would be:

1. Write your functions.
2. Mark them with `@tag(cache="SERIALIZATION_FORMAT")`.
3. Use the `CachingGraphAdapter` and pass it to the Driver to turn on caching for these functions.
   a. If at any point in your development you need to re-run a cached node, pass its name
   to the adapter via the `force_compute` argument. Then this node and its downstream
   nodes will be computed instead of loaded from cache.
4. When caching is no longer required, simply omit step (3) and the nodes will be computed as usual.
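The workflow above can be sketched in plain Python. This is an illustrative stand-in, not Hamilton's actual implementation: the `cached` decorator, its cache layout, and the example function are all made up here to mirror the steps described (results keyed by function name, serialized in a chosen format, with a `force_compute` escape hatch).

```python
import functools
import json
import pickle
from pathlib import Path


def cached(fmt: str, cache_dir: str = ".cache", force_compute: frozenset = frozenset()):
    """Cache a function's result on disk in the given serialization format.

    Illustrative sketch only -- it mirrors the tag-and-serialize idea,
    not Hamilton's CachingGraphAdapter internals. Results are keyed by
    function name, analogous to nodes in a DAG.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            Path(cache_dir).mkdir(parents=True, exist_ok=True)
            path = Path(cache_dir) / f"{fn.__name__}.{fmt}"
            # Load from cache unless the file is missing or recomputation is forced.
            if path.exists() and fn.__name__ not in force_compute:
                if fmt == "json":
                    return json.loads(path.read_text())
                return pickle.loads(path.read_bytes())
            result = fn(*args, **kwargs)
            if fmt == "json":
                path.write_text(json.dumps(result))
            else:
                path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


@cached("json")  # analogous to marking a function with tag(cache="json")
def expensive_sum(numbers: list) -> int:
    """Stands in for a slow node; a second call loads the cached value."""
    return sum(numbers)
```

Passing a node's name via `force_compute` recomputes it even when a cached file exists, matching step (3a); deleting the cached file by hand has the same effect.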
35 changes: 35 additions & 0 deletions examples/caching_nodes/caching_graph_adapter/business_logic.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
"""
Copied from the hello world example.
"""

import pandas as pd


def avg_3wk_spend(spend: pd.Series) -> pd.Series:
"""Rolling 3 week average spend."""
return spend.rolling(3).mean()


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
"""The cost per signup in relation to spend."""
return spend / signups


def spend_mean(spend: pd.Series) -> float:
"""Shows function creating a scalar. In this case it computes the mean of the entire column."""
return spend.mean()


def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
"""Shows function that takes a scalar. In this case to zero mean spend."""
return spend - spend_mean


def spend_std_dev(spend: pd.Series) -> float:
"""Function that computes the standard deviation of the spend column."""
return spend.std()


def spend_zero_mean_unit_variance(spend_zero_mean: pd.Series, spend_std_dev: float) -> pd.Series:
"""Function showing one way to make spend have zero mean and unit variance."""
return spend_zero_mean / spend_std_dev
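Since these are plain Python functions, you can exercise the same transformations directly with pandas (no Driver needed) to see what each computes. The spend and signup numbers below are made up for illustration; the expressions mirror the function bodies above:

```python
import pandas as pd

# Made-up inputs, chosen so the results are easy to verify by hand.
spend = pd.Series([10.0, 20.0, 30.0])
signups = pd.Series([1.0, 2.0, 3.0])

avg_3wk = spend.rolling(3).mean()   # NaN until 3 periods of data exist
cost_per_signup = spend / signups   # elementwise division

mean = spend.mean()                 # a scalar computed from a column
zero_mean = spend - mean            # the scalar fed back into a Series op

std = spend.std()                   # sample standard deviation
unit_variance = zero_mean / std     # zero mean, unit variance
```

Hamilton would wire these same steps together by matching parameter names to function names, but running them by hand like this is a quick way to sanity-check each node.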
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# DiskCache Adapter
This adapter uses [diskcache](https://grantjenks.com/docs/diskcache/tutorial.html) to cache node execution on disk. The cache key is a tuple of the function's
`(source code, input a, ..., input n)`. This means a function will only be executed once for a given set of inputs
and source-code hash. The cache is stored in a directory of your choice, and it can be shared across different runs of your
code. That way, as you develop, if the inputs and the code haven't changed, the function will not be executed again and the cached result will be loaded instead.
@@ -16,7 +16,7 @@ Disk cache has great features to:
> cache (both keys and values). Learn more about [caveats](https://grantjenks.com/docs/diskcache/tutorial.html#caveats).
> ❓ To store artifacts robustly, please use Hamilton materializers or the
> [CachingGraphAdapter](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes/caching_graph_adapter) instead.
> The `CachingGraphAdapter` stores tagged nodes directly on the file system using common formats (JSON, CSV, Parquet, etc.).
> However, it isn't aware of your function version and requires you to manually manage your disk space.
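The `(source code, inputs)` key scheme described above can be sketched with the standard library. This hypothetical helper is not diskcache's actual key computation: it hashes the function's compiled bytecode (a stand-in for its source) together with its arguments, so either editing the function or changing an input produces a new key:

```python
import hashlib


def cache_key(fn, *args) -> str:
    """Return a hex digest keyed by the function's code and its inputs.

    A sketch of the idea, not diskcache's real scheme: the compiled
    bytecode stands in for the source-code hash, so editing the function
    body (or passing different inputs) invalidates old entries.
    Assumes every input has a stable repr().
    """
    code = fn.__code__
    payload = repr((code.co_code, code.co_consts, args))
    return hashlib.sha256(payload.encode()).hexdigest()


def square(x: int) -> int:
    return x * x
```

`cache_key(square, 3)` is stable across calls, but changes if you edit `square` or call it with a different argument, which is exactly the invalidation behavior the key tuple above is designed to give you.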
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
