How to preserve assets after subsequent materialisation #15386
-
This is a feature I'm also keen on.
-
Hey @PeterJCLaw - that's accurate: Dagster expects assets to be overwritten by default. This is the default MO in data warehousing, because keeping every historical dataset would require too much storage in many contexts. Dagster ultimately leaves it up to users to decide how assets are stored, either by handling storage directly or by providing custom IO managers.
If you want to store all historical versions of your assets rather than overwriting them, an important question to answer is: "when reading the asset to materialize downstream assets, do you always want to read the latest materialized version?"
If the answer to the above question is "yes", then here's a pattern that you can implement in the code you write that stores and reads your data (either in your IO manager or your
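from dagster import AssetKey

# `context` is whatever context object exposes the Dagster instance at the point where you read the data.
latest_materialization_event = context.instance.get_latest_materialization_event(AssetKey("my_upstream_asset"))
latest_materialization_run_id = latest_materialization_event.run_id
path_to_load_from = f"blabla/{latest_materialization_run_id}/blabla"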
...
If the answer is "no" to the above question, then you'll need to use partitions or config to determine where to read from.
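For the "yes" case, here is a minimal sketch of how the write and read sides could fit together. It is deliberately duck-typed and does not import Dagster: `RunScopedPickleStore`, the directory layout and the `value.pkl` file name are all made up for illustration. The only things it assumes about the context objects are the attributes `run_id`, `asset_key` (with a `.path` list) and `instance`, which are meant to mirror Dagster's output and input contexts. In a real project the class would presumably subclass Dagster's IO manager base class instead.

```python
import pickle
from pathlib import Path


class RunScopedPickleStore:
    """Duck-typed sketch of an IO manager; not a real Dagster IOManager subclass."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def _path(self, asset_key, run_id):
        # Hypothetical layout: <base_dir>/<asset key parts>/<run_id>/value.pkl
        return self.base_dir.joinpath(*asset_key.path, run_id, "value.pkl")

    def handle_output(self, context, obj):
        # Each run writes into its own directory, so earlier versions are kept.
        path = self._path(context.asset_key, context.run_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context):
        # Ask the instance which run most recently materialized the upstream asset,
        # rather than relying on the run id of the run that is doing the loading.
        event = context.instance.get_latest_materialization_event(context.asset_key)
        if event is None:
            raise FileNotFoundError(f"{context.asset_key} has never been materialized")
        with self._path(context.asset_key, event.run_id).open("rb") as f:
            return pickle.load(f)
```

The point of the design is that the read path never needs the upstream run id to be passed in: it is looked up from the materialization history at load time, so a downstream asset can be materialized on its own and still find the newest upstream version.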
-
I'd also be quite interested in this, and it does seem like a pretty common requirement for lots of machine-learning pipelines/steps. Any updates or interesting finds from anyone since this was posted?
-
This is picking up from a Slack thread, prior to maybe becoming a feature request.
I'm trying to understand what Dagster's model is with regard to the history of assets.
I had initially assumed that each materialisation of an asset would be stored separately and that it would be possible (either indefinitely or perhaps within some time-bound) to refer to any version of materialised assets. Given that assets are things such as the output of a data pipeline, training data for a model or a trained ML model, combined with the effort which Dagster goes to around tracking lineage of assets and minimising refreshing of unchanged assets, it had felt obvious to me that historic versions would be available -- to ensure reproducibility of the training and enable comparison of model performance over time.
However it appears that instead Dagster simply overwrites assets on each materialisation. I haven't found anywhere in the docs that this is explicitly stated, so I'm not sure if there's something I'm missing which would either confirm that this is intentional or a setting somewhere which does what I'm after. There is a comment at #14733 (comment) which suggests that this overwrite behaviour is assumed, though even that seems to be in passing rather than definitive.
With some guidance in Slack I've had a little play with using a custom IO Manager which includes a run-id in the path an asset is saved to, however assets saved this way cannot subsequently be used downstream as the run-id is not available in the InputContext when triggering a materialisation of a child asset. (I'm guessing that it might be available if I materialised several at the same time, however that would be a pretty substantial limitation to development speed if all materialisations then needed to refresh everything.)
I've seen that some of this can be achieved through asset partitioning, however many of the use-cases I'm thinking about don't feel like they fit into a partitioned data model -- things like a trained model, variations of a model under different training configs, the set of training data it had, etc.
Is it possible to configure Dagster to keep all versions of assets? Or is it expected that outputs from a DAG will be manually exported somewhere for persistence? If the latter, are there any examples of how to do this in a manner which keeps track of where the asset came from (i.e: including all the inputs, config, etc.)?
Is there something I'm missing here about Dagster's asset model?