How to preserve assets after subsequent materialisation #15386

PeterJCLaw · 2023-07-19T16:02:15Z

This picks up from a Slack thread, and may later become a feature request.

I'm trying to understand what Dagster's model is with regard to the history of assets.

I had initially assumed that each materialisation of an asset would be stored separately, and that it would be possible (either indefinitely or within some time bound) to refer to any version of a materialised asset. Assets are things such as the output of a data pipeline, training data for a model, or a trained ML model, and Dagster goes to some effort to track asset lineage and avoid refreshing unchanged assets. Given that, it had felt obvious to me that historic versions would be available, to ensure reproducibility of training and to enable comparison of model performance over time.

However, it appears that Dagster instead simply overwrites assets on each materialisation. I haven't found anywhere in the docs that states this explicitly, so I'm not sure whether I'm missing something that would confirm this is intentional, or a setting that does what I'm after. There is a comment at #14733 (comment) which suggests that this overwrite behaviour is assumed, though even that seems to be in passing rather than definitive.

With some guidance in Slack I've had a little play with a custom IO Manager which includes the run ID in the path an asset is saved to. However, assets saved this way cannot subsequently be used downstream, as the writer's run ID is not available in the InputContext when triggering a materialisation of a child asset. (I'm guessing that it might be available if I materialised several assets in the same run; however, that would be a pretty substantial limitation on development speed if every materialisation then needed to refresh everything.)
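
To make the limitation concrete, here is a minimal sketch of such a path scheme in plain Python, with no Dagster imports. The `versioned_path` helper, the storage root and the run IDs are hypothetical names chosen for illustration. The writer's run ID is baked into the path, so a later run that only knows its own run ID computes a different path:

```python
from pathlib import PurePosixPath


def versioned_path(base: str, asset_name: str, run_id: str) -> PurePosixPath:
    """Build a storage path that embeds the ID of the run doing the I/O."""
    return PurePosixPath(base) / asset_name / run_id / "data.pkl"


# Run "run-a" materialises the upstream asset and writes it here.
written_to = versioned_path("/storage", "training_data", run_id="run-a")

# A later run "run-b" materialises only the child asset. Its own run ID is
# all it has to hand, so the same scheme points at a path that was never written.
looked_up_at = versioned_path("/storage", "training_data", run_id="run-b")

print(written_to)    # /storage/training_data/run-a/data.pkl
print(looked_up_at)  # /storage/training_data/run-b/data.pkl
```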

I've seen that some of this can be achieved through asset partitioning. However, many of the use cases I'm thinking about don't feel like they fit into a partitioned data model: things like a trained model, variations of a model under different training configs, the set of training data it had, etc.

Is it possible to configure Dagster to keep all versions of assets? Or is it expected that outputs from a DAG will be manually exported somewhere for persistence? If the latter, are there any examples of how to do this in a manner which keeps track of where the asset came from (i.e. including all the inputs, config, etc.)?
Is there something I'm missing here about Dagster's asset model?

rgasper · 2023-08-10T14:38:46Z

This is a feature I'm also keen on.

sryza · 2023-08-10T15:04:45Z

Hey @PeterJCLaw - it's accurate that Dagster expects assets to be overwritten by default. This is the default MO in data warehousing, because maintaining historical datasets would require too much data storage in many contexts.

Dagster ultimately leaves it up to users to decide how assets are stored, either by handling storage directly, or by providing custom IO managers.

If you want to store all historical versions of your assets rather than overwriting them, an important question to answer is: "when reading the asset to materialize downstream assets, do you always want to read the latest materialized version?"

If the answer to the above question is "yes", then here's a pattern that you can implement in the code you write that stores and reads your data (either in your IO manager or your @asset-decorated function, depending on your preference):

1. Include the run ID in the path where you store your data.
2. When loading your data, query the DagsterInstance to find the run ID associated with the latest materialization of that asset:

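from dagster import AssetKey

# Returns None if the asset has never been materialized.
latest_materialization_event = context.instance.get_latest_materialization_event(AssetKey("my_upstream_asset"))
if latest_materialization_event is None:
    raise RuntimeError("my_upstream_asset has not been materialized yet")
latest_materialization_run_id = latest_materialization_event.run_id
path_to_load_from = f"blabla/{latest_materialization_run_id}/blabla"
...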
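
Putting the two steps together, an IO manager following this pattern might look roughly like the sketch below. The class name `RunScopedPickleIOManager`, the pickle format and the directory layout are hypothetical. The class is deliberately not a `dagster.IOManager` subclass so that it runs without Dagster installed; it only assumes that the output context exposes `run_id` and `asset_key`, and that the input context exposes `instance` (as in the snippet above) and an `asset_key` referring to the upstream asset being loaded.

```python
import pickle
from pathlib import Path
from typing import Any


class RunScopedPickleIOManager:
    """Stores every materialisation under <base_dir>/<asset path>/<run_id>/.

    Hypothetical sketch: in a Dagster project this class would subclass
    `dagster.IOManager`; it is left as a plain class here so the example
    runs without Dagster installed.
    """

    def __init__(self, base_dir: str) -> None:
        self._base_dir = Path(base_dir)

    def _path_for(self, asset_key: Any, run_id: str) -> Path:
        # `asset_key.path` is the list of string components of the asset key.
        return self._base_dir.joinpath(*asset_key.path, run_id, "data.pkl")

    def handle_output(self, context: Any, obj: Any) -> None:
        # Step 1: the writer's own run ID goes into the path, so nothing is overwritten.
        path = self._path_for(context.asset_key, context.run_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context: Any) -> Any:
        # Step 2: ask the instance which run last materialised the upstream asset.
        event = context.instance.get_latest_materialization_event(context.asset_key)
        if event is None:
            raise FileNotFoundError(f"No materialisation recorded for {context.asset_key}")
        with self._path_for(context.asset_key, event.run_id).open("rb") as f:
            return pickle.load(f)
```

Older versions stay on disk under their own run IDs, so they remain available for comparison even though downstream loads always follow the latest materialization.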

If the answer is "no" to the above question, then you'll need to use partitions or config to determine where to read from.
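
For the config route, one possible shape is to let config pin specific assets to specific run IDs and fall back to the latest run otherwise. This is only a sketch under assumptions: `resolve_run_id`, `pinned_run_ids` and the run IDs below are hypothetical, and how the mapping reaches the loading code (IO manager config, asset config, or something else) is left open.

```python
from typing import Callable, Mapping, Optional


def resolve_run_id(
    asset_name: str,
    pinned_run_ids: Mapping[str, str],
    latest_run_id: Callable[[str], Optional[str]],
) -> str:
    """Pick which stored version of an asset to read.

    An explicit pin from config wins; otherwise fall back to the latest run.
    """
    if asset_name in pinned_run_ids:
        return pinned_run_ids[asset_name]
    run_id = latest_run_id(asset_name)
    if run_id is None:
        raise LookupError(f"{asset_name} has no recorded materialisation")
    return run_id


# Hypothetical config: reproduce a training run against an older dataset.
pinned = {"training_data": "run-2023-07-01"}
latest = {"training_data": "run-2023-08-10", "trained_model": "run-2023-08-11"}.get

print(resolve_run_id("training_data", pinned, latest))  # run-2023-07-01
print(resolve_run_id("trained_model", pinned, latest))  # run-2023-08-11
```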

judahrand Aug 15, 2023

I wonder what you think of this solution, @sryza?

from typing import assert_never  # Python 3.11+

import dagster
import upath
import upath.implementations.cloud

# NOTE: the imports below are private Dagster internals (underscore-prefixed
# module and functions) and may change or disappear between releases.
from dagster._core.errors import DagsterUndefinedDataVersionError
from dagster._core.execution.plan.execute_step import (
    _get_code_version,
    _get_input_provenance_data,
    compute_logical_data_version,
    extract_data_version_from_entry,
)


def get_logical_data_version(context: dagster.OutputContext | dagster.InputContext) -> dagster.DataVersion:
    match context:
        case dagster.OutputContext():
            code_version = _get_code_version(context.asset_key, context.step_context)
            input_provenance_data = _get_input_provenance_data(context.asset_key, context.step_context)
            return compute_logical_data_version(
                code_version,
                {k: meta["data_version"] for k, meta in input_provenance_data.items()},
            )

        case dagster.InputContext():
            # Use the latest version of the asset from the event log.
            instance = context.instance or context.step_context.instance
            if (data_version_record := instance.get_latest_data_version_record(context.asset_key)) is None:
                raise DagsterUndefinedDataVersionError()
            if (data_version := extract_data_version_from_entry(data_version_record.event_log_entry)) is None:
                raise DagsterUndefinedDataVersionError()
            return data_version

        case _ as unreachable:
            assert_never(unreachable)


class UPathIOManager(dagster.UPathIOManager):
    """
    A custom extension of Dagster's `UPathIOManager`.

    This extension allows us to:
        - persist the history of all assets stored.
    """
    def get_asset_relative_path(self, context: dagster.InputContext | dagster.OutputContext) -> upath.UPath:
        return super().get_asset_relative_path(context) / get_logical_data_version(context).value

The only issue with this solution is that it breaks when a user provides a manual Data Version override. This could be rectified if Dagster made the DataVersion provided by the user available in the OutputContext. Is that something that a PR might be accepted for? It would, I think, make the above solution fairly resilient.

I think that the above solution should also work with Dagster's memoization functionality, by making the asset storage content-addressable?
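
As a rough illustration of the content-addressable idea, independent of Dagster: if the storage path is derived from a hash of the code version and the versions of all inputs, then identical code and inputs always map to the same path, and any change maps to a new one. The `logical_version` and `content_addressed_path` helpers and the example values are hypothetical, and the hashing shown here is not the scheme Dagster uses internally.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Mapping


def logical_version(code_version: str, input_versions: Mapping[str, str]) -> str:
    """Hash the code version together with each input's version.

    Illustrative only: this is not the hashing scheme Dagster uses internally.
    """
    hasher = hashlib.sha256()
    hasher.update(code_version.encode())
    # Sort so the result does not depend on the order inputs were listed in.
    for name in sorted(input_versions):
        hasher.update(f"\0{name}={input_versions[name]}".encode())
    return hasher.hexdigest()


def content_addressed_path(base: str, asset_name: str, version: str) -> PurePosixPath:
    """Place each logical version of an asset in its own directory."""
    return PurePosixPath(base) / asset_name / version


v1 = logical_version("model-v1", {"training_data": "abc", "config": "lr=0.1"})
v1_again = logical_version("model-v1", {"config": "lr=0.1", "training_data": "abc"})
v2 = logical_version("model-v1", {"training_data": "abc", "config": "lr=0.01"})

assert v1 == v1_again  # same code and inputs: same path, so the stored result can be reused
assert v1 != v2        # a changed input gets a new path instead of overwriting the old one
```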

d-goldin · 2025-02-03T20:01:00Z
