allow different `PartitionsDefinition`s within an `AssetsDefinition` #23824

sryza · 2024-08-22T15:07:46Z

Summary & Motivation

This makes it possible for assets within a single AssetsDefinition to have different PartitionsDefinitions. I.e. the example below is now allowed:

from dagster import AssetSpec, MaterializeResult, StaticPartitionsDefinition, multi_asset

partitions_def1 = StaticPartitionsDefinition(["a", "b", "c", "d"])
partitions_def2 = StaticPartitionsDefinition(["1", "2", "3"])


@multi_asset(
    specs=[
        AssetSpec("my_asset_1", partitions_def=partitions_def1),
        AssetSpec("my_asset_2", partitions_def=partitions_def2),
    ],
    can_subset=True,
)
def my_assets(context):
    for asset_key in context.selected_asset_keys:
        yield MaterializeResult(asset_key=asset_key)

For this to be allowed, can_subset must be true. This means that we don't need to change anything about the underlying execution. If someone wants to launch an execution that targets two assets with different PartitionsDefinitions within the same AssetsDefinition, then the assets are just executed in separate runs.

This will help address a common complaint we've received with respect to our dbt integration, which is that you need to do a separate @dbt_assets(select=...) per PartitionsDefinition.

How I Tested These Changes

Added unit tests.
dagster dev with the example above, launched runs, clicked around the launchpad, looked at the asset details page.

Changelog

AssetSpecs may now contain a partitions_def. Different AssetSpecs passed to the same invocation of @multi_asset can now have different PartitionsDefinitions, as long as can_subset=True.

OwenKephart

What happens if you create an invalid run that targets multiple assets within the multi asset which have different partitions defs?

I assume it just gets caught by the same machinery that would catch this in the non-multi-asset case, but might be nice to have a unit test there.

Otherwise I think this seems great -- just commenting instead of approving as I'm sure others will want to weigh in.

Long-term-vision-wise, I'm imagining a world where a run can have an AssetGraphSubset attached to it and so we could have a context.get_partition_subset(asset_key) to figure out the specific partitions for a given asset or something, which this fits fine with.

sryza · 2024-09-10T20:55:00Z

What happens if you create an invalid run that targets multiple assets within the multi asset which have different partitions defs?

Oo good question. Added a block in a test:

    with pytest.raises(
        DagsterInvalidDefinitionError,
        match="Selected assets must have the same partitions definitions, but the selected assets ",
    ):
        materialize(assets=[my_assets], partition_key="b")

Long-term-vision-wise, I'm imagining a world where a run can have an AssetGraphSubset attached to it and so we could have a context.get_partition_subset(asset_key) to figure out the specific partitions for a given asset or something, which this fits fine with.

Totally

## Summary & Motivation This is a non-behavior-changing PR that refactors some methods within the internal execution context classes. This aim of this PR is to prepare for #23824, by centralizing the logic for getting the `PartitionsDefinition` for an execution. Working on this made me aware of a behavioral weirdness that I think we should consider addressing in 1.9: `context.partition_keys` etc. will return a non-`None` value in steps that target no partitioned assets, if other steps in the same run do target partitioned assets. ## How I Tested These Changes ## Changelog NOCHANGELOG

schrockn

Queue management.

I think should include or have upstack a motivating example that would demonstrate how this would be used in the dbt use base.

sryza · 2024-09-25T17:36:08Z

@schrockn I'll put together that example, but FYI this is in a reviewable state.

schrockn · 2024-09-25T17:56:45Z

Given that the selected assets must be the same partition definition, isn't it more straightforward/better for now (for the consumer at least, the multi_asset splitting would be complicated)) to create N multi_assets split up by N partition definitions? This makes it so that we still handle all arbitrary subset selections (which this scheme does not allow) but "gets the job done" for now for the dbt use case.

Long story short, I'm asking if we should wait to support this until we implement "Long-term-vision-wise, I'm imagining a world where a run can have an AssetGraphSubset attached to it and so we could have a context.get_partition_subset(asset_key) to figure out the specific partitions for a given asset or something, which this fits fine with." as @OwenKephart describes.

sryza · 2024-09-25T22:35:59Z

Here's a change we'd be able to make in purina if this (and related dbt support) were merged: https://github.com/dagster-io/internal/pull/11784.

isn't it more straightforward/better for now (for the consumer at least, the multi_asset splitting would be complicated)) to create N multi_assets split up by N partition definitions?

If I understand correctly, you mean don't make framework changes right now, but instead update the dbt integration etc. to create multiple multi-assets instead of just one? A couple limitations with that approach:

The @dbt_assets decorator currently produces a single AssetsDefinition object, so we don't have a straightforward way of returning multiple multi-assets. If we switched to my_dbt.build_defs, this would be less of a problem in the default case, but opens the question of what customization looks like.
Currently, non-partitioned assets and partitioned assets can be executed as part of a single Dagster run, as long as all the partitioned assets have the same partitioning. If this PR were to be merged, it would be possible to do that within a single step.

This makes it so that we still handle all arbitrary subset selections (which this scheme does not allow) but "gets the job done" for now for the dbt use case.

Are you able to say more about what you mean by "handle"? It will still be possible, after this PR, to select a set of differently-partitioned assets that have the same AssetsDefinition and click the Materialize button in the UI.

That all said, I don't think there are mega-pressing reasons we need to try to merge this now, if we have reservations about it. The two reasons I had been pressing on this:

Helps with a broader refactor I had been working on for how we represent asset definitions internally. However I don't have much bandwidth to move this forward in the near term.
Complaints from Nick Roach about using our dbt integration.

schrockn · 2024-09-26T17:00:50Z

Are you able to say more about what you mean by "handle"? It will still be possible, after this PR, to select a set of differently-partitioned assets that have the same AssetsDefinition and click the Materialize button in the UI.

Hmmm I am getting confused by this sentence and and attached screenshot.

And yes for sure https://github.com/dagster-io/internal/pull/11784 is super motivating.

That all said, I don't think there are mega-pressing reasons we need to try to merge this now, if we have reservations about it.

I think this is worthy workstream. But the framework change is a big one. Just asking questions about alternatives.

And yeah I'm imagining a new entry point for this advanced use case that produces multiple AssetsDefinitions. Obviously downsides there in terms of surface area.

sryza · 2024-09-26T19:09:12Z

Hmmm I am getting confused by this sentence and and attached screenshot.

Ah yes reasonable confusion:

When you click the "Materialize" button in the UI in this situation, the execution will be launched as a backfill.
When you invoke materialize in Python, it will fail, because materialize can only launch a single run.

This is equivalent to the experience you'd get today if the assets were split across multiple AssetsDefinitions.

In an ideal future, the materialize function would be able to carry out execution that touches assets with different partitions (either by orchestrating an in-process backfill or via a single run that can handle different partitions).

schrockn

Ok thanks for working through this one. Supportive of this!

I just have one outstanding question. This., IMO, is a very "advanced" API likely to be used by a small number of sophisticated users and integration authors.

Should we make this a "builder" method on AssetSpec similar to with_io_manager? I'm concerned about leading mainstream users astray with multi_asset, asset, and AssetSpec having the same variable name and causing a bunch of confusion when there are conflicts or whatever.

We could even make the builder method have a more specific name to very explicitly say its use case.

with_asset_specific_partitions_def or something (even though it is redundant admittedly)

I have a long docblock explaining its usage.

We would need some sort of principal around when to make stuff "builder methods" versus top-level property but just trying to solve the short-term problem.

OwenKephart · 2024-12-19T19:19:11Z

[dagster-dbt][refactor] Use AssetOut.from_spec() to create AssetOuts for dbt_assets decorator #26613 : 2 dependent PRs (#26614 , #26625 )
allow different PartitionsDefinitions within an AssetsDefinition #23824 👈 (View in Graphite)
master

This stack of pull requests is managed by Graphite. Learn more about stacking.

OwenKephart · 2024-12-19T19:30:15Z

@schrockn

Should we make this a "builder" method on AssetSpec similar to with_io_manager? I'm concerned about leading mainstream users astray with multi_asset, asset, and AssetSpec having the same variable name and causing a bunch of confusion when there are conflicts or whatever.

I think it's harder to do the builder approach with this particular parameter vis-a-vis with_io_manager, because partitions_def is a proper attribute of AssetSpec, whereas io_manager_key is just a fancy tag.

I agree that there's a risk of confusion when people are creating multi_assets etc., but in practice because it is a real property I think it might be even more confusing if we try to hide that parameter and force you to use a separate method

updated

python_modules/dagster/dagster/_core/definitions/assets.py

schrockn · 2024-12-19T20:06:03Z

I think we should move forward with this. However to talk this out this does add multiple ways of doing things as well as invalid parameter combinations that are only caught at runtime.

In particular:

@multi_asset(specs=[
    spec.replace_attributes(partition_def=some_partition_def) 
    for spec in [AssetSpec("asset_one"), AssetSpec("asset_two")]
])
def an_asset() -> None: ...

and

@multi_asset(
    specs=[AssetSpec("asset_one"), AssetSpec("asset_two")], 
    partition_def=some_partition_def,
)
def an_asset() -> None: ...

I don't think it is worth deprecating partition_def on multi_asset but it would definitely feel more "pure."

That all being said, I support this path forward, just want to write out the tradeoffs.

sryza force-pushed the assets-def-different-partitions-defs branch 4 times, most recently from c44fe49 to a0a0d47 Compare September 6, 2024 17:40

sryza mentioned this pull request Sep 6, 2024

refactor step context partition methods #24293

Merged

sryza force-pushed the assets-def-different-partitions-defs branch from a0a0d47 to 75b569c Compare September 6, 2024 20:07

sryza changed the base branch from master to mv-step-context-partitions-def-methods September 6, 2024 20:07

sryza requested review from schrockn, smackesey and OwenKephart September 6, 2024 21:00

sryza force-pushed the mv-step-context-partitions-def-methods branch from 5765906 to ba81226 Compare September 6, 2024 22:06

sryza force-pushed the assets-def-different-partitions-defs branch from 75b569c to dee20d5 Compare September 6, 2024 22:06

sryza force-pushed the mv-step-context-partitions-def-methods branch from ba81226 to d44afc0 Compare September 6, 2024 22:15

sryza force-pushed the assets-def-different-partitions-defs branch 2 times, most recently from af8f164 to 6ff1131 Compare September 6, 2024 23:42

OwenKephart reviewed Sep 9, 2024

View reviewed changes

sryza force-pushed the assets-def-different-partitions-defs branch from 6ff1131 to c58bf99 Compare September 10, 2024 20:55

sryza force-pushed the mv-step-context-partitions-def-methods branch from d44afc0 to 9444922 Compare September 10, 2024 22:11

Base automatically changed from mv-step-context-partitions-def-methods to master September 11, 2024 17:06

schrockn requested changes Sep 25, 2024

View reviewed changes

schrockn previously requested changes Sep 30, 2024

View reviewed changes

OwenKephart force-pushed the assets-def-different-partitions-defs branch from c58bf99 to e5c69f3 Compare December 19, 2024 19:18

OwenKephart force-pushed the assets-def-different-partitions-defs branch from e5c69f3 to f490efa Compare December 19, 2024 19:42

OwenKephart requested a review from schrockn December 19, 2024 19:42

schrockn reviewed Dec 19, 2024

View reviewed changes

python_modules/dagster/dagster/_core/definitions/assets.py Outdated Show resolved Hide resolved

schrockn approved these changes Dec 19, 2024

View reviewed changes

AssetSpec.partitions_def

f4f8a97

OwenKephart force-pushed the assets-def-different-partitions-defs branch from f490efa to f4f8a97 Compare December 19, 2024 21:10

This was referenced Dec 19, 2024

[dagster-dbt][refactor] Use AssetOut.from_spec() to create AssetOuts for dbt_assets decorator #26613

Merged

[dagster-dbt][refactor] Clean up multi-asset creation #26614

Merged

OwenKephart merged commit 234e10c into master Dec 19, 2024
1 check was pending

OwenKephart deleted the assets-def-different-partitions-defs branch December 19, 2024 21:33

This was referenced Dec 20, 2024

[dagster-dbt] Add get_partitions_def() method to DagsterDbtTranslator #26625

Merged

[dagster-dbt] Spec-ify dbt cloud integration #26662

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

allow different `PartitionsDefinition`s within an `AssetsDefinition` #23824

allow different `PartitionsDefinition`s within an `AssetsDefinition` #23824

sryza commented Aug 22, 2024 •

edited by OwenKephart

Loading

OwenKephart left a comment

sryza commented Sep 10, 2024

schrockn left a comment

sryza commented Sep 25, 2024

schrockn commented Sep 25, 2024

sryza commented Sep 25, 2024

schrockn commented Sep 26, 2024

sryza commented Sep 26, 2024

schrockn left a comment

OwenKephart commented Dec 19, 2024 •

edited

Loading

OwenKephart commented Dec 19, 2024

schrockn commented Dec 19, 2024

allow different PartitionsDefinitions within an AssetsDefinition #23824

allow different PartitionsDefinitions within an AssetsDefinition #23824

Conversation

sryza commented Aug 22, 2024 • edited by OwenKephart Loading

Summary & Motivation

How I Tested These Changes

Changelog

OwenKephart left a comment

Choose a reason for hiding this comment

sryza commented Sep 10, 2024

schrockn left a comment

Choose a reason for hiding this comment

sryza commented Sep 25, 2024

schrockn commented Sep 25, 2024

sryza commented Sep 25, 2024

schrockn commented Sep 26, 2024

sryza commented Sep 26, 2024

schrockn left a comment

Choose a reason for hiding this comment

OwenKephart commented Dec 19, 2024 • edited Loading

OwenKephart commented Dec 19, 2024

schrockn commented Dec 19, 2024

allow different `PartitionsDefinition`s within an `AssetsDefinition` #23824

allow different `PartitionsDefinition`s within an `AssetsDefinition` #23824

sryza commented Aug 22, 2024 •

edited by OwenKephart

Loading

OwenKephart commented Dec 19, 2024 •

edited

Loading