Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow different PartitionsDefinitions within an AssetsDefinition #23824

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

sryza
Copy link
Contributor

@sryza sryza commented Aug 22, 2024

Summary & Motivation

This makes it possible for assets within a single AssetsDefinition to have different PartitionsDefinitions. I.e. the example below is now allowed:

from dagster import AssetSpec, MaterializeResult, StaticPartitionsDefinition, multi_asset

partitions_def1 = StaticPartitionsDefinition(["a", "b", "c", "d"])
partitions_def2 = StaticPartitionsDefinition(["1", "2", "3"])


@multi_asset(
    specs=[
        AssetSpec("my_asset_1", partitions_def=partitions_def1),
        AssetSpec("my_asset_2", partitions_def=partitions_def2),
    ],
    can_subset=True,
)
def my_assets(context):
    for asset_key in context.selected_asset_keys:
        yield MaterializeResult(asset_key=asset_key)

For this to be allowed, can_subset must be true. This means that we don't need to change anything about the underlying execution. If someone wants to launch an execution that targets two assets with different PartitionsDefinitions within the same AssetsDefinition, then the assets are just executed in separate runs.

This will help address a common complaint we've received with respect to our dbt integration, which is that you need to do a separate @dbt_assets(select=...) per PartitionsDefinition.

How I Tested These Changes

  • Added unit tests.
  • dagster dev with the example above, launched runs, clicked around the launchpad, looked at the asset details page.

Changelog

AssetSpecs may now contain a partitions_def. Different AssetSpecs passed to the same invocation of @multi_asset can now have different PartitionsDefinitions, as long as can_subset=True.

@sryza sryza force-pushed the assets-def-different-partitions-defs branch 4 times, most recently from c44fe49 to a0a0d47 Compare September 6, 2024 17:40
@sryza sryza force-pushed the assets-def-different-partitions-defs branch from a0a0d47 to 75b569c Compare September 6, 2024 20:07
@sryza sryza changed the base branch from master to mv-step-context-partitions-def-methods September 6, 2024 20:07
@sryza sryza force-pushed the mv-step-context-partitions-def-methods branch from 5765906 to ba81226 Compare September 6, 2024 22:06
@sryza sryza force-pushed the assets-def-different-partitions-defs branch from 75b569c to dee20d5 Compare September 6, 2024 22:06
@sryza sryza force-pushed the mv-step-context-partitions-def-methods branch from ba81226 to d44afc0 Compare September 6, 2024 22:15
@sryza sryza force-pushed the assets-def-different-partitions-defs branch 2 times, most recently from af8f164 to 6ff1131 Compare September 6, 2024 23:42
Copy link
Contributor

@OwenKephart OwenKephart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you create an invalid run that targets multiple assets within the multi asset which have different partitions defs?

I assume it just gets caught by the same machinery that would catch this in the non-multi-asset case, but might be nice to have a unit test there.

Otherwise I think this seems great -- just commenting instead of approving as I'm sure others will want to weigh in.

Long-term-vision-wise, I'm imagining a world where a run can have an AssetGraphSubset attached to it and so we could have a context.get_partition_subset(asset_key) to figure out the specific partitions for a given asset or something, which this fits fine with.

@sryza
Copy link
Contributor Author

sryza commented Sep 10, 2024

What happens if you create an invalid run that targets multiple assets within the multi asset which have different partitions defs?

Oo good question. Added a block in a test:

    with pytest.raises(
        DagsterInvalidDefinitionError,
        match="Selected assets must have the same partitions definitions, but the selected assets ",
    ):
        materialize(assets=[my_assets], partition_key="b")

Long-term-vision-wise, I'm imagining a world where a run can have an AssetGraphSubset attached to it and so we could have a context.get_partition_subset(asset_key) to figure out the specific partitions for a given asset or something, which this fits fine with.

Totally

@sryza sryza force-pushed the assets-def-different-partitions-defs branch from 6ff1131 to c58bf99 Compare September 10, 2024 20:55
@sryza sryza force-pushed the mv-step-context-partitions-def-methods branch from d44afc0 to 9444922 Compare September 10, 2024 22:11
sryza added a commit that referenced this pull request Sep 11, 2024
## Summary & Motivation

This is a non-behavior-changing PR that refactors some methods within
the internal execution context classes. This aim of this PR is to
prepare for #23824, by
centralizing the logic for getting the `PartitionsDefinition` for an
execution.

Working on this made me aware of a behavioral weirdness that I think we
should consider addressing in 1.9: `context.partition_keys` etc. will
return a non-`None` value in steps that target no partitioned assets, if
other steps in the same run do target partitioned assets.

## How I Tested These Changes

## Changelog

NOCHANGELOG
Base automatically changed from mv-step-context-partitions-def-methods to master September 11, 2024 17:06
Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Queue management.

I think should include or have upstack a motivating example that would demonstrate how this would be used in the dbt use base.

@sryza
Copy link
Contributor Author

sryza commented Sep 25, 2024

@schrockn I'll put together that example, but FYI this is in a reviewable state.

@schrockn
Copy link
Member

Given that the selected assets must be the same partition definition, isn't it more straightforward/better for now (for the consumer at least, the multi_asset splitting would be complicated)) to create N multi_assets split up by N partition definitions? This makes it so that we still handle all arbitrary subset selections (which this scheme does not allow) but "gets the job done" for now for the dbt use case.

Long story short, I'm asking if we should wait to support this until we implement "Long-term-vision-wise, I'm imagining a world where a run can have an AssetGraphSubset attached to it and so we could have a context.get_partition_subset(asset_key) to figure out the specific partitions for a given asset or something, which this fits fine with." as @OwenKephart describes.

@sryza
Copy link
Contributor Author

sryza commented Sep 25, 2024

Here's a change we'd be able to make in purina if this (and related dbt support) were merged: https://github.com/dagster-io/internal/pull/11784.

isn't it more straightforward/better for now (for the consumer at least, the multi_asset splitting would be complicated)) to create N multi_assets split up by N partition definitions?

If I understand correctly, you mean don't make framework changes right now, but instead update the dbt integration etc. to create multiple multi-assets instead of just one? A couple limitations with that approach:

  • The @dbt_assets decorator currently produces a single AssetsDefinition object, so we don't have a straightforward way of returning multiple multi-assets. If we switched to my_dbt.build_defs, this would be less of a problem in the default case, but opens the question of what customization looks like.
  • Currently, non-partitioned assets and partitioned assets can be executed as part of a single Dagster run, as long as all the partitioned assets have the same partitioning. If this PR were to be merged, it would be possible to do that within a single step.

This makes it so that we still handle all arbitrary subset selections (which this scheme does not allow) but "gets the job done" for now for the dbt use case.

Are you able to say more about what you mean by "handle"? It will still be possible, after this PR, to select a set of differently-partitioned assets that have the same AssetsDefinition and click the Materialize button in the UI.

That all said, I don't think there are mega-pressing reasons we need to try to merge this now, if we have reservations about it. The two reasons I had been pressing on this:

  • Helps with a broader refactor I had been working on for how we represent asset definitions internally. However I don't have much bandwidth to move this forward in the near term.
  • Complaints from Nick Roach about using our dbt integration.

@schrockn
Copy link
Member

Are you able to say more about what you mean by "handle"? It will still be possible, after this PR, to select a set of differently-partitioned assets that have the same AssetsDefinition and click the Materialize button in the UI.

Screenshot 2024-09-26 at 12 56 25 PM

Hmmm I am getting confused by this sentence and and attached screenshot.

And yes for sure https://github.com/dagster-io/internal/pull/11784 is super motivating.

That all said, I don't think there are mega-pressing reasons we need to try to merge this now, if we have reservations about it.

I think this is worthy workstream. But the framework change is a big one. Just asking questions about alternatives.

And yeah I'm imagining a new entry point for this advanced use case that produces multiple AssetsDefinitions. Obviously downsides there in terms of surface area.

@sryza
Copy link
Contributor Author

sryza commented Sep 26, 2024

Hmmm I am getting confused by this sentence and and attached screenshot.

Ah yes reasonable confusion:

  • When you click the "Materialize" button in the UI in this situation, the execution will be launched as a backfill.
  • When you invoke materialize in Python, it will fail, because materialize can only launch a single run.

This is equivalent to the experience you'd get today if the assets were split across multiple AssetsDefinitions.

In an ideal future, the materialize function would be able to carry out execution that touches assets with different partitions (either by orchestrating an in-process backfill or via a single run that can handle different partitions).

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks for working through this one. Supportive of this!

I just have one outstanding question. This., IMO, is a very "advanced" API likely to be used by a small number of sophisticated users and integration authors.

Should we make this a "builder" method on AssetSpec similar to with_io_manager? I'm concerned about leading mainstream users astray with multi_asset, asset, and AssetSpec having the same variable name and causing a bunch of confusion when there are conflicts or whatever.

We could even make the builder method have a more specific name to very explicitly say its use case.

with_asset_specific_partitions_def or something (even though it is redundant admittedly)

I have a long docblock explaining its usage.

We would need some sort of principal around when to make stuff "builder methods" versus top-level property but just trying to solve the short-term problem.

@OwenKephart OwenKephart force-pushed the assets-def-different-partitions-defs branch from c58bf99 to e5c69f3 Compare December 19, 2024 19:18
Copy link
Contributor

OwenKephart commented Dec 19, 2024

@OwenKephart
Copy link
Contributor

@schrockn

Should we make this a "builder" method on AssetSpec similar to with_io_manager? I'm concerned about leading mainstream users astray with multi_asset, asset, and AssetSpec having the same variable name and causing a bunch of confusion when there are conflicts or whatever.

I think it's harder to do the builder approach with this particular parameter vis-a-vis with_io_manager, because partitions_def is a proper attribute of AssetSpec, whereas io_manager_key is just a fancy tag.

I agree that there's a risk of confusion when people are creating multi_assets etc., but in practice because it is a real property I think it might be even more confusing if we try to hide that parameter and force you to use a separate method

@OwenKephart OwenKephart force-pushed the assets-def-different-partitions-defs branch from e5c69f3 to f490efa Compare December 19, 2024 19:42
@OwenKephart OwenKephart requested a review from schrockn December 19, 2024 19:42
@schrockn
Copy link
Member

I think we should move forward with this. However to talk this out this does add multiple ways of doing things as well as invalid parameter combinations that are only caught at runtime.

In particular:

@multi_asset(specs=[
    spec.replace_attributes(partition_def=some_partition_def) 
    for spec in [AssetSpec("asset_one"), AssetSpec("asset_two")]
])
def an_asset() -> None: ...

and

@multi_asset(
    specs=[AssetSpec("asset_one"), AssetSpec("asset_two")], 
    partition_def=some_partition_def,
)
def an_asset() -> None: ...

I don't think it is worth deprecating partition_def on multi_asset but it would definitely feel more "pure."

That all being said, I support this path forward, just want to write out the tradeoffs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants