Stricter defaults for concat, combine, open_mfdataset, merge #8778

Open
@dcherian

Is your feature request related to a problem?

The defaults for concat are excessively permissive: data_vars="all", coords="different", compat="no_conflicts", join="outer". This comment illustrates why the resulting behaviour can be hard to predict or understand: a seemingly unrelated option, decode_cf, controls whether a variable ends up in data_vars or coords, and so can produce wildly different concatenation behaviour.

  1. data_vars="all" always concatenates data variables along concat_dim, even if they did not have that dimension to begin with.
  2. With coords="different", if the same coordinate variable exists in different datasets/files, the copies are compared sequentially for equality to decide whether they get concatenated.
  3. join="outer" (applied along all dimensions that are not concat_dim) can produce very large datasets because of small floating-point differences in the indexes, and also behaves questionably with staggered-grid datasets.
  4. compat="no_conflicts" basically picks the first non-NaN value after aligning all datasets, but is quite slow (I think we should be using duck_array_ops.nanfirst here).
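
A minimal sketch of points 1 and 3, using made-up toy datasets (the names make_ds, temp, and mask are hypothetical, and the 1e-12 offset is only an assumed stand-in for floating-point noise between files). The permissive options are passed explicitly so the result does not depend on whichever defaults the installed xarray version uses:

```python
import numpy as np
import xarray as xr


def make_ds(t, x):
    """Toy dataset: `temp` has the concat dimension, `mask` does not."""
    return xr.Dataset(
        {
            "temp": (("time", "x"), np.ones((1, x.size))),
            "mask": (("x",), np.ones(x.size)),
        },
        coords={"time": [t], "x": x},
    )


x = np.array([0.0, 0.1, 0.2])
ds0 = make_ds(0, x)
ds1 = make_ds(1, x + 1e-12)  # "same" grid, up to floating-point noise

# The permissive options discussed above, spelled out explicitly.
permissive = xr.concat(
    [ds0, ds1],
    dim="time",
    data_vars="all",
    coords="different",
    compat="no_conflicts",
    join="outer",
)

# Point 1: `mask` gains a `time` dimension it never had.
print(permissive["mask"].dims)

# Point 3: the outer join treats the noisy x values as distinct labels,
# so `x` doubles in size and `temp` is padded with NaN.
print(permissive.sizes["x"])
print(int(permissive["temp"].isnull().sum()))
```

Nothing here warns the user that the x index silently doubled, which is what makes this kind of blow-up hard to diagnose on real files.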

While "convenient", these defaults make the default experience quite bad, with hard-to-understand slowdowns.

Describe the solution you'd like

I propose we migrate to data_vars="minimal", coords="minimal", join="exact", compat="override". This should:

  1. Only concatenate data_vars and coords variables when they already have concat_dim.
  2. Blindly pick any variables that do not have concat_dim from the first file.
  3. Prevent ballooning of dimension sizes due to floating-point inequalities, via join="exact".
  4. Totally avoid any data reads unless explicitly requested by the user.
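
As a rough illustration of the proposed options, here is a sketch on hypothetical in-memory datasets (temp, mask, and strict_kwargs are made-up names, and mask deliberately differs between the two inputs to show the "pick from the first" behaviour). It does not demonstrate point 4, since nothing here is lazily loaded from disk:

```python
import numpy as np
import xarray as xr

x = np.array([0.0, 0.1, 0.2])
ds0 = xr.Dataset(
    {"temp": (("time", "x"), np.ones((1, 3))), "mask": (("x",), np.ones(3))},
    coords={"time": [0], "x": x},
)
ds1 = xr.Dataset(
    {"temp": (("time", "x"), np.ones((1, 3))), "mask": (("x",), np.zeros(3))},
    coords={"time": [1], "x": x},
)

# The stricter options proposed above, passed explicitly.
strict_kwargs = dict(
    data_vars="minimal", coords="minimal", join="exact", compat="override"
)

strict = xr.concat([ds0, ds1], dim="time", **strict_kwargs)

# `temp` already has `time`, so it is concatenated; `mask` does not, so it
# keeps its original dims and is taken from ds0 without being compared.
print(strict["temp"].dims, strict["mask"].dims)
print(strict["mask"].values)

# With join="exact", a slightly shifted index is an error rather than a
# silently enlarged dimension.
ds1_shifted = ds1.assign_coords(x=x + 1e-12)
try:
    xr.concat([ds0, ds1_shifted], dim="time", **strict_kwargs)
    exact_join_raised = False
except ValueError:
    exact_join_raised = True
print(exact_join_raised)
```

The trade-off is visible in mask: the differing values in the second dataset are dropped without any check, so users who need that comparison would have to ask for it explicitly.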

Unfortunately, this has a pretty big blast radius so we'd need a long deprecation cycle.

Describe alternatives you've considered

No response

Additional context

xref #4824
xref #1385
xref #8231
xref #5381
xref #2064
xref #2217
