Is your feature request related to a problem?
The defaults for `concat` are excessively permissive: `data_vars="all", coords="different", compat="no_conflicts", join="outer"`. This comment illustrates why this can be hard to predict or understand: a seemingly unrelated option, `decode_cf`, controls whether a variable is in `data_vars` or `coords`, and can result in wildly different concatenation behaviour.
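As a rough illustration of that coupling (a constructed example, not taken from the linked comment; the `temp`/`height` variables are made up), the same variable ends up in `data_vars` or `coords` depending on whether CF decoding runs:

```python
import xarray as xr

# A dataset roughly as it would look with decode_cf=False:
# "height" is a plain data variable, referenced by a CF "coordinates" attribute.
raw = xr.Dataset(
    {
        "temp": (("time",), [1.0, 2.0], {"coordinates": "height"}),
        "height": ((), 10.0),
    },
    coords={"time": [0, 1]},
)

decoded = xr.decode_cf(raw)  # roughly what decode_cf=True does on open

print("height" in raw.data_vars)   # True -> concat handles it under data_vars="all"
print("height" in decoded.coords)  # True -> concat handles it under coords="different"
```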
- This always concatenates `data_vars` along `concat_dim`, even if they did not have that dimension to begin with.
- If the same coordinate variable exists in different datasets/files, the copies are sequentially compared for equality to decide whether they get concatenated.
- The outer join (applied along all dimensions that are not `concat_dim`) can result in very large datasets due to small floating point differences in the indexes, and also in questionable behaviour with staggered grid datasets.
- `"no_conflicts"` basically picks the first non-NaN value after aligning all datasets, but is quite slow (we should be using `duck_array_ops.nanfirst` here, I think).
While "convenient" this really just makes the default experience quite bad with hard-to-understand slowdowns.
Describe the solution you'd like
I propose we migrate to `data_vars="minimal", coords="minimal", join="exact", compat="override"`. This should:

- only concatenate `data_vars` and `coords` variables when they already have `concat_dim`;
- for any variables that do not have `concat_dim`, blindly pick them from the first file;
- prevent ballooning of dimension sizes due to floating point inequalities, thanks to `join="exact"`;
- totally avoid any data reads unless explicitly requested by the user.
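Spelled out with today's keyword arguments (reusing the made-up `ds1`/`ds2` from the sketch above), the proposed behaviour looks like this:

```python
# The nearly-equal "x" indexes now raise instead of silently growing the dataset:
try:
    xr.concat(
        [ds1, ds2],
        dim="time",
        data_vars="minimal",  # only concatenate variables that already have "time"
        coords="minimal",
        join="exact",         # error on mismatched non-concat indexes
        compat="override",    # take non-concatenated variables from the first dataset
    )
except ValueError as err:
    print(err)  # points at the differing "x" index

# With consistent coordinates, variables without "time" are left untouched:
out = xr.concat(
    [ds1, ds2.assign_coords(x=ds1["x"])],
    dim="time",
    data_vars="minimal", coords="minimal", join="exact", compat="override",
)
print(out["station_name"].dims)  # () -> taken from ds1, never expanded along "time"
```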
Unfortunately, this has a pretty big blast radius, so we'd need a long deprecation cycle.
Describe alternatives you've considered
No response
Additional context
xref #4824
xref #1385
xref #8231
xref #5381
xref #2064
xref #2217