
New defaults for concat, merge, combine_* #10062


Open: wants to merge 28 commits into base: main
5c56acf
Remove default values in private functions
jsignell Feb 14, 2025
5461a9f
Use sentinel value to change default with warnings
jsignell Feb 24, 2025
e16834f
Remove unnecessary warnings
jsignell Feb 24, 2025
9c50125
Use old kwarg values within map_blocks, concat dataarray
jsignell Feb 25, 2025
b0cf17a
Merge branch 'main' into concat_default_kwargs
jsignell Feb 25, 2025
0026ee8
Switch options back to old defaults
jsignell Feb 26, 2025
4d4deda
Update tests and add new ones to exercise options
jsignell Feb 26, 2025
5a4036b
Merge branch 'main' into concat_default_kwargs
jsignell Mar 4, 2025
912638b
Use `emit_user_level_warning` rather than `warnings.warn`
jsignell Mar 4, 2025
67fd4ff
Change hardcoded defaults
jsignell Mar 4, 2025
4f38292
Fix up test_concat
jsignell Mar 4, 2025
51ccc89
Add comment about why we allow data_vars='minimial' for concat over d…
jsignell Mar 4, 2025
aa3180e
Tidy up tests based on review
jsignell Mar 4, 2025
93d2abc
Merge branch 'main' into concat_default_kwargs
jsignell Mar 7, 2025
e517dcc
Trying to resolve mypy issues
jsignell Mar 10, 2025
0e678e5
Fix mypy in tests
jsignell Mar 10, 2025
37f0147
Fix doctests
jsignell Mar 10, 2025
dac337c
Ignore warnings on error tests
jsignell Mar 10, 2025
a0c16c3
Merge branch 'main' into concat_default_kwargs
jsignell Mar 13, 2025
4eb275c
Use typing.get_args when possible
jsignell Mar 13, 2025
03f1502
Allow `minimal` in concat options at the type level
jsignell Mar 13, 2025
f1649b8
Merge branch 'main' into concat_default_kwargs
dcherian Mar 13, 2025
7dbdd4a
Minimal docs update
jsignell Mar 13, 2025
c6a557b
Tighten up language
jsignell Mar 13, 2025
9667857
Merge branch 'main' into concat_default_kwargs
jsignell Mar 13, 2025
42cf522
Merge branch 'main' into concat_default_kwargs
jsignell Mar 17, 2025
8d0d390
Merge branch 'main' into concat_default_kwargs
jsignell Apr 18, 2025
ba45599
Add to deprecated section of whats new
jsignell Apr 18, 2025
44 changes: 37 additions & 7 deletions doc/user-guide/combining.rst
@@ -43,7 +43,6 @@ new dimension by stacking lower dimensional arrays together:

.. ipython:: python

da.sel(x="a")
xr.concat([da.isel(x=0), da.isel(x=1)], "x")

If the second argument to ``concat`` is a new dimension name, the arrays will
@@ -52,15 +51,18 @@ dimension:

.. ipython:: python

xr.concat([da.isel(x=0), da.isel(x=1)], "new_dim")
da0 = da.isel(x=0).drop_vars("x")
da1 = da.isel(x=1).drop_vars("x")

xr.concat([da0, da1], "new_dim")
Contributor Author:

Dropping the overlapping "x" means that you don't get a future warning anymore and the outcome won't change with the new defaults. It seemed to me like it was maintaining the spirit of the docs.

Contributor:

I'd change to `xr.concat([da.isel(x=[0]), da.isel(x=[1])], dim="new_dim")`. I think that preserves the spirit, and gets users closer to what we'd like them to type and understand.

Contributor Author:

That one will give a FutureWarning about how join is going to change:

In [3]:  xr.concat([da.isel(x=[0]), da.isel(x=[1])], "new_dim")
<ipython-input-3-8d3fee24c8e4>:1: FutureWarning: In a future version of xarray the default value for join will change from join='outer' to join='exact'. This change will result in the following ValueError:cannot be aligned with join='exact' because index/labels/sizes are not equal along these coordinates (dimensions): 'x' ('x',) The recommendation is to set join explicitly for this case.
  xr.concat([da.isel(x=[0]), da.isel(x=[1])], "new_dim")
Out[3]: 
<xarray.DataArray (new_dim: 2, x: 2, y: 3)> Size: 96B
array([[[ 0.,  1.,  2.],
        [nan, nan, nan]],

       [[nan, nan, nan],
        [ 3.,  4.,  5.]]])
Coordinates:
  * x        (x) <U1 8B 'a' 'b'
  * y        (y) int64 24B 10 20 30
Dimensions without coordinates: new_dim

We can add an explicit join value to get rid of the warning, or we can let the docs build with the warning (I think that is not a good idea, because warnings in the docs might scare people).
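For reference, a runnable sketch of the explicit-join variant; the `da` construction is an assumption here, reconstructed to match the output shown in this thread:

```python
import numpy as np
import xarray as xr

# Reconstruction of the `da` used in the combining docs (an assumption):
da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords=[("x", ["a", "b"]), ("y", [10, 20, 30])],
)

# Passing join explicitly keeps the outer-join result shown in this thread
# and avoids the FutureWarning during the deprecation cycle:
result = xr.concat([da.isel(x=[0]), da.isel(x=[1])], dim="new_dim", join="outer")
```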

Contributor Author:

Compare that with the example as it is on main:

In [3]:  xr.concat([da.isel(x=0), da.isel(x=1)], "new_dim")
<ipython-input-8-5e17a4052d18>:1: FutureWarning: In a future version of xarray the default value for coords will change from coords='different' to coords='minimal'. This is likely to lead to different results when multiple datasets have matching variables with overlapping values. To opt in to new defaults and get rid of these warnings now use `set_options(use_new_combine_kwarg_defaults=True) or set coords explicitly.
  xr.concat([da.isel(x=0), da.isel(x=1)], "new_dim")
Out[3]: 
<xarray.DataArray (new_dim: 2, y: 3)> Size: 48B
array([[0, 1, 2],
       [3, 4, 5]])
Coordinates:
    x        (new_dim) <U1 8B 'a' 'b'
  * y        (y) int64 24B 10 20 30
Dimensions without coordinates: new_dim


The second argument to ``concat`` can also be an :py:class:`~pandas.Index` or
:py:class:`~xarray.DataArray` object as well as a string, in which case it is
used to label the values along the new dimension:

.. ipython:: python

xr.concat([da.isel(x=0), da.isel(x=1)], pd.Index([-90, -100], name="new_dim"))
xr.concat([da0, da1], pd.Index([-90, -100], name="new_dim"))
Contributor:

Same here.


Of course, ``concat`` also works on ``Dataset`` objects:

@@ -75,6 +77,12 @@ between datasets. With the default parameters, xarray will load some coordinate
variables into memory to compare them between datasets. This may be prohibitively
expensive if you are manipulating your dataset lazily using :ref:`dask`.

.. note::

In a future version of xarray the default values for many of these options
will change. You can opt into the new default values early using
``xr.set_options(use_new_combine_kwarg_defaults=True)``.
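Until the option is widely available, an equivalent way to future-proof a call site (a sketch, not taken from the PR) is to pass the new default values explicitly:

```python
import xarray as xr

ds1 = xr.Dataset({"a": ("x", [1.0, 2.0])}, coords={"x": [0, 1]})
ds2 = xr.Dataset({"a": ("x", [3.0, 4.0])}, coords={"x": [2, 3]})

# Spelling out the future defaults gives the same result before and after
# the change, with no FutureWarning either way:
combined = xr.concat(
    [ds1, ds2],
    dim="x",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    join="exact",
)
```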

.. _merge:

Merge
@@ -94,10 +102,18 @@ If you merge another dataset (or a dictionary including data array objects), by
default the resulting dataset will be aligned on the **union** of all index
coordinates:

.. note::

In a future version of xarray the default value for ``join`` and ``compat``
will change. This change will mean that xarray will no longer attempt
to align the indices of the merged dataset. You can opt into the new default
values early using ``xr.set_options(use_new_combine_kwarg_defaults=True)``.
Or explicitly set ``join='outer'`` to preserve old behavior.

.. ipython:: python

other = xr.Dataset({"bar": ("x", [1, 2, 3, 4]), "x": list("abcd")})
xr.merge([ds, other])
xr.merge([ds, other], join="outer")
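For contrast, a self-contained sketch of how the old and new ``join`` defaults differ (the data here is invented to mirror the docs example):

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, 2.0, 3.0])}, coords={"x": ["a", "b", "c"]})
other = xr.Dataset({"bar": ("x", [1, 2, 3, 4])}, coords={"x": ["a", "b", "c", "d"]})

# join="outer" (the old default) aligns on the union of the indexes,
# padding "foo" with NaN at "d"; join="exact" (the new default) would
# raise an alignment error here because the indexes differ.
merged = xr.merge([ds, other], join="outer")
```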

This ensures that ``merge`` is non-destructive. ``xarray.MergeError`` is raised
if you attempt to merge two variables with the same name but different values:
@@ -114,6 +130,16 @@ if you attempt to merge two variables with the same name but different values:
array([[ 1.4691123 , 0.71713666, -0.5090585 ],
[-0.13563237, 2.21211203, 0.82678535]])

.. note::

In a future version of xarray the default value for ``compat`` will change
from ``compat='no_conflicts'`` to ``compat='override'``. In this scenario
the values in the first object override all the values in other objects.

.. ipython:: python

xr.merge([ds, ds + 1], compat="override")

The same non-destructive merging between ``DataArray`` index coordinates is
used in the :py:class:`~xarray.Dataset` constructor:

@@ -144,6 +170,11 @@ For datasets, ``ds0.combine_first(ds1)`` works similarly to
there are conflicting values in variables to be merged, whereas
``.combine_first`` defaults to the calling object's values.

.. note::

In a future version of xarray the default options for ``xr.merge`` will change
such that the behavior matches ``combine_first``.
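A minimal illustration of the ``combine_first`` precedence described above (example data invented here):

```python
import numpy as np
import xarray as xr

ds0 = xr.Dataset({"a": ("x", [1.0, np.nan])}, coords={"x": [0, 1]})
ds1 = xr.Dataset({"a": ("x", [9.0, 2.0])}, coords={"x": [0, 1]})

# The calling object's values win wherever they are not missing;
# gaps (NaN) are filled from the argument:
filled = ds0.combine_first(ds1)
```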

.. _update:

Update
@@ -236,7 +267,7 @@ coordinates as long as any non-missing values agree or are disjoint:

ds1 = xr.Dataset({"a": ("x", [10, 20, 30, np.nan])}, {"x": [1, 2, 3, 4]})
ds2 = xr.Dataset({"a": ("x", [np.nan, 30, 40, 50])}, {"x": [2, 3, 4, 5]})
xr.merge([ds1, ds2], compat="no_conflicts")
xr.merge([ds1, ds2], join="outer", compat="no_conflicts")

Note that due to the underlying representation of missing values as floating
point numbers (``NaN``), variable data type is not always preserved when merging
@@ -295,13 +326,12 @@ they are concatenated in order based on the values in their dimension
coordinates, not on their position in the list passed to ``combine_by_coords``.

.. ipython:: python
:okwarning:

x1 = xr.DataArray(name="foo", data=np.random.randn(3), coords=[("x", [0, 1, 2])])
x2 = xr.DataArray(name="foo", data=np.random.randn(3), coords=[("x", [3, 4, 5])])
xr.combine_by_coords([x2, x1])

These functions can be used by :py:func:`~xarray.open_mfdataset` to open many
These functions are used by :py:func:`~xarray.open_mfdataset` to open many
files as one dataset. The particular function used is specified by setting the
argument ``'combine'`` to ``'by_coords'`` or ``'nested'``. This is useful for
situations where your data is split across many files in multiple locations,
2 changes: 1 addition & 1 deletion doc/user-guide/terminology.rst
@@ -217,7 +217,7 @@ complete examples, please consult the relevant documentation.*
)

# combine the datasets
combined_ds = xr.combine_by_coords([ds1, ds2])
combined_ds = xr.combine_by_coords([ds1, ds2], join="outer")
combined_ds

lazy
23 changes: 18 additions & 5 deletions doc/whats-new.rst
@@ -32,6 +32,15 @@ Breaking changes
Deprecations
~~~~~~~~~~~~

- Start deprecation cycle for changing the default keyword arguments to ``concat``, ``merge``, ``combine``, ``open_mfdataset``.
Emits a ``FutureWarning`` when using old defaults and new defaults would result in different behavior.
Adds an option: ``use_new_combine_kwarg_defaults`` to opt in to new defaults immediately.
New values are:
- ``data_vars``: "minimal"
- ``coords``: "minimal"
- ``compat``: "override"
- ``join``: "exact"
By `Julia Signell <https://github.com/jsignell>`_.
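During the deprecation cycle, a generic sketch (not part of the PR) for finding call sites that still rely on the old defaults is to promote the warnings to errors; calls that pass the kwargs explicitly keep working:

```python
import warnings
import xarray as xr

ds = xr.Dataset({"a": ("x", [1, 2])}, coords={"x": [0, 1]})

# Treat FutureWarning as an error to surface affected call sites; this call
# passes join/compat explicitly, so it succeeds on old and new defaults alike:
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    merged = xr.merge([ds, ds], join="outer", compat="no_conflicts")
```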

Bug fixes
~~~~~~~~~
@@ -8028,13 +8037,17 @@ Backwards incompatible changes
Now, the default always concatenates data variables:

.. ipython:: python
:suppress:

ds = xray.Dataset({"x": 0})
:verbatim:

.. ipython:: python
In [1]: ds = xray.Dataset({"x": 0})

xray.concat([ds, ds], dim="y")
In [2]: xray.concat([ds, ds], dim="y")
Out[2]:
<xarray.Dataset> Size: 16B
Dimensions: (y: 2)
Dimensions without coordinates: y
Data variables:
x (y) int64 16B 0 0

To obtain the old behavior, supply the argument ``concat_over=[]``.

25 changes: 16 additions & 9 deletions xarray/backends/api.py
@@ -34,7 +34,7 @@
)
from xarray.backends.locks import _get_scheduler
from xarray.coders import CFDatetimeCoder, CFTimedeltaCoder
from xarray.core import indexing
from xarray.core import dtypes, indexing
from xarray.core.dataarray import DataArray
from xarray.core.dataset import Dataset
from xarray.core.datatree import DataTree
@@ -50,6 +50,13 @@
_nested_combine,
combine_by_coords,
)
from xarray.util.deprecation_helpers import (
_COMPAT_DEFAULT,
_COORDS_DEFAULT,
_DATA_VARS_DEFAULT,
_JOIN_DEFAULT,
CombineKwargDefault,
)

if TYPE_CHECKING:
try:
@@ -1404,14 +1411,16 @@ def open_mfdataset(
| Sequence[Index]
| None
) = None,
compat: CompatOptions = "no_conflicts",
compat: CompatOptions | CombineKwargDefault = _COMPAT_DEFAULT,
preprocess: Callable[[Dataset], Dataset] | None = None,
engine: T_Engine | None = None,
data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
coords="different",
data_vars: Literal["all", "minimal", "different"]
| list[str]
| CombineKwargDefault = _DATA_VARS_DEFAULT,
coords=_COORDS_DEFAULT,
combine: Literal["by_coords", "nested"] = "by_coords",
parallel: bool = False,
join: JoinOptions = "outer",
join: JoinOptions | CombineKwargDefault = _JOIN_DEFAULT,
attrs_file: str | os.PathLike | None = None,
combine_attrs: CombineAttrsOptions = "override",
**kwargs,
@@ -1598,9 +1607,6 @@ def open_mfdataset(

paths1d: list[str | ReadBuffer]
if combine == "nested":
if isinstance(concat_dim, str | DataArray) or concat_dim is None:
concat_dim = [concat_dim] # type: ignore[assignment]

# This creates a flat list which is easier to iterate over, whilst
# encoding the originally-supplied structure as "ids".
# The "ids" are not used at all if combine='by_coords`.
@@ -1649,13 +1655,14 @@
# along each dimension, using structure given by "ids"
combined = _nested_combine(
datasets,
concat_dims=concat_dim,
concat_dim=concat_dim,
compat=compat,
data_vars=data_vars,
coords=coords,
ids=ids,
join=join,
combine_attrs=combine_attrs,
fill_value=dtypes.NA,
)
elif combine == "by_coords":
# Redo ordering from coordinates, ignoring how they were ordered
21 changes: 17 additions & 4 deletions xarray/core/dataset.py
@@ -121,7 +121,13 @@
merge_coordinates_without_align,
merge_data_and_coords,
)
from xarray.util.deprecation_helpers import _deprecate_positional_args, deprecate_dims
from xarray.util.deprecation_helpers import (
_COMPAT_DEFAULT,
_JOIN_DEFAULT,
CombineKwargDefault,
_deprecate_positional_args,
deprecate_dims,
)

if TYPE_CHECKING:
from dask.dataframe import DataFrame as DaskDataFrame
@@ -5279,7 +5285,14 @@ def stack_dataarray(da):

# concatenate the arrays
stackable_vars = [stack_dataarray(da) for da in self.data_vars.values()]
data_array = concat(stackable_vars, dim=new_dim)
data_array = concat(
stackable_vars,
dim=new_dim,
data_vars="all",
coords="different",
compat="equals",
join="outer",
)

if name is not None:
data_array.name = name
@@ -5523,8 +5536,8 @@ def merge(
self,
other: CoercibleMapping | DataArray,
overwrite_vars: Hashable | Iterable[Hashable] = frozenset(),
compat: CompatOptions = "no_conflicts",
join: JoinOptions = "outer",
compat: CompatOptions | CombineKwargDefault = _COMPAT_DEFAULT,
join: JoinOptions | CombineKwargDefault = _JOIN_DEFAULT,
fill_value: Any = xrdtypes.NA,
combine_attrs: CombineAttrsOptions = "override",
) -> Self:
18 changes: 16 additions & 2 deletions xarray/core/groupby.py
@@ -1608,7 +1608,14 @@ def _combine(self, applied, shortcut=False):
if shortcut:
combined = self._concat_shortcut(applied, dim, positions)
else:
combined = concat(applied, dim)
combined = concat(
applied,
dim,
data_vars="all",
coords="different",
compat="equals",
join="outer",
)
Contributor Author:

I hard-coded these to the old defaults since there is no way for the user to set them.

Contributor:

I agree with this approach. These options result in confusing groupby behaviour (#2145), but we can tackle that later.

combined = _maybe_reorder(combined, dim, positions, N=self.group1d.size)

if isinstance(combined, type(self._obj)):
@@ -1768,7 +1775,14 @@ def _combine(self, applied):
"""Recombine the applied objects like the original."""
applied_example, applied = peek_at(applied)
dim, positions = self._infer_concat_args(applied_example)
combined = concat(applied, dim)
combined = concat(
applied,
dim,
data_vars="all",
coords="different",
compat="equals",
join="outer",
)
combined = _maybe_reorder(combined, dim, positions, N=self.group1d.size)
# assign coord when the applied function does not return that coord
if dim not in applied_example.dims:
13 changes: 13 additions & 0 deletions xarray/core/options.py
@@ -29,6 +29,7 @@
"keep_attrs",
"warn_for_unclosed_files",
"use_bottleneck",
"use_new_combine_kwarg_defaults",
"use_numbagg",
"use_opt_einsum",
"use_flox",
@@ -57,6 +58,7 @@ class T_Options(TypedDict):
warn_for_unclosed_files: bool
use_bottleneck: bool
use_flox: bool
use_new_combine_kwarg_defaults: bool
use_numbagg: bool
use_opt_einsum: bool

@@ -84,6 +86,7 @@ class T_Options(TypedDict):
"warn_for_unclosed_files": False,
"use_bottleneck": True,
"use_flox": True,
"use_new_combine_kwarg_defaults": False,
"use_numbagg": True,
"use_opt_einsum": True,
}
@@ -113,6 +116,7 @@ def _positive_integer(value: Any) -> bool:
"file_cache_maxsize": _positive_integer,
"keep_attrs": lambda choice: choice in [True, False, "default"],
"use_bottleneck": lambda value: isinstance(value, bool),
"use_new_combine_kwarg_defaults": lambda value: isinstance(value, bool),
"use_numbagg": lambda value: isinstance(value, bool),
"use_opt_einsum": lambda value: isinstance(value, bool),
"use_flox": lambda value: isinstance(value, bool),
@@ -250,6 +254,15 @@ class set_options:
use_flox : bool, default: True
Whether to use ``numpy_groupies`` and `flox`` to
accelerate groupby and resampling reductions.
use_new_combine_kwarg_defaults : bool, default False
Whether to use new kwarg default values for combine functions:
:py:func:`~xarray.concat`, :py:func:`~xarray.merge`,
:py:func:`~xarray.open_mfdataset`. New values are:

* ``data_vars``: "minimal"
* ``coords``: "minimal"
* ``compat``: "override"
* ``join``: "exact"
use_numbagg : bool, default: True
Whether to use ``numbagg`` to accelerate reductions.
Takes precedence over ``use_bottleneck`` when both are True.
14 changes: 11 additions & 3 deletions xarray/core/parallel.py
@@ -351,7 +351,9 @@ def _wrapper(
result = func(*converted_args, **kwargs)

merged_coordinates = merge(
[arg.coords for arg in args if isinstance(arg, Dataset | DataArray)]
[arg.coords for arg in args if isinstance(arg, Dataset | DataArray)],
join="exact",
compat="override",
).coords

# check all dims are present
@@ -439,7 +441,11 @@ def _wrapper(
# rechunk any numpy variables appropriately
xarray_objs = tuple(arg.chunk(arg.chunksizes) for arg in xarray_objs)

merged_coordinates = merge([arg.coords for arg in aligned]).coords
merged_coordinates = merge(
[arg.coords for arg in aligned],
join="exact",
compat="override",
).coords

_, npargs = unzip(
sorted(
@@ -472,7 +478,9 @@ def _wrapper(
)

coordinates = merge(
(preserved_coords, template.coords.to_dataset()[new_coord_vars])
(preserved_coords, template.coords.to_dataset()[new_coord_vars]),
join="outer",
compat="override",
).coords
output_chunks: Mapping[Hashable, tuple[int, ...]] = {
dim: input_chunks[dim] for dim in template.dims if dim in input_chunks