QuantileDeltaMapping per month seems to be significantly slower than adjusting the months individually #1747

saschahofmann · 2024-05-08T14:38:49Z

Setup Information

Xclim version:
0.48.2

Context

I am using xclim's QuantileDeltaMapping for bias adjustment and see significant performance differences when I use a monthly grouping (group='time.month') compared to performing the adjustment in a loop per month. I understand that there is some logic that adds an extra month for circularity, but in my head that doesn't add up to the differences. In my real-world problem, the loop is 16x faster than using group='time.month'. In my example below it's only 2x.

I modified the example from the docs a little to create the data

import numpy as np
import xarray as xr


# Create toy data to explore bias adjustment, here fake temperature timeseries
t = xr.cftime_range("1990-01-01", "2030-12-31", freq="D", calendar="noleap")
lon = np.arange(0, 30, 0.5)
lat = np.arange(10, 40, 0.5)
rp = np.array([10, 20, 50, 100, 200, 500])
ref = xr.DataArray(
    (
        -20
        * np.cos(2 * np.pi * t.dayofyear / 365)[:, np.newaxis, np.newaxis, np.newaxis]
        + 2 * np.random.random_sample((t.size, lon.size, lat.size, rp.size))
        + 273.15
        + (0.1 * (t - t[0]).days.values / 365)[:, np.newaxis, np.newaxis, np.newaxis]
    ),  # "warming" of 1K per decade,
    dims=("time", "lon", "lat", "return_period"),
    coords={"time": t, "lon": lon, "lat": lat, "return_period": rp},
    attrs={"units": "K"},
)
sim = xr.DataArray(
    (
        -18
        * np.cos(2 * np.pi * t.dayofyear / 365)[:, np.newaxis, np.newaxis, np.newaxis]
        + 2 * np.random.random_sample((t.size, lon.size, lat.size, rp.size))
        + 273.15
        + (0.11 * (t - t[0]).days.values / 365)[:, np.newaxis, np.newaxis, np.newaxis]
    ),  # "warming" of 1.1K per decade
    dims=("time", "lon", "lat", "return_period"),
    coords={"time": t, "lon": lon, "lat": lat, "return_period": rp},
    attrs={"units": "K"},
)

The straight-forward approach I'd like to use takes around 3min:

from xclim import sdba

qdm = sdba.QuantileDeltaMapping.train(
    ref,
    sim,
    group="time.month",
)
sim_adj1 = qdm.adjust(sim)

If I do the same with a loop instead like below, it takes around 1.6min:

qdms = []
datasets = []
for month in range(1, 13):
    ref_m = ref.where(ref.time.dt.month == month, drop=True)
    sim_m = sim.where(sim.time.dt.month == month, drop=True)
    qdm = sdba.QuantileDeltaMapping.train(
        ref_m,
        sim_m,
    )
    qdms.append(qdm)
    datasets.append(qdm.adjust(sim_m))

sim_adj2 = xr.concat(datasets, dim="time")
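
One detail to keep in mind when comparing the two results: concatenating per-month pieces yields a time axis grouped by month (all Januaries first, then all Februaries, and so on) rather than a chronological one. The NumPy-only toy below illustrates this and how sorting restores the order. The arrays and names are made up for illustration and are not part of the benchmark.

```python
import numpy as np

# Toy stand-in for a time axis: three years of monthly steps in chronological order.
months = np.tile(np.arange(1, 13), 3)
time_index = np.arange(months.size)        # chronological position of each step
values = time_index.astype(float) * 10.0   # pretend these are adjusted values

# Processing month by month and concatenating groups all Januaries first, etc.
concat_idx = np.concatenate([time_index[months == m] for m in range(1, 13)])
concat_val = np.concatenate([values[months == m] for m in range(1, 13)])

# Sorting by the time coordinate restores chronological order.
order = np.argsort(concat_idx)
restored = concat_val[order]
```

With xarray objects the equivalent step would be sim_adj2.sortby("time") before comparing against sim_adj1.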

Is there an explanation for why it's like this, or a way I could improve the performance of the first approach?

Steps To Reproduce

No response

coxipi · 2024-05-08T14:54:04Z

coxipi
May 8, 2024
Collaborator

Hi Sacha, thanks for this nice investigation.

Two things:

interpolation

Say you obtain adjustment factors for January and for February: how should we bias adjust January 31st? Your implementation with a loop over months will apply the January adjustment factors to January 31st, correct? What xclim does is interpolate between the results of January and February. I'm not 100% sure this causes the slowdown (a 16x slowdown seems a bit extreme for a simple interpolation), but it's useful to know the difference between those two codes right away!

map_blocks
This might be related to map_blocks, which splits the data into the monthly groups in this case. We have already investigated how it can be slow compared to other grouping strategies. I think it's more likely that the slowdown comes from this than from the interpolation.

So I think this might touch on a problem we are already aware of, but we don't have a definitive answer on how to improve things yet.
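
To make the interpolation point concrete, here is a toy NumPy sketch (not xclim's implementation) contrasting a "nearest" lookup, which behaves like the per-month loop, with a linear interpolation between month centres. The factor values are hypothetical, months are assumed to have equal length for simplicity, and one month is padded on each side to handle the December/January wrap-around.

```python
import numpy as np

# Hypothetical additive adjustment factors, one per calendar month (not xclim output).
month_factors = np.array([1.0, 3.0, 2.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5])


def fractional_month(dayofyear, days_in_year=365):
    """Continuous month coordinate; k + 0.5 is the centre of the k-th month (0-based).

    Assumes equal-length months, which is a simplification.
    """
    return 12 * (dayofyear - 0.5) / days_in_year


def factor_nearest(dayofyear):
    """Use the factor of the month the day falls in (what a per-month loop does)."""
    idx = int(np.floor(fractional_month(dayofyear))) % 12
    return float(month_factors[idx])


def factor_linear(dayofyear):
    """Interpolate linearly between month centres, padded for circularity."""
    centres = np.arange(-0.5, 13.0, 1.0)  # -0.5, 0.5, ..., 12.5
    padded = np.concatenate([month_factors[-1:], month_factors, month_factors[:1]])
    return float(np.interp(fractional_month(dayofyear), centres, padded))
```

With these made-up factors, day 30 gets the January factor 1.0 under the nearest lookup but roughly 1.94 under linear interpolation, because it sits almost halfway between the January and February centres.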

8 replies

coxipi May 8, 2024
Collaborator

I have a very rough code that pre-organizes data in groups of interest:

def _get_group_complement(da, group):
    # complement of "dayofyear": "year", etc.
    gr = group.name if isinstance(group, sdba.Grouper) else group
    if gr == "time.dayofyear":
        return da.time.dt.year
    if gr == "time.month":
        return da.time.dt.strftime("%Y-%d")


def get_windowed_group(da, group):
    r"""Splits an input array into `group`, its complement, and expands the array along a rolling `window` dimension.

    Aims to give a faster alternative to `map_blocks` constructions.

    """
    group = group if isinstance(group, sdba.Grouper) else sdba.Grouper(group, 1)
    gr, win = group.name, group.window
    gr_dim = gr.split(".")[-1]
    complement_dims = []
    if win > 1:
        win_dim = get_temp_dimname(da.dims, "window_dim")
        da = da.rolling(time=win, center=True).construct(window_dim=win_dim)
        complement_dims.append(win_dim)

    if gr in ["time.month", "time.dayofyear"]:
        gr_complement_dim = gr_dim + "_complement"
        da = da.groupby(gr).apply(
            lambda da: da.assign_coords(time=_get_group_complement(da, gr)).rename(
                {"time": gr_complement_dim}
            )
        )
        complement_dims.append(gr_complement_dim)
        time_dims = complement_dims + [gr_dim]
        # chunking could be removed?
        da = da.chunk({gr_dim: -1, complement_dims[-1]: -1})
    else:
        complement_dims.append(gr_dim)
        gr_dim = None
        time_dims = complement_dims
    return da.assign_attrs(
        {
            "group_dim": gr_dim,
            "complement_dims": complement_dims,
            "time_dims": time_dims,
        }
    )

def ungroup(gr_da, group, template_time):
    r"""Inverse the operation done with :py:func:`get_windowed_group`. Only works if `window` is 1."""
    if isinstance(group, sdba.Grouper):
        gr = group.name
        if group.window > 1:
            raise ValueError("Ungrouping with window > 1 is not supported")
    else:
        gr = group

    if gr == "time":
        return gr_da
    grouped_time = get_windowed_group(template_time[{d:0 for d in template_time.dims if d != "time"}], gr)
    td = gr_da.attrs["time_dims"]
    da = (
        gr_da.stack(time=td)
        .drop_vars(td)
        .assign_coords(time=grouped_time.values.ravel())
    )
    return da.where(da.time.notnull(), drop=True)

For instance, splitting a DataArray in month / month_complement (YYYY-DD)

from xclim.testing import open_dataset
ds = open_dataset("sdba/CanESM2_1950-2100.nc")
tx = ds.tasmax
get_windowed_group(tx, "time.month")
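
The idea behind the month / month_complement split can be sketched with plain NumPy. This is a simplified illustration rather than the helper above: it assumes a made-up 360-day calendar (12 months of 30 days) so that every month has the same length and no NaN padding is needed.

```python
import numpy as np

# Toy daily series over two 360-day years (an assumption to keep shapes regular).
n_years, n_months, n_days = 2, 12, 30
values = np.arange(n_years * n_months * n_days, dtype=float)

# time -> (year, month, day) -> (month, year * day): the group and its complement.
by_month = (
    values.reshape(n_years, n_months, n_days)
    .transpose(1, 0, 2)
    .reshape(n_months, n_years * n_days)
)

# A reduction over the complement axis is now vectorised over all months at once.
monthly_quantiles = np.quantile(by_month, [0.25, 0.5, 0.75], axis=1)

# Inverse operation: back to the original chronological axis.
restored = (
    by_month.reshape(n_months, n_years, n_days)
    .transpose(1, 0, 2)
    .reshape(-1)
)
```

Real calendars have months of unequal length, which is why the xarray version ends up with missing values along the complement dimension that must be dropped when ungrouping.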

With this grouping, I can do the same as the loop above (it's not well adapted yet, I need to do some pre-processing):

# rename the complement dimension to "time" for use with QuantileMappings functions
refm = get_windowed_group(ref, "time.month").rename({"month_complement":"time"})
simm = get_windowed_group(sim, "time.month").rename({"month_complement":"time"})

# We also need proper time coordinates to be used with quantile mappings:
refm["time"] = [t.replace("-", "-01-") for t in refm["time"].values]
simm["time"] = [t.replace("-", "-01-") for t in simm["time"].values]
refm["time"] = refm.time.astype("datetime64[ns]")
simm["time"] = simm.time.astype("datetime64[ns]")

qdm = sdba.QuantileDeltaMapping.train(
    refm,
    simm,
)
sim_adj1 = qdm.adjust(simm).compute()
ungrouped_sim_adj1 = ungroup(sim_adj1.rename({"time":"month_complement"}), "time.month", ref.time)

Taking a sample of 160 spatial points, I get:

normal xclim : 27.6 s
my implementation : 14.1 s
Sacha's loop: 14.1 s

For very large datasets though, it might not be ideal as it may increase the number of Dask tasks and there can be leaks in RAM usage. This approach had a mix of successes and failures. But I think it shows that maybe Groupby is not necessarily the thing to blame in map_blocks limitations?

saschahofmann May 8, 2024
Author

I had the same question as @aulemahal and split the training from the adjustment and the results are reversed.

TLDR: Weirdly training is faster without the loop but adjustment isn't, which sounds like the opposite of what you would expect?

I reduced the data used

import numpy as np
import xarray as xr


# Create toy data to explore bias adjustment, here fake temperature timeseries
t = xr.cftime_range("2000-01-01", "2030-12-31", freq="D", calendar="noleap")
lon = np.arange(0, 30, 0.5)
lat = np.arange(10, 40, 0.5)
ref = xr.DataArray(
    (
        -20 * np.cos(2 * np.pi * t.dayofyear / 365)[:, np.newaxis, np.newaxis]
        + 2 * np.random.random_sample((t.size, lon.size, lat.size))
        + 273.15
        + (0.1 * (t - t[0]).days.values / 365)[:, np.newaxis, np.newaxis]
    ),  # "warming" of 1K per decade,
    dims=(
        "time",
        "lon",
        "lat",
    ),
    coords={"time": t, "lon": lon, "lat": lat},
    attrs={"units": "K"},
)
sim = xr.DataArray(
    (
        -18 * np.cos(2 * np.pi * t.dayofyear / 365)[:, np.newaxis, np.newaxis]
        + 2 * np.random.random_sample((t.size, lon.size, lat.size))
        + 273.15
        + (0.11 * (t - t[0]).days.values / 365)[:, np.newaxis, np.newaxis]
    ),  # "warming" of 1.1K per decade
    dims=("time", "lon", "lat"),
    coords={"time": t, "lon": lon, "lat": lat},
    attrs={"units": "K"},
)

and then used timeit to get some better stats. Pure xclim training:

%%timeit
from xclim import sdba

qdm = sdba.QuantileDeltaMapping.train(
    ref,
    sim,
    group="time.month",
)

takes 4.85 s ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and adjustment qdm.adjust(sim) takes 17.1 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

For the loop

%%timeit
qdms = []
datasets = []
for month in range(1, 13):
    ref_m = ref.where(ref.time.dt.month == month, drop=True)
    sim_m = sim.where(sim.time.dt.month == month, drop=True)
    qdm = sdba.QuantileDeltaMapping.train(
        ref_m,
        sim_m,
    )
    qdms.append(qdm)

training takes 7.48 s ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

And the adjustment

%%timeit
datasets = []
for month in range(1, 13):
    sim_m = sim.where(sim.time.dt.month == month, drop=True)
    datasets.append(qdms[month-1].adjust(sim_m))

xr.concat(datasets, dim='time')

takes 7.9 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

The performance issue seems to be in the adjustment. It could of course have to do with the size of the data, so I may rerun the experiments with the earlier datasets. Note also that the loop is much slower if I use Dask arrays, i.e. chunk the ref and sim DataArrays. This might be independent of the grouping though. But even the normal implementation seems to be faster with in-memory NumPy arrays (maybe not surprising?).

Maybe I need to check flox again. The last time I tried it I had problems in some other parts of our pipeline, but if it improves this example I might spend some more time ironing those out.

Thanks @coxipi for those nice examples! Will check them out tomorrow and see how I can use them for my example.

saschahofmann May 9, 2024
Author

Alright, I increased the data size again but the results are very similar:

With xclim grouping

training: 31.1 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
adjustment: 1min 55s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Loop

training: 47.3 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
adjustment: 46.4 s ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I am quite surprised. I would have expected the adjustment to be faster than the training. In the non-grouped version it takes around the same time, but with the grouping it takes ~4x longer than the training? It looks to me like some of these interpolations might be performed multiple times?

saschahofmann Jun 12, 2024
Author

I am finally investigating this again. I increased the number of quantiles a little and used pyinstrument to profile it.

For the xclim native solution, it spends around the same time in training as in adjusting, and during adjustment the majority of the time is spent doing some interpolation (maybe the one you mentioned above).

  _     ._   __/__   _ _  _  _ _/_   Recorded: 10:28:46  Samples:  14921
 /_//_/// /_\ / //_// / //_'/ //     Duration: 78.405    CPU time: 76.764
/   _/                      v4.6.2

Program: /Users/shofmann/Projects/Sandbox/.venv/bin/pyinstrument adjusting.py

78.403 <module>  adjusting.py:1
├─ 39.120 QuantileDeltaMapping.train  xclim/sdba/adjustment.py:163
│     [157 frames hidden]  xclim, <boltons.funcutils, xarray, numba
│        32.608 _quantile  xclim/sdba/nbutils.py:64
├─ 31.517 QuantileDeltaMapping.adjust  xclim/sdba/adjustment.py:201
│     [194 frames hidden]  xclim, <boltons.funcutils, xarray, nu...
│        23.438 NearestNDInterpolator.__call__  scipy/interpolate/_ndgriddata.py:101
│        └─ 23.312 [self]  scipy/interpolate/_ndgriddata.py
├─ 3.615 <module>  xclim/__init__.py:1
│     [9 frames hidden]  xclim, dask
└─ 2.467 <module>  xarray/__init__.py:1
      [5 frames hidden]  xarray, pandas

For the per-month-loop, training takes approximately the same time but adjusting is significantly faster.

 _     ._   __/__   _ _  _  _ _/_   Recorded: 10:31:21  Samples:  10851
 /_//_/// /_\ / //_// / //_'/ //     Duration: 54.610    CPU time: 54.549
/   _/                      v4.6.2

Program: /Users/shofmann/Projects/Sandbox/.venv/bin/pyinstrument adjusting.py

54.603 <module>  adjusting.py:1
├─ 38.828 QuantileDeltaMapping.train  xclim/sdba/adjustment.py:163
│     [170 frames hidden]  xclim, <boltons.funcutils, xarray, numba
│        35.237 _quantile  xclim/sdba/nbutils.py:64
├─ 7.629 QuantileDeltaMapping.adjust  xclim/sdba/adjustment.py:201
│     [41 frames hidden]  xclim, <boltons.funcutils, xarray, nu...
├─ 3.745 <module>  xclim/__init__.py:1
│     [15 frames hidden]  xclim, dask, scipy, cf_xarray
├─ 1.510 DataArray.where  xarray/core/common.py:1058
│     [13 frames hidden]  xarray, copy, <built-in>
├─ 1.020 <module>  xarray/__init__.py:1
│     [4 frames hidden]  xarray, pandas
└─ 0.600 <module>  xclim/sdba/__init__.py:1

I am going to drill down in the profiling and will also try to understand the implementation better but maybe @aulemahal already has an inkling?
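
For anyone who wants to reproduce this kind of breakdown without installing pyinstrument, a rough equivalent can be sketched with the standard library's cProfile. This is a swapped-in tool, not what produced the output above, and profile_call is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

Usage would look like adjusted, report = profile_call(qdm.adjust, sim) followed by print(report), assuming qdm and sim are defined as in the examples above.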

saschahofmann Jun 12, 2024
Author

It indeed looks like the big time loss comes from interpolating the adjustment factors. In fact, the time spent doing it increases 5x when using interp='linear' instead of the default nearest interpolation.

Shouldn't the nearest interpolation end up being the same as the custom loop?

saschahofmann · 2024-05-09T14:10:51Z

saschahofmann
May 9, 2024
Author

Potentially related, I am trying to understand why QuantileDeltaMapping is much slower when using chunked DaskArrays.

For the numbers below I reduced the data again:

t = xr.cftime_range("2010-01-01", "2030-12-31", freq="D", calendar="noleap")
lon = np.arange(0, 30, 1)
lat = np.arange(10, 40, 1)

Without chunking training takes 976 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) and adjustment 3.14 s ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

If I add .chunk(lon=chunk_size, lat=chunk_size) with chunk_size=10 training suddenly takes ~6x longer
training: 6.83 s ± 40.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
adjustment: 4.41 s ± 75.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

1 reply

coxipi May 10, 2024
Collaborator

I'm not sure how %%timeit works with lazy Dask arrays, but I would throw in a .compute() at the end of code cells you want to benchmark for good measure.

For lack of a better name, I have dubbed the approach I introduced above "Big Dumb Dataset" or "BDD". I did a few runs with your last dataset (40 lon x 40 lat) and always get similar results, so pardon me for not doing a %%timeit. Also, the total times I get are similar to what I get if I don't time each step separately.


----------
xclim-chunked
train:  1.16 s
adjust:  2.62 s
total:  3.78 s

xclim-unchunked
train:  1.38 s
adjust:  8.44 s
total:  9.82 s


----------
bdd-chunked
train:  0.99 s
adjust:  4.77 s
total:  5.76 s

bdd-unchunked
train:  1.82 s
adjust:  3.96 s
total:  5.77 s


----------
loop-chunked
train:  2.81 s
adjust:  7.27 s
total:  10.08 s

loop-unchunked
train:  2.15 s
adjust:  4.75 s
total:  6.9 s

That's for the whole train/adjust. It's a similar result where BDD and Sacha's loop were on par for unchunked data.

For the training step, I had to do something like this:

    qdm = sdba.QuantileDeltaMapping.train(ref0, sim0, group="time.month")
    ds =  qdm.ds.compute()
    qdm = sdba.QuantileDeltaMapping.from_dataset(ds)

to be able to .compute() this step
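
A small timing helper along these lines can make the comparison fair between lazy and in-memory inputs. This is a sketch under the assumption that lazy results expose a .compute() method (as Dask-backed xarray objects do). The name timed is hypothetical.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func, force evaluation of a lazy result, and report wall-clock time."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    # Lazy (e.g. Dask-backed) results only do the work once .compute() is called.
    if hasattr(result, "compute"):
        result = result.compute()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed
```

Objects without a .compute() method, such as the trained QuantileDeltaMapping itself, pass through unchanged, which is why the round trip through qdm.ds shown above is still needed to force the training step.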

"If I add .chunk(lon=chunk_size, lat=chunk_size) with chunk_size=10 training suddenly takes ~6x longer"

I don't see this anywhere, for what method was this? The biggest drop in performance I see is when using unchunked instead of chunked for xclim, where adjust is 3.75x.
