-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Is your feature request related to a problem? Please describe.
open_dataset()
currently has a very long function signature. This makes it hard to keep track of everything it can do, and is particularly problematic for the authors of new backends (e.g., see #4477), which might need to know how to handle all these arguments.
Describe the solution you'd like
To simple the interface, I propose to group together all the decoding options into a new DecodingOptions
class. I'm thinking something like:
from dataclasses import dataclass, field, asdict
from typing import Optional, List
@dataclass(frozen=True)
class DecodingOptions:
mask: Optional[bool] = None
scale: Optional[bool] = None
datetime: Optional[bool] = None
timedelta: Optional[bool] = None
use_cftime: Optional[bool] = None
concat_characters: Optional[bool] = None
coords: Optional[bool] = None
drop_variables: Optional[List[str]] = None
@classmethods
def disabled(cls):
return cls(mask=False, scale=False, datetime=False, timedelta=False,
concat_characters=False, coords=False)
def non_defaults(self):
return {k: v for k, v in asdict(self).items() if v is not None}
# add another method for creating default Variable Coder() objects,
# e.g., those listed in encode_cf_variable()
The signature of open_dataset
would then become:
def open_dataset(
filename_or_obj,
group=None,
*
engine=None,
chunks=None,
lock=None,
cache=None,
backend_kwargs=None,
decode: Union[DecodingOptions, bool] = None,
**deprecated_kwargs
):
if decode is None:
decode = DecodingOptions()
if decode is False:
decode = DecodingOptions.disabled()
# handle deprecated_kwargs...
...
Question: are decode
and DecodingOptions
the right names? Maybe these should still include the name "CF", e.g., decode_cf
and CFDecodingOptions
, given that these are specific to CF conventions?
Note: the current signature is open_dataset(filename_or_obj, group=None, decode_cf=True, mask_and_scale=None, decode_times=True, autoclose=None, concat_characters=True, decode_coords=True, engine=None, chunks=None, lock=None, cache=None, drop_variables=None, backend_kwargs=None, use_cftime=None, decode_timedelta=None)
Usage with the new interface would look like xr.open_dataset(filename, decode=False)
or xr.open_dataset(filename, decode=xr.DecodingOptions(mask=False, scale=False))
.
This requires a little bit more typing than what we currently have, but it has a few advantages:
- It's easier to understand the role of different arguments. Now there is a function with ~8 arguments and a class with ~8 arguments rather than a function with ~15 arguments.
- It's easier to add new decoding arguments (e.g., for more advanced CF conventions), because they don't clutter the
open_dataset
interface. For example, I separated outmask
andscale
arguments, versus the currentmask_and_scale
argument. - If a new backend plugin for
open_dataset()
needs to handle every option supported byopen_dataset()
, this makes that task significantly easier. The only decoding options they need to worry about are non-default options that were explicitly set, i.e., those exposed by thenon_defaults()
method. If another decoding option wasn't explicitly set and isn't recognized by the backend, they can just ignore it.
Describe alternatives you've considered
For the overall approach:
- We could keep the current design, with separate keyword arguments for decoding options, and just be very careful about passing around these arguments. This seems pretty painful for the backend refactor, though.
- We could keep the current design only for the user facing
open_dataset()
interface, and then internally convert into theDecodingOptions()
struct for passing to backend constructors. This would provide much needed flexibility for backend authors, but most users wouldn't benefit from the new interface. Perhaps this would make sense as an intermediate step?
Activity
dcherian commentedon Oct 6, 2020
Totally in favour of this. Option 2 does seem like a good intermediate step.
This proposal would make error handling easier (#3020)
I agree that we should add CF to the names. Can we use
Decoders
instead ofDecodingOptions
? Sodecode_cf
andCFDecoders
?shoyer commentedon Oct 6, 2020
works for me!
aurghs commentedon Oct 6, 2020
I agree,
open_dataset()
currently has a very long signature that should be changed.The interface you proposed is obviously clearer, but a class could give a false idea that all backends support all the decoding options listed in the class. I see two other alternatives:
For both these proposals, we would lose the autocompletion with the tab but, on the other hand, the user would be relieved of managing a class.
Finally, I'm not sure that for the user it would be clear the separation between backend_kwargs and decode, since they both contain arguments that will be passed to the backend. Especially if the backend needs more specific decoding options that must be set in backend_kwargs. In this sense, #4309 seems less error-prone.
shoyer commentedon Oct 7, 2020
I see this as complementary to @alexamici's proposal in #4309 (comment), which I also like. I guess the main difference is moving
decode_cf
into the explicit signature ofopen_dataset
, rather than leaving it in**kwargs
.Auto-completion is one reason to prefer a class, but enforced error checking and consistency between backends in the data model are also good reasons. In particular, it is important that users get an error if they mis-spell an argument name, e.g.,
open_dataset(path, decode_times=False)
vsopen_dataset(path, decode_time=False)
. We can definitely achieve this with putting decoding options into adict
, too, but we would need to be carefully to always validate the set of dict keys.I guess cfgrib is an example of a backend with its own CF decoding options? This is indeed a tricky design question. I don't know if it's possible to make
xarray.open_dataset()
directly extensible in this way -- this could still be a reason for a user to use a backend-specificcfgrib.open_dataset()
function.alexamici commentedon Oct 8, 2020
@shoyer I favour option 2. as a stable solution, possibly with all arguments moved to keyword-only ones:
open_dataset
I'm for using the words decode/decoding but I'm against the use of CF.
What backend will do is map the on-disk representation of the data (typically optimised for space) to the in-memory representation preferred by xarray (typically optimised of easy of use). This mapping is especially tricky for time-like variables.
CF is one possible way to specify the map, but it is not the only one. Both the GRIB format and all the spatial formats supported by GDAL/rasterio can encode rich data and decoding has (typically) nothing to do with the CF conventions.
My preferred meaning for the
decode_
-options is:True
: the backend attempts to map the data to the xarray natural data types (np.datetime64
,np.float
with mask and scale)False
: the backend attempts to return a representation of the data as close as possible to the on-disk oneTypically when a user asks the backend not to decode they intend to insepct the content of the data file to understand why something is not mapping in the expected way.
As an example: in the case of GRIB time-like values are represented as integers like
20190101
, but cfgrib at the moment is forced to convert them into a fake CF representation before passing them to xarray, and when usingdecode_times=False
a GRIB user is presented with something that has nothing to do with the on-disk representation.aurghs commentedon Oct 29, 2020
Taking into account the comments in this issue and the calls, I would propose this solution: #4547
xr.open_dataset
using newdecoding_kwargs
dict #937916 remaining items