TomNicholas commented on Apr 14, 2022

Thanks for coming to the xarray community meeting yesterday to ask about datatree @OriolAbril!

I've just had a look at the code for InferenceData, and here are my thoughts about using DataTree to replace that class.
Similarities
I was pleasantly surprised by how similar the two classes are. They both:
Map strings to xr.Datasets
Have to- and from-netcdf and zarr methods
Copy a large portion of the xr.Dataset API by mapping it over all stored groups
Have a map-like method to map an arbitrary function over all groups
Store global attributes
This should mean it's not too hard to swap one out for the other.
Differences
However there are some significant differences:
Data stored as unnamed datasets under a string key
Currently DataTree doesn't quite have a dict-like structure, because each node knows its own name, and different nodes are stored in a tuple. However, I plan to change this to a dict-like key-node mapping soon.
Attribute-like access returns stored datasets, not a datatree object
Because a DataTree is an arbitrarily-nested recursive data structure, selecting a particular group must also return a DataTree object, not an xr.Dataset like InferenceData does. DataTree allows you to easily convert the data in that group to an xr.Dataset, but I think this is an awkward difference in API that is inevitable unless a new InferenceData class wrapped DataTree.
Also, the return value from InferenceData is a reference to the dataset object, not a copy, which is something we need to be careful about.
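A small sketch of the difference, using a plain dict for the InferenceData-style mapping and a minimal hypothetical Node class standing in for DataTree (not the real datatree API):

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, 2.0])})

# InferenceData-style access: a plain mapping from group name to
# Dataset, returning the stored object itself (by reference).
groups = {"posterior": ds}
assert groups["posterior"] is ds  # same object, not a copy

# DataTree-style access: selecting a group yields another tree node,
# and an extra step (a `.ds`-like property) extracts the Dataset.
# Sketched with a minimal stand-in class, not the real DataTree:
class Node:
    def __init__(self, ds=None, children=None):
        self.ds = ds if ds is not None else xr.Dataset()
        self.children = children or {}

    def __getitem__(self, key):
        return self.children[key]

tree = Node(children={"posterior": Node(ds)})
assert isinstance(tree["posterior"], Node)  # a subtree, not a Dataset
assert tree["posterior"].ds is ds           # .ds reaches the Dataset
```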
"Warmup datasets"
InferenceData has these "warmup datasets", attached on initialization, which seem to be some kind of initial guess? (I'm not particularly familiar with Bayesian analysis...) The question is how to store these if using DataTree instead.
You could reimplement InferenceData to wrap two entire DataTree objects, accessing the "warmup dataset" when requested.
Alternatively you could store each "warmup dataset" in a second layer of a single datatree, like this
DataTree('root', parent=None)
├── DataTree('prior')
│   │   Dimensions:  (x: 2, y: 3)
│   │   Coordinates:
│   │     * x        (x) int64 10 20
│   │   Dimensions without coordinates: y
│   │   Data variables:
│   │       foo      (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
│   │       bar      (x) int64 1 2
│   └── DataTree('warmup')
│           Dimensions:  (x: 2, y: 3)
│           Coordinates:
│             * x        (x) int64 10 20
│           Dimensions without coordinates: y
│           Data variables:
│               foo      (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
│               bar      (x) int64 1 2
└── DataTree('posterior')
        Dimensions:  (x: 2, y: 3)
        Coordinates:
          * x        (x) int64 10 20
        Dimensions without coordinates: y
        Data variables:
            foo      (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
            bar      (x) int64 1 2
            baz      float64 3.142
But this might lead to unintuitive behaviour if presented to the user, as operations called on the tree would automatically map over all sub-groups, including these "warmup" ones.
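Both options could be sketched like this; dicts of xr.Dataset objects stand in for full DataTree objects, and WrappedTrees and its warmup accessor are hypothetical names, not arviz or datatree API:

```python
import xarray as xr

posterior = xr.Dataset({"mu": ("draw", [0.1, 0.2])})
warmup = xr.Dataset({"mu": ("draw", [5.0, 2.0])})

# Option 1: wrap two parallel trees, exposing the warmup one only on
# explicit request, so mapped operations never touch it by accident.
class WrappedTrees:
    def __init__(self, main, warmup_tree):
        self._main = main
        self._warmup = warmup_tree

    def __getitem__(self, group):
        return self._main[group]

    def warmup(self, group):
        return self._warmup[group]

idata = WrappedTrees({"posterior": posterior}, {"posterior": warmup})
assert idata["posterior"] is posterior
assert idata.warmup("posterior") is warmup

# Option 2: nest each warmup dataset under its group in one tree; any
# operation mapped over the whole tree would then also hit "warmup".
nested = {"posterior": {"data": posterior, "warmup": warmup}}
assert nested["posterior"]["warmup"] is warmup
```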
"Matching samples"
I'm not quite sure what this feature of InferenceData does, but it seems like it's something that can be determined just from iterating over a tree? In which case it's probably best implemented as a function or accessor method.
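For example, a "samples match" check could be a plain function iterating over the groups, here sketched over a mapping of group names to xr.Dataset objects (the function name and signature are made up for illustration):

```python
import numpy as np
import xarray as xr

def samples_match(groups, dims=("chain", "draw")):
    """Return True if every group that has chain/draw coordinates
    agrees with every other group on their values."""
    reference = {}
    for ds in groups.values():
        for dim in dims:
            if dim not in ds.coords:
                continue  # this group has no such sample coordinate
            values = ds[dim].values
            if dim not in reference:
                reference[dim] = values
            elif not np.array_equal(reference[dim], values):
                return False
    return True

posterior = xr.Dataset(
    {"mu": (("chain", "draw"), np.zeros((2, 4)))},
    coords={"chain": [0, 1], "draw": np.arange(4)},
)
log_lik = xr.Dataset(
    {"obs": (("chain", "draw"), np.zeros((2, 4)))},
    coords={"chain": [0, 1], "draw": np.arange(4)},
)
short = xr.Dataset(
    {"y": (("chain", "draw"), np.zeros((2, 3)))},
    coords={"chain": [0, 1], "draw": np.arange(3)},
)

assert samples_match({"posterior": posterior, "log_likelihood": log_lik})
assert not samples_match({"posterior": posterior, "predictions": short})
```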
Typing
There is a difference in typing because InferenceData implements the Mapping protocol, but this probably isn't particularly important.
InferenceData.__add__
__add__-ing two InferenceData objects will concat them instead of performing element-wise addition. This is in complete contrast to DataTree, where __add__ will add variables in corresponding groups to one another. The latter is much more similar to how xr.Dataset.__add__ works, and I don't think DataTree should deviate from this.
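To make the contrast concrete, here's how the element-wise convention behaves for xr.Dataset, with a dict of Datasets standing in for DataTree's group-wise mapping (a sketch, not the actual DataTree implementation):

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, 2.0])})

# xr.Dataset.__add__ is element-wise:
assert (ds + ds)["foo"].values.tolist() == [2.0, 4.0]

# DataTree's convention maps the same element-wise addition over
# corresponding groups; sketched here with dicts of Datasets:
tree_a = {"posterior": ds}
tree_b = {"posterior": ds}
summed = {name: tree_a[name] + tree_b[name] for name in tree_a}
assert summed["posterior"]["foo"].values.tolist() == [2.0, 4.0]

# InferenceData.__add__ instead concatenates the two objects' groups,
# which is the behaviour this thread discusses deprecating.
```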
concat
Similarly, arviz.concat behaves in a way that is inconsistent with xarray conventions, as it mixes xarray's concept of concat with merge. If we eventually update xr.concat to deal with DataTree objects, it still wouldn't work quite the same way that arviz.concat does now.
filtering elements with a regex
This is something that InferenceData can do which DataTree cannot yet. I would welcome PRs on this, but the implementation would need to be carefully thought through for the nested case, and there is no guarantee that the best solution for DataTree matches the API currently presented by InferenceData.
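One possible shape for such a filter, sketched over a nested dict standing in for a DataTree (the function and its semantics for nested groups are hypothetical, not existing datatree API):

```python
import re

def filter_by_regex(tree, pattern):
    """Hypothetical recursive filter: keep only variables whose name
    matches `pattern`, walking a nested dict-of-dicts standing in for
    a DataTree (leaves are variable values, sub-dicts are groups)."""
    regex = re.compile(pattern)
    result = {}
    for key, value in tree.items():
        if isinstance(value, dict):       # a subgroup: recurse into it
            kept = filter_by_regex(value, pattern)
            if kept:                      # drop groups left empty
                result[key] = kept
        elif regex.search(key):           # a variable: test its name
            result[key] = value
    return result

tree = {
    "posterior": {"mu": 0.1, "sigma": 1.0, "mu_school": 0.3},
    "prior": {"mu": 0.0, "tau": 2.0},
}
assert filter_by_regex(tree, r"^mu") == {
    "posterior": {"mu": 0.1, "mu_school": 0.3},
    "prior": {"mu": 0.0},
}
```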
Other API differences
There are some other small API differences (e.g. map vs map_over_subtree).
Implementation
Now the behaviour of DataTree could potentially be changed in some ways if it makes sense, but the API is also more constrained than for InferenceData because we need to
support arbitrary nesting (not just one layer deep),
be domain-agnostic, and
follow existing xarray conventions as closely as possible.
That's why the basic choice to be made on the arviz side is whether you would want to directly use and expose DataTree objects, or just wrap them and use them as a convenient internal data structure. Either way could work, and I think it could save you a lot of code!

(Tagging @jhamman, @alexamici, and @aurghs in case any of you are interested)
OriolAbril commented on Apr 22, 2022

Thanks for the super detailed comment. I am slowly familiarizing myself with DataTree and I'll try to answer some of the points, but I can't speak to everything yet.
Attribute-like access returns stored datasets, not a datatree object
I think this is good. IIUC, one would now be able to do idata["posterior"].ds to get the posterior group as a dataset, but idata["posterior"] would return a DataTree. There are some cases in which users create models that can be used on their own or as pieces of larger models. In such cases, variables inside a submodel are currently identified with submodel_name::var_name, but it would probably be better to have them be nested groups.
Note: this will probably mean making some changes to ArviZ functions, probably to plotting, but imo it will be worth it to 1) take advantage of the improved datatree features and 2) stop maintaining InferenceData
"Warmup datasets"
The warmup datasets as hidden attributes is something we have discussed changing a couple of times. These groups are generally not present and are useless to users of Bayesian models; they are mostly useful to people working on MCMC algorithms, and in some cases to diagnose why a model is not being sampled correctly. I still have to look into what could make sense here. From a conceptual point of view, having those as subgroups inside their non-warmup counterpart makes sense, but I can't think of any case where we'd want functions to be applied to both warmup and non-warmup groups in the same way.
"Matching samples"
Does this come from this page? It isn't a feature yet; it's more of a useful concept and assumption made by several of our functions.
Here is some intuition on the general idea: the core of Bayes+MCMC methods is generating samples from the posterior distribution of the parameters in our model via MCMC (because getting the posterior analytically is impossible). We then get the posterior dataset with one sample (a sample means a pair of chain+draw values) for each variable (potentially multidimensional even before adding the chain and draw dimensions). But also, as they were generated with MCMC, we get some summary statistics for each sample, and the pointwise log likelihood values, also at each sample. And when we generate predictions, we generally generate one prediction per posterior sample.
Currently each of these groups has chain and draw dimensions that are independent of each other because they are independent datasets, but we assume they are the same in some plots, in model comparison, and in model criticism.
If possible, it might be useful to share those dimension/coordinate values between groups, but there are also cases where generating predictions is expensive, for example, and it might be enough to generate them for only a subset of the samples, in which case it would need to be possible to have independent dimensions too. And it would be perfectly fine to keep all dimensions independent while assuming they are the same; nobody has ever complained about this.
Typing
I sincerely have no idea how any of the typing things related to InferenceData work.
InferenceData.__add__
I think we can remove or deprecate that already (cc @ahartikainen) to make the transition easier. I am not sure anyone is using it, and I wouldn't have recommended anyone use it; I'd have told them to use concat explicitly.
concat
If all the cases covered in az.concat can be done on datatrees with either xr.concat or xr.merge, I think that would not be an issue; otherwise we'd probably keep it but update it to work on datatrees, as it covers important use cases that we'd want to support (even if it only worked on datatrees with only 1 level).
filtering elements with a regex
I might be able to help with that at some point, but I think it will take me some time to be able to do anything useful (both because I don't have a lot of free time and because understanding datatrees, and especially the cases that differ from ArviZ, will probably take some time by itself). Is there an issue related to this I could track? Alternatively, feel free to tag me if you open an issue on this or there is a PR to test.
> Now the behaviour of DataTree could potentially be changed in some ways if it makes sense, but the API is also more constrained than for InferenceData because we need to …
I still haven't figured out how subsetting/filtering works with DataTrees. I'd be happy to write some docs as I learn, but I'm still quite unsure how to go about this. For now, I'm trying to look at things like getting a subset of the datatree that consists of multiple groups, or applying a function to the variable x that is present in 3 out of 5 groups of the datatree. Depending on what is possible and your future plans, it might be good to extend some functions; e.g. if dt[["posterior", "posterior_predictive"]] is not possible, then map_over_subtree would benefit from a group kwarg so it doesn't apply the function to all groups/children. At least in our case, we want to be able to apply functions to multiple groups at once, but that hardly ever means all groups at once.
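The group kwarg idea could look roughly like this, sketched over a flat mapping of group names to xr.Dataset objects (map_over_groups and its names parameter are hypothetical, not existing datatree API):

```python
import xarray as xr

def map_over_groups(groups, func, names=None):
    """Hypothetical `group` kwarg behaviour: apply `func` only to the
    named groups, leaving the rest untouched. `groups` is a mapping of
    group name -> xr.Dataset standing in for a one-level DataTree."""
    names = set(groups) if names is None else set(names)
    return {
        name: func(ds) if name in names else ds
        for name, ds in groups.items()
    }

groups = {
    "posterior": xr.Dataset({"mu": ("draw", [1.0, 2.0])}),
    "posterior_predictive": xr.Dataset({"y": ("draw", [0.5, 1.5])}),
    "observed_data": xr.Dataset({"y": ("obs", [1.0])}),
}
doubled = map_over_groups(
    groups, lambda ds: ds * 2, names=["posterior", "posterior_predictive"]
)
assert doubled["posterior"]["mu"].values.tolist() == [2.0, 4.0]
assert doubled["observed_data"]["y"].values.tolist() == [1.0]  # untouched
```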
TomNicholas commented on Apr 25, 2022

> IIUC, one would now be able to do idata["posterior"].ds to get the posterior group as a dataset, but idata["posterior"] would return a DataTree.
Yes that's correct.
> Note: this will probably mean making some changes to ArviZ functions, probably to plotting, but imo it will be worth it to 1) take advantage of the improved datatree features and 2) stop maintaining InferenceData
I agree.
> From a conceptual point of view, having those as subgroups inside their non-warmup counterpart makes sense, but I can't think of any case where we'd want functions to be applied to both warmup and non-warmup groups in the same way.
Okay, we'll have to think about this more carefully.
> If possible, it might be useful to share those dimension/coordinate values between groups, but there are also cases where generating predictions is expensive, for example, and it might be enough to generate them for only a subset of the samples, in which case it would need to be possible to have independent dimensions too. And it would be perfectly fine to keep all dimensions independent while assuming they are the same; nobody has ever complained about this.
Thanks for the explanation about matching samples. I want to make it easier to refer to one coordinate from multiple different groups, which might help with this.
> Yes, I think we can deprecate __add__. (@ahartikainen, Apr 23, 2022)
Great.
> If all the cases covered in az.concat can be done on datatrees with either xr.concat or xr.merge, I think that would not be an issue; otherwise we'd probably keep it but update it to work on datatrees, as it covers important use cases that we'd want to support (even if it only worked on datatrees with only 1 level).
So I expect that xr.concat and xr.merge aren't going to accept DataTree objects until the whole datatree package is implemented upstream in xarray's main repo. But if we made az.concat and az.merge consistent with how we eventually want xr.concat and xr.merge to behave, then that would smooth the eventual transition. Alternatively, we could just forget about concat/merge functions and concentrate on the method syntax for these operations instead, e.g. write DataTree.merge analogously to Dataset.merge.
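For the dim-concatenation use case (e.g. combining chains from two runs of the same model), mapping xr.concat over matching groups already follows xarray conventions; a sketch with dicts of Datasets standing in for DataTree objects:

```python
import xarray as xr

# Two "trees" holding samples from separate runs of the same model.
run1 = {"posterior": xr.Dataset(
    {"mu": (("chain", "draw"), [[0.1, 0.2]])}, coords={"chain": [0]},
)}
run2 = {"posterior": xr.Dataset(
    {"mu": (("chain", "draw"), [[0.3, 0.4]])}, coords={"chain": [1]},
)}

# A DataTree-aware concat could map xr.concat over matching groups:
combined = {
    name: xr.concat([run1[name], run2[name]], dim="chain")
    for name in run1
}
assert combined["posterior"].sizes["chain"] == 2
```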
> For concat, we could always create a couple special functions. Or maybe one which can do both special cases. (@ahartikainen, Apr 23, 2022)
This also seems like a good idea.
> I still haven't figured out how subsetting/filtering works with DataTrees.

That's because it's not really been defined yet 😅 There is now!: xarray-contrib/datatree#79
> Depending on what is possible and your future plans, it might be good to extend some functions; e.g. if dt[["posterior", "posterior_predictive"]] is not possible, then map_over_subtree would benefit from a group kwarg so it doesn't apply the function to all groups/children. At least in our case, we want to be able to apply functions to multiple groups at once, but that hardly ever means all groups at once.

This is useful to know already, but we can discuss details on xarray-contrib/datatree#79