TomNicholas commented on Apr 14, 2022

Thanks for coming to the xarray community meeting yesterday to ask about datatree @OriolAbril!

I've just had a look at the code for InferenceData, and here are my thoughts about using DataTree to replace that class.
Similarities
I was pleasantly surprised by how similar the two classes are. They both:
Map strings to xr.Datasets
Have to- and from-netcdf and zarr methods
Copy a large portion of the xr.Dataset API by mapping it over all stored groups
Have a map-like method to map an arbitrary function over all groups
Store global attributes
This should mean it's not too hard to swap one out for the other.
Differences
However there are some significant differences:
Data stored as unnamed datasets under a string key
Currently DataTree doesn't quite have a dict-like structure, because each node knows its own name, and different nodes are stored in a tuple. However, I plan to change this to a dict-like key-node mapping soon.
Attribute-like access returns stored datasets, not a datatree object
Because a DataTree is an arbitrarily-nested recursive data structure, selecting a particular group must also return a DataTree object, not an xr.Dataset like InferenceData does. DataTree allows you to easily convert the data in that group to an xr.Dataset, but I think this is an awkward difference in API that is inevitable unless a new InferenceData class wrapped DataTree.
Also, the return value from InferenceData is a reference to the dataset object, not a copy, which is something we need to be careful about.
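A small sketch of the difference, using a plain dict for the InferenceData-style mapping and a minimal hypothetical Node class standing in for DataTree (not the real datatree API):

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, 2.0])})

# InferenceData-style access: a plain mapping from group name to
# Dataset, returning the stored object itself (by reference).
groups = {"posterior": ds}
assert groups["posterior"] is ds  # same object, not a copy

# DataTree-style access: selecting a group yields another tree node,
# and an extra step (a `.ds`-like property) extracts the Dataset.
# Sketched with a minimal stand-in class, not the real DataTree:
class Node:
    def __init__(self, ds=None, children=None):
        self.ds = ds if ds is not None else xr.Dataset()
        self.children = children or {}

    def __getitem__(self, key):
        return self.children[key]

tree = Node(children={"posterior": Node(ds)})
assert isinstance(tree["posterior"], Node)  # a subtree, not a Dataset
assert tree["posterior"].ds is ds           # .ds reaches the Dataset
```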
"Warmup datasets"
InferenceData has these "warmup datasets", attached on initialization, which seem to be some kind of initial guess? (I'm not particularly familiar with Bayesian analysis...) The question is how to store these if using DataTree instead.
You could reimplement InferenceData to wrap two entire DataTree objects, accessing the "warmup dataset" when requested.
Alternatively you could store each "warmup dataset" in a second layer of a single datatree, like this
DataTree('root', parent=None)
├── DataTree('prior')
│   │   Dimensions:  (x: 2, y: 3)
│   │   Coordinates:
│   │     * x        (x) int64 10 20
│   │   Dimensions without coordinates: y
│   │   Data variables:
│   │       foo      (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
│   │       bar      (x) int64 1 2
│   └── DataTree('warmup')
│           Dimensions:  (x: 2, y: 3)
│           Coordinates:
│             * x        (x) int64 10 20
│           Dimensions without coordinates: y
│           Data variables:
│               foo      (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
│               bar      (x) int64 1 2
└── DataTree('posterior')
        Dimensions:  (x: 2, y: 3)
        Coordinates:
          * x        (x) int64 10 20
        Dimensions without coordinates: y
        Data variables:
            foo      (x, y) float64 0.2337 -1.755 0.5762 -0.4241 0.7463 -0.4298
            bar      (x) int64 1 2
            baz      float64 3.142
But this might lead to unintuitive behaviour if presented to the user, as operations called on the tree would automatically map over all sub-groups, including these "warmup" ones.
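Both options could be sketched like this; dicts of xr.Dataset objects stand in for full DataTree objects, and WrappedTrees and its warmup accessor are hypothetical names, not arviz or datatree API:

```python
import xarray as xr

posterior = xr.Dataset({"mu": ("draw", [0.1, 0.2])})
warmup = xr.Dataset({"mu": ("draw", [5.0, 2.0])})

# Option 1: wrap two parallel trees, exposing the warmup one only on
# explicit request, so mapped operations never touch it by accident.
class WrappedTrees:
    def __init__(self, main, warmup_tree):
        self._main = main
        self._warmup = warmup_tree

    def __getitem__(self, group):
        return self._main[group]

    def warmup(self, group):
        return self._warmup[group]

idata = WrappedTrees({"posterior": posterior}, {"posterior": warmup})
assert idata["posterior"] is posterior
assert idata.warmup("posterior") is warmup

# Option 2: nest each warmup dataset under its group in one tree; any
# operation mapped over the whole tree would then also hit "warmup".
nested = {"posterior": {"data": posterior, "warmup": warmup}}
assert nested["posterior"]["warmup"] is warmup
```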
"Matching samples"
I'm not quite sure what this feature of InferenceData does, but it seems like it's something that can be determined just from iterating over a tree? In which case it's probably best implemented as a function or accessor method.
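For example, a "samples match" check could be a plain function iterating over the groups, here sketched over a mapping of group names to xr.Dataset objects (the function name and signature are made up for illustration):

```python
import numpy as np
import xarray as xr

def samples_match(groups, dims=("chain", "draw")):
    """Return True if every group that has chain/draw coordinates
    agrees with every other group on their values."""
    reference = {}
    for ds in groups.values():
        for dim in dims:
            if dim not in ds.coords:
                continue  # this group has no such sample coordinate
            values = ds[dim].values
            if dim not in reference:
                reference[dim] = values
            elif not np.array_equal(reference[dim], values):
                return False
    return True

posterior = xr.Dataset(
    {"mu": (("chain", "draw"), np.zeros((2, 4)))},
    coords={"chain": [0, 1], "draw": np.arange(4)},
)
log_lik = xr.Dataset(
    {"obs": (("chain", "draw"), np.zeros((2, 4)))},
    coords={"chain": [0, 1], "draw": np.arange(4)},
)
short = xr.Dataset(
    {"y": (("chain", "draw"), np.zeros((2, 3)))},
    coords={"chain": [0, 1], "draw": np.arange(3)},
)

assert samples_match({"posterior": posterior, "log_likelihood": log_lik})
assert not samples_match({"posterior": posterior, "predictions": short})
```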
Typing
There is a difference in typing because InferenceData implements the Mapping protocol, but this probably isn't particularly important.
InferenceData.__add__
__add__-ing two InferenceData objects will concat them instead of performing element-wise addition. This is in complete contrast to DataTree, where __add__ will add variables in corresponding groups to one another. The latter is much more similar to how xr.Dataset.__add__ works, and I don't think DataTree should deviate from this.
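To make the contrast concrete, here's how the element-wise convention behaves for xr.Dataset, with a dict of Datasets standing in for DataTree's group-wise mapping (a sketch, not the actual DataTree implementation):

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, 2.0])})

# xr.Dataset.__add__ is element-wise:
assert (ds + ds)["foo"].values.tolist() == [2.0, 4.0]

# DataTree's convention maps the same element-wise addition over
# corresponding groups; sketched here with dicts of Datasets:
tree_a = {"posterior": ds}
tree_b = {"posterior": ds}
summed = {name: tree_a[name] + tree_b[name] for name in tree_a}
assert summed["posterior"]["foo"].values.tolist() == [2.0, 4.0]

# InferenceData.__add__ instead concatenates the two objects' groups,
# which is the behaviour this thread discusses deprecating.
```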
concat
Similarly, arviz.concat behaves in a way that is inconsistent with xarray conventions, as it mixes xarray's concept of concat with merge. If we eventually update xr.concat to deal with DataTree objects, it still wouldn't work quite the same way that arviz.concat does now.
filtering elements with a regex
This is something that InferenceData can do which DataTree cannot yet. I would welcome PRs on this, but the implementation would need to be carefully thought through for the nested case, and there is no guarantee that the best solution for DataTree matches the API currently presented by InferenceData.
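One possible shape for such a filter, sketched over a nested dict standing in for a DataTree (the function and its semantics for nested groups are hypothetical, not existing datatree API):

```python
import re

def filter_by_regex(tree, pattern):
    """Hypothetical recursive filter: keep only variables whose name
    matches `pattern`, walking a nested dict-of-dicts standing in for
    a DataTree (leaves are variable values, sub-dicts are groups)."""
    regex = re.compile(pattern)
    result = {}
    for key, value in tree.items():
        if isinstance(value, dict):       # a subgroup: recurse into it
            kept = filter_by_regex(value, pattern)
            if kept:                      # drop groups left empty
                result[key] = kept
        elif regex.search(key):           # a variable: test its name
            result[key] = value
    return result

tree = {
    "posterior": {"mu": 0.1, "sigma": 1.0, "mu_school": 0.3},
    "prior": {"mu": 0.0, "tau": 2.0},
}
assert filter_by_regex(tree, r"^mu") == {
    "posterior": {"mu": 0.1, "mu_school": 0.3},
    "prior": {"mu": 0.0},
}
```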
Other API differences
There are some other small API differences (e.g. map vs map_over_subtree).
Implementation
Now the behaviour of DataTree could potentially be changed in some ways if it makes sense, but the API is also more constrained than for InferenceData because we need to
support arbitrary nesting (not just one layer deep),
be domain-agnostic, and
follow existing xarray conventions as closely as possible.
That's why the basic choice to be made on the arviz side is whether you would want to directly use and expose DataTree objects, or just wrap them and use them as a convenient internal data structure. Either way could work, and I think it could save you a lot of code!

(Tagging @jhamman, @alexamici, and @aurghs in case any of you are interested)
OriolAbril commented on Apr 22, 2022

Thanks for the super detailed comment. I am slowly familiarizing myself with DataTree and I'll try to answer some of the points, but I can't speak to everything yet.
Attribute-like access returns stored datasets, not a datatree object
I think this is good. IIUC, one would now be able to do idata["posterior"].ds to get the posterior group as a dataset, but idata["posterior"] would return a DataTree. There are some cases in which users create models that can be used on their own or as pieces of larger models. In such cases, variables inside a submodel are currently identified with submodel_name::var_name, but it would probably be better to have them be nested groups.
Note: this will probably mean making some changes to ArviZ functions, probably to plotting, but imo it will be worth it to 1) take advantage of the improved datatree features and 2) stop maintaining InferenceData
"Warmup datasets"
The warmup datasets as hidden attributes is something we have discussed changing a couple of times. These groups are generally not present and are useless to users of Bayesian models; they are mostly useful to people working on MCMC algorithms, and in some cases to diagnose why a model is not being sampled correctly. I still have to look into what could make sense here. From a conceptual point of view, having those as subgroups inside their non-warmup counterpart makes sense, but I can't think of any case where we'd want functions to be applied to both warmup and non-warmup groups in the same way.
"Matching samples"
Does this come from this page? It isn't a feature yet; it's more of a useful concept and assumption made by several of our functions.
Here is some intuition on the general idea: the core of Bayes+MCMC methods is generating samples from the posterior distribution of the parameters in our model via MCMC (because getting the posterior analytically is impossible). We then get the posterior dataset with one sample (a sample means a pair of chain+draw values) for each variable (potentially multidimensional even before adding the chain and draw dimensions). But also, as they were generated with MCMC, we get some summary statistics for each sample, and the pointwise log likelihood values, also at each sample. And when we generate predictions, we generally generate one prediction per posterior sample.
Currently each of these groups has chain and draw dimensions that are independent of each other because they are independent datasets, but we assume they are the same in some plots, in model comparison, and in model criticism.
If possible, it might be useful to share those dimension/coordinate values between groups, but there are also cases where generating predictions is expensive, for example, and it might be enough to generate them for only a subset of the samples, in which case it would need to be possible to have independent dimensions too. And it would be perfectly fine to keep all dimensions independent while assuming they are the same; nobody has ever complained about this.
Typing
I sincerely have no idea how any of the typing things related to InferenceData work.
InferenceData.__add__
I think we can remove or deprecate that already (cc @ahartikainen) to make the transition easier. I am not sure anyone is using it, and I wouldn't have recommended anyone use it; I'd have told them to use concat explicitly.
concat
If all the cases covered in az.concat can be done on datatrees with either xr.concat or xr.merge, I think that would not be an issue; otherwise we'd probably keep it but update it to work on datatrees, as it covers important use cases that we'd want to support (even if it only worked on datatrees with only 1 level).
filtering elements with a regex
I might be able to help with that at some point, but I think it will take me some time to be able to do anything useful (both because I don't have a lot of free time and because understanding datatrees, and especially the cases that differ from ArviZ, will probably take some time by itself). Is there an issue related to this I could track? Alternatively, feel free to tag me if you open an issue on this or there is a PR to test.
> Now the behaviour of DataTree could potentially be changed in some ways if it makes sense, but the API is also more constrained than for InferenceData because we need to …
I still haven't figured out how subsetting/filtering works with DataTrees. I'd be happy to write some docs as I learn, but I'm still quite unsure how to go about this. For now, I'm trying to look at things like getting a subset of the datatree that consists of multiple groups, or applying a function to the variable x that is present in 3 out of 5 groups of the datatree. Depending on what is possible and your future plans, it might be good to extend some functions; e.g. if dt[["posterior", "posterior_predictive"]] is not possible, then map_over_subtree would benefit from a group kwarg so it doesn't apply the function to all groups/children. At least in our case, we want to be able to apply functions to multiple groups at once, but that hardly ever means all groups at once.
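The group kwarg idea could look roughly like this, sketched over a flat mapping of group names to xr.Dataset objects (map_over_groups and its names parameter are hypothetical, not existing datatree API):

```python
import xarray as xr

def map_over_groups(groups, func, names=None):
    """Hypothetical `group` kwarg behaviour: apply `func` only to the
    named groups, leaving the rest untouched. `groups` is a mapping of
    group name -> xr.Dataset standing in for a one-level DataTree."""
    names = set(groups) if names is None else set(names)
    return {
        name: func(ds) if name in names else ds
        for name, ds in groups.items()
    }

groups = {
    "posterior": xr.Dataset({"mu": ("draw", [1.0, 2.0])}),
    "posterior_predictive": xr.Dataset({"y": ("draw", [0.5, 1.5])}),
    "observed_data": xr.Dataset({"y": ("obs", [1.0])}),
}
doubled = map_over_groups(
    groups, lambda ds: ds * 2, names=["posterior", "posterior_predictive"]
)
assert doubled["posterior"]["mu"].values.tolist() == [2.0, 4.0]
assert doubled["observed_data"]["y"].values.tolist() == [1.0]  # untouched
```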
TomNicholas commented on Apr 25, 2022

> IIUC, one would now be able to do idata["posterior"].ds to get the posterior group as a dataset, but idata["posterior"] would return a DataTree.
Yes that's correct.
> Note: this will probably mean making some changes to ArviZ functions, probably to plotting, but imo it will be worth it to 1) take advantage of the improved datatree features and 2) stop maintaining InferenceData
I agree.
> From a conceptual point of view, having those as subgroups inside their non-warmup counterpart makes sense, but I can't think of any case where we'd want functions to be applied to both warmup and non-warmup groups in the same way.
Okay, we'll have to think about this more carefully.
> If possible, it might be useful to share those dimension/coordinate values between groups, but there are also cases where generating predictions is expensive, for example, and it might be enough to generate them for only a subset of the samples, in which case it would need to be possible to have independent dimensions too. And it would be perfectly fine to keep all dimensions independent while assuming they are the same; nobody has ever complained about this.
Thanks for the explanation about matching samples. I want to make it easier to refer to one coordinate from multiple different groups, which might help with this.
> Yes, I think we can deprecate __add__. (@ahartikainen, Apr 23, 2022)
Great.
> If all the cases covered in az.concat can be done on datatrees with either xr.concat or xr.merge, I think that would not be an issue; otherwise we'd probably keep it but update it to work on datatrees, as it covers important use cases that we'd want to support (even if it only worked on datatrees with only 1 level).
So I expect that xr.concat and xr.merge aren't going to accept DataTree objects until the whole datatree package is implemented upstream in xarray's main repo. But if we made az.concat and az.merge consistent with how we eventually want xr.concat and xr.merge to behave, then that would smooth the eventual transition. Alternatively, we could just forget about concat/merge functions and concentrate on the method syntax for these operations instead, e.g. write DataTree.merge analogously to Dataset.merge.
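For the dim-concatenation use case (e.g. combining chains from two runs of the same model), mapping xr.concat over matching groups already follows xarray conventions; a sketch with dicts of Datasets standing in for DataTree objects:

```python
import xarray as xr

# Two "trees" holding samples from separate runs of the same model.
run1 = {"posterior": xr.Dataset(
    {"mu": (("chain", "draw"), [[0.1, 0.2]])}, coords={"chain": [0]},
)}
run2 = {"posterior": xr.Dataset(
    {"mu": (("chain", "draw"), [[0.3, 0.4]])}, coords={"chain": [1]},
)}

# A DataTree-aware concat could map xr.concat over matching groups:
combined = {
    name: xr.concat([run1[name], run2[name]], dim="chain")
    for name in run1
}
assert combined["posterior"].sizes["chain"] == 2
```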
> For concat, we could always create a couple special functions. Or maybe one which can do both special cases. (@ahartikainen, Apr 23, 2022)
This also seems like a good idea.
> I still haven't figured out how subsetting/filtering works with DataTrees.

That's because it's not really been defined yet 😅 There is now!: xarray-contrib/datatree#79
> Depending on what is possible and your future plans, it might be good to extend some functions; e.g. if dt[["posterior", "posterior_predictive"]] is not possible, then map_over_subtree would benefit from a group kwarg so it doesn't apply the function to all groups/children. At least in our case, we want to be able to apply functions to multiple groups at once, but that hardly ever means all groups at once.

This is useful to know already, but we can discuss details on xarray-contrib/datatree#79