Bandwidth factor "experimental" for ArviZ kernel density estimate #1648
I am new to ArviZ, so I am sorry if my questions are painfully obvious. I wanted to reproduce the ArviZ kernel density estimate in [...]. In the ArviZ documentation, I read that the default bandwidth factor is "experimental". Attached is a comparison between the [...].
Another question I wanted to ask is how ArviZ treats the boundaries/tails. I read that ArviZ does not plot the KDE outside the region where I have data. I agree that this is a good feature. I thought that I could replicate the same behavior with the SciPy function by restricting the plot to the interval between the minimum and maximum of my data. However, as you can see in the previous picture, at the edges of the histogram the ArviZ values are quite different from the SciPy ones (even if I tune the bandwidth factor). I guess this is because ArviZ conserves the total probability under the KDE estimate. Therefore, I wanted to know how ArviZ adjusts the KDE at the boundaries of the data.
Thank you in advance for the help.
Tagging @tomicapretto, as he is the one who actually did the work. I know (and remember reading) that he wrote some reports and summaries of the multiple alternatives, but I can't remember where they are. There is a quick overview at #1284, but I can't seem to find the link to the report.
Thank you so much @OriolAbril for tagging me, I wouldn't have seen this otherwise.
Hi @davidedalbosco,
I know that "experimental" is not the best name I could have chosen, sorry for that. This is the report that Oriol refers to. And you can find here more of the notes I've been writing while implementing the new KDE in ArviZ. But let me try to explain what ArviZ does with the "experimental" bandwidth.
First of all, the experimental bandwidth is computed here:
arviz/arviz/stats/density_utils.py
Lines 79 to 83 in b71c83b
Don't worry about the function arguments; they are passed to avoid computing some things more than once. As we can see there, the experimental bandwidth is just the average of two other bandwidths: Silverman's rule and ISJ, which stands for Improved Sheather-Jones. The reason I chose to average those two results can be found in the report. In short, I ran a lot of simulations and concluded that the average gave better results.
On the other hand, there is a boundary correction applied in the tails, because the KDE assumes the range of the variable goes from -infinity to +infinity, but we restrict it to the observed domain. That is explained in this specific notebook.
Feel free to ask more questions if the documentation is not clear or if you are not sure about something I've said.
PS: The method used to estimate the density function is still a Gaussian kernel density estimator.
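To make the two ideas above concrete, here is a minimal standard-library sketch. It is not the ArviZ source. The function names are hypothetical. The ISJ bandwidth is not implemented and is simply passed in as a number computed elsewhere. The boundary handling uses reflection at the data limits, which is one common correction and only an assumption about what ArviZ does internally.

```python
import math
import statistics


def silverman_bandwidth(x):
    """Silverman's rule of thumb: 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    n = len(x)
    std = statistics.pstdev(x)
    # "inclusive" uses linear interpolation between the sorted data points.
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return 0.9 * min(std, (q3 - q1) / 1.34) * n ** (-0.2)


def averaged_bandwidth(x, bw_isj):
    """Average of Silverman's bandwidth and an ISJ bandwidth (hypothetical helper).

    bw_isj is assumed to come from a separate ISJ implementation.
    """
    return 0.5 * (silverman_bandwidth(x) + bw_isj)


def gaussian_kde_reflected(x, grid, bw):
    """Gaussian KDE restricted to [min(x), max(x)] with reflection at both edges.

    Each data point also contributes through its mirror image about the lower
    and upper limits, so the mass that would leak outside the observed range
    is folded back inside (assumed correction, for illustration only).
    """
    lo, hi = min(x), max(x)
    norm = len(x) * bw * math.sqrt(2 * math.pi)

    def kernel(u):
        return math.exp(-0.5 * u * u)

    density = []
    for g in grid:
        total = 0.0
        for xi in x:
            total += kernel((g - xi) / bw)             # original point
            total += kernel((g - (2 * lo - xi)) / bw)  # mirror about lower limit
            total += kernel((g - (2 * hi - xi)) / bw)  # mirror about upper limit
        density.append(total / norm)
    return density
```

A plain Gaussian KDE evaluated only between the minimum and maximum of the data integrates to less than one there, because part of each kernel falls outside the range. With the reflected version the density over the observed range integrates to approximately one, which is why the values near the edges differ from an uncorrected estimate.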