Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WeightedGroupBy #272

Merged
merged 76 commits into from
Apr 7, 2023
Merged

WeightedGroupBy #272

merged 76 commits into from
Apr 7, 2023

Conversation

AdamOrmondroyd
Copy link
Collaborator

@AdamOrmondroyd AdamOrmondroyd commented Mar 10, 2023

Description

Statistics of WeightedDataFrame.groupby such as mean() ignore weights. Since my ambitious attempt to fix this in pandas itself is looking very unlikely to get through, I have attempted to fix it here. However, they have suggested adding a self._gb_cls attribute to Series and DataFrame, which .groupby() returns. For now, I have copied the methods.

Fixes #260

Checklist:

  • I have performed a self-review of my own code
  • My code is PEP8 compliant (flake8 anesthetic tests)
  • My code contains compliant docstrings (pydocstyle --convention=numpy anesthetic)
  • New and existing unit tests pass locally with my changes (python -m pytest)
  • I have added tests that prove my fix is effective or that my feature works
  • I have appropriately incremented the semantic version number in both README.rst and anesthetic/_version.py
  • Get the mean().mean() case working (test currently commented out). This seems like it would require significantly more work...

@AdamOrmondroyd AdamOrmondroyd changed the title Groupby WeightedGroupby Mar 10, 2023
@AdamOrmondroyd AdamOrmondroyd changed the title WeightedGroupby WeightedGroupBy Mar 10, 2023
@codecov
Copy link

codecov bot commented Mar 10, 2023

Codecov Report

Merging #272 (85fa6ae) into master (1779dad) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #272   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           30        30           
  Lines         2555      2643   +88     
=========================================
+ Hits          2555      2643   +88     
Impacted Files Coverage Δ
anesthetic/_version.py 100.00% <100.00%> (ø)
anesthetic/samples.py 100.00% <100.00%> (ø)
anesthetic/utils.py 100.00% <100.00%> (ø)
anesthetic/weighted_pandas.py 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@AdamOrmondroyd
Copy link
Collaborator Author

AdamOrmondroyd commented Mar 10, 2023

I'm aware coverage won't be complete as the test only considers the mean. Currently trying to think of a more efficient way of writing the tests

@lukashergt
Copy link
Collaborator

Since my ambitious attempt to fix this in pandas itself is looking very unlikely to get through

The latest comment sounded in favour of the fix, so I'd still have hope...

What does the transform in udf.groupby('group').transform('mean') do? Is that something we could use/leverage?

@AdamOrmondroyd
Copy link
Collaborator Author

AdamOrmondroyd commented Mar 13, 2023

What does the transform in udf.groupby('group').transform('mean') do? Is that something we could use/leverage?

pandas.core.groupby.DataFrameGroupBy.transform(func) takes func and applies it to each group. What we might be able to use/create is something like .transform('weight') to calculate the total weight of each group, which might be useable for the mean of means...

@AdamOrmondroyd
Copy link
Collaborator Author

AdamOrmondroyd commented Mar 14, 2023

@lukashergt The failing tests are to do with building documentation, which I'll be honest I could do with some pointers of what you might like to see here.

Edit: sticking underscores in front of everything hasn't fixed everything, and I probably should actually take care with the documentation for groupby

AdamOrmondroyd and others added 7 commits March 14, 2023 12:26
…erencing pandas is a pain, some of its classes lack docs
* `GroupBy` does not have its own docs.
  - Its initialisation signature looks like a core dump, hence
    implementing our own initialisation function.
  - Trying to cross-reference as
    ``:class:`pandas.core.groupby.GroupBy` `` will fail, hence dropping
    the link attempt. Same goes for `SeriesGroupBy` and
    `DataFrameGroupBy`.
* Dropping `kurtosis` from `WeightedGroupBy`, since it is not
  implemented in `pandas.core.croupby.GroupBy`. Leave that to tacke
  once/if/when we really need it.
* Add docstring adjustments for `WeightedDataFrameGroupBy` and
  `WeightedSeriesGroupBy` to the end of `weighted_pandas` in the same
  way as previously done for `WeightedDataFrame` and `WeightedSeries`.
…ince they have essentially no documentation anyhow
@AdamOrmondroyd
Copy link
Collaborator Author

Even in native pandas they have different formatting

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df.hist('Max Speed')

Screenshot 2023-03-30 at 14 29 49

df['Max Speed'].plot.hist()

Screenshot 2023-03-30 at 14 34 38

@AdamOrmondroyd
Copy link
Collaborator Author

I guess I should probably test all the plot types...

@AdamOrmondroyd AdamOrmondroyd mentioned this pull request Apr 3, 2023
11 tasks
Comment on lines +44 to +48
def kurt(self, *args, **kwargs): # noqa: D102
return self._add_weights("kurt", *args, **kwargs)

def kurtosis(self, *args, **kwargs): # noqa: D102
return self._add_weights("kurtosis", *args, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to define kurt and kurtosis here, but don't need to define skew? Yet skew still passes the tests...?

I first thought this had to do with pandas GroupBy not overriding skew as opposed to mean, median, std and var. However, the same would go for kurt and kurtosis, so now I wonder what is different/special about skew?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi I've found that skew needs adding for pandas 2 (otherwise it fails the is weighted test), so could add it in here now

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit of an odd one -- pandas 1.5 doesn't actually implement groupby kurt and kurtosis (despite the fact I believe it could). @AdamOrmondroyd does it need adding for pandas 2? My instinct is to leave it in, as it doesn't do any harm.

tests/test_samples.py Show resolved Hide resolved
tests/test_samples.py Outdated Show resolved Hide resolved
@williamjameshandley
Copy link
Collaborator

Is there anything else which needs to be done?

@AdamOrmondroyd
Copy link
Collaborator Author

Is there anything else which needs to be done?

I was working my way through testing all the plots, but so far I haven't found any issues with the plots themselves

@AdamOrmondroyd
Copy link
Collaborator Author

Had a fruitless look into why it's failing on pip but not conda, isn't as simple as numpy 1.23 vs 1.24

@lukashergt
Copy link
Collaborator

Had a fruitless look into why it's failing on pip but not conda, isn't as simple as numpy 1.23 vs 1.24

I think it was a numerical/floating point thing where the covariance matrix turned out positive definite despite having a linear dependent parameter. Anyhow, more helped more (85fa6ae)...

@AdamOrmondroyd
Copy link
Collaborator Author

@lukashergt if you're happy with cov/GelmanRubin then I reckon we're good to go

Copy link
Collaborator

@lukashergt lukashergt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, with the Gelman--Rubin statistic fixed thanks to the functioning groupby I am happy now. If you are happy too, feel free to squash and merge.

Thanks, @AdamOrmondroyd, @williamjameshandley

@AdamOrmondroyd AdamOrmondroyd merged commit 17f1ae4 into handley-lab:master Apr 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WeightedDataFrame.groupby.mean does not produce weighted means
3 participants