Bug: _aggregate_data() with integer set labels #285

Open
redst4r opened this issue Aug 27, 2024 · 0 comments

Hi,

I just ran into a somewhat obscure bug: if you label your sets with integers, UpSet.plot() can report wrong intersection sizes under certain circumstances. I tracked the issue down to the reformat._aggregate_data function and how it handles integer set names. Here's a minimal example:

import upsetplot
data_df = upsetplot.from_contents({
    2: set(('A','B','C')),
    10: set(('D','E')),
    20: set(('D', 'E')),
})
data, agg = upsetplot.reformat._aggregate_data(data_df, subset_size='count', sum_over=None)
agg
# 2      10     20   
# True   False  True     3
# False  True   False    2
# Name: size, dtype: int64

Notice how set 2 has no overlap with any of the other sets, yet agg reports 3 items shared between 2 and 20. Also it wrongly reports 2 items exclusive to set 10. Note: If you change the set names to strings, i.e. '2','10','20', it works out fine.
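As a stopgap, the note about string labels suggests converting integer labels to strings before building the data. The following is only a minimal sketch of that idea; `contents` and `safe_contents` are illustrative names, and it assumes no two labels collapse to the same string (for example both 2 and '2' being used as keys).

```python
# Same sets as in the minimal example, keyed by integers.
contents = {
    2: {"A", "B", "C"},
    10: {"D", "E"},
    20: {"D", "E"},
}

# Stringify every label so that none of them can be mistaken for a level position.
safe_contents = {str(label): members for label, members in contents.items()}

# Guard against two different labels mapping to the same string.
assert len(safe_contents) == len(contents)

print(list(safe_contents))
```

The resulting dict can then be passed to upsetplot.from_contents() in place of the original one.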

It all comes down to the fact that I used integers to label the sets, and in particular that one of them (2) is $<$ the number of sets present, so it is also a valid level position (0, 1 or 2).
Here's the relevant line in reformat._aggregate_data:

gb = df.groupby(level=list(range(df.index.nlevels)), sort=False)

We're grouping by level=[0,1,2]. Notice how 2 is ambiguous here: it's supposed to refer to the level position (in this case the 3rd set, i.e. set 20), but it is ALSO the name of a set (the 1st set, i.e. set 2)! groupby() seems to give priority to the set name rather than the level position, so set 2 ends up being grouped twice (basically intersected with itself) while set 20 is ignored entirely.
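The name-versus-position ambiguity can be seen without upsetplot. The sketch below builds a small MultiIndex by hand as a stand-in for the membership index (it is an assumption that this mirrors what from_contents produces) and asks pandas for level 2:

```python
import pandas as pd

# Hand-built stand-in for the boolean membership index: A, B, C are only in
# set 2; D, E are in sets 10 and 20. The level names are integers.
rows = [(True, False, False)] * 3 + [(False, True, True)] * 2
index = pd.MultiIndex.from_tuples(rows, names=[2, 10, 20])

# The integer 2 could mean "position 2" (the level named 20)
# or "the level named 2" (which sits at position 0).
by_key = list(index.get_level_values(2))

first_level = [row[0] for row in rows]   # membership in set 2
third_level = [row[2] for row in rows]   # membership in set 20

print("matches level named 2:    ", by_key == first_level)
print("matches level at position 2:", by_key == third_level)
```

If the name takes priority, as the aggregated output above suggests, the lookup returns the membership column of set 2 rather than that of set 20.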

Two options to fix this:

  1. Disallow integer set names.
  2. Fix the groupby operation, e.g. call groupby() on the actual column names rather than on level indices:
names = df.index.names
gb = df.reset_index().groupby(by=names, sort=False)

# gb.size() ->
# 2      10     20   
# True   False  False    3
# False  True   True     2
# dtype: int64

I'm not entirely sure whether option 2 would cause trouble for the other functionality (weighted aggregates, summing categories, etc.).
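A softer variant of option 1 would be to reject (or warn about) only those set names that actually collide with a level position. Here is a rough sketch of such a check; `shadowed_positions` is a hypothetical helper, not part of upsetplot, and it assumes the lookup tries names before positions, as the behaviour above suggests.

```python
def shadowed_positions(names):
    """Map each level position to the position a name-first lookup would pick instead.

    Only positions that get redirected to a different level are returned.
    """
    names = list(names)
    shadowed = {}
    for pos in range(len(names)):
        # A position is hijacked if some level carries that integer as its name
        # and that level sits at a different position.
        if pos in names:
            hit = names.index(pos)
            if hit != pos:
                shadowed[pos] = hit
    return shadowed


print(shadowed_positions([2, 10, 20]))
print(shadowed_positions(["2", "10", "20"]))
```

An empty result means grouping by list(range(nlevels)) is unambiguous for those names; a non-empty one lists the positions that would silently be redirected to another level.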
