Bug: `_aggregate_data()` with integer set labels #285

redst4r · 2024-08-27T10:45:26Z

Hi,

I just ran into this somewhat obscure bug: If you use integers to label your sets, under certain circumstances, UpSet.plot() returns wrong intersection sizes. I tracked the issue down to the reformat._aggregate_data function and how it handles integer set names. Here's a minimal example:

import upsetplot
data_df = upsetplot.from_contents({
    2: set(('A','B','C')),
    10: set(('D','E')),
    20: set(('D', 'E')),
})
data, agg = upsetplot.reformat._aggregate_data(data_df, subset_size='count', sum_over=None)
agg
# 2      10     20   
# True   False  True     3
# False  True   False    2
# Name: size, dtype: int64

Notice how set 2 has no overlap with any of the other sets, yet agg reports 3 items shared between 2 and 20. Also it wrongly reports 2 items exclusive to set 10. Note: If you change the set names to strings, i.e. '2','10','20', it works out fine.

It all has to do with the fact that I used integers to label the sets, and in particular one of them (2) is smaller than the number of sets present, so it doubles as a valid level position (0, 1 or 2).
Here's the relevant line in reformat._aggregate_data:

gb = df.groupby(level=list(range(df.index.nlevels)), sort=False)

We're grouping by level=[0,1,2]. Notice how 2 is ambiguous here: it's supposed to refer to the level position (in this case the 3rd set, i.e. set 20), but it is ALSO the name of a set (the 1st set, i.e. set 2)! groupby() seems to give priority to the set name over the level position, so the third grouping key ends up being set 2 again, and we're basically intersecting set 2 with itself.
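The name-over-position lookup can be checked with pandas alone. The sketch below rebuilds the membership index from the example by hand (so it does not call upsetplot) and uses `MultiIndex.get_level_values`, which resolves an integer against the level names before treating it as a position. That `groupby(level=...)` resolves its levels the same way is my assumption, based on the output above.

```python
import pandas as pd

# Hand-built membership index matching the example:
# A, B, C are only in set 2; D, E are in sets 10 and 20.
index = pd.MultiIndex.from_tuples(
    [(True, False, False)] * 3 + [(False, True, True)] * 2,
    names=[2, 10, 20],
)
df = pd.DataFrame({"id": list("ABCDE")}, index=index)

# Asking for "level 2" matches the level *named* 2, which sits at position 0,
# so we get the membership of set 2 instead of set 20.
by_number = index.get_level_values(2)
print(list(by_number))

# Grouping on the names as ordinary columns avoids the level lookup entirely.
names = list(index.names)
sizes = df.reset_index().groupby(by=names, sort=False).size()
print(sizes)
```

Here `by_number` carries the values of the first level, which is the mix-up described above, while `sizes` is keyed by the true membership pattern of each group.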

Two options to fix this:

1) disallow integer set names.
2) fix the groupby operation, e.g. groupby() on actual column names, rather than level indices:

names = df.index.names
gb = df.reset_index().groupby(by=names, sort=False)

# gb.size() ->
# 2      10     20   
# True   False  False    3
# False  True   True     2
# dtype: int64

Not entirely sure if 2) would cause any trouble with the other functionality (weighted aggregates, summing categories etc).
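For option 1), a blanket ban on integers may be stricter than needed. As a rough sketch, only integer names that equal the position of a *different* level get misread by a lookup over `range(nlevels)`. The helper name `find_ambiguous_set_names` is hypothetical and not part of upsetplot:

```python
import numbers


def find_ambiguous_set_names(names):
    """Return integer set names that collide with another level's position.

    Hypothetical helper, not part of upsetplot. A name is flagged when it is
    an integer in range(len(names)) and differs from its own position, which
    is the case where a name-first lookup picks the wrong level.
    """
    n_levels = len(names)
    return [
        name
        for position, name in enumerate(names)
        if isinstance(name, numbers.Integral)
        and not isinstance(name, bool)
        and 0 <= name < n_levels
        and name != position
    ]


print(find_ambiguous_set_names([2, 10, 20]))        # [2]
print(find_ambiguous_set_names(["2", "10", "20"]))  # []
```

A check like this could raise or warn before aggregation, but it only guards against the collision. Option 2) removes the ambiguity itself.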
