Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[aggr-] allow ranking rows by key column #2417

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

midichef
Copy link
Contributor

This PR adds a rank aggregator that returns a list, and a command addcol-rank, which adds a new column with the rank of each row. Ranks are calculated by comparing key columns.

It also fixes a bug in memo-aggregate where long output takes an extremely long time to show up in the statusbar.
For example: seq 1222333 |vd -, then z+ list. After the list is calculated, visidata will get stuck for many seconds showing processing…, because it's very slow to run format() on a long sequence.

I think it's worth having an aggregator for rank, and the need for a simpler solution than the current method has come up before. On the other hand, I know part of Visidata philosophy is that it's not a spreadsheet. How do people feel about having a rank aggregator?

Also, in its current form, the rank aggregator will give errors when comparing key columns with different types across 2 rows:

File "/home/midichef/.local/lib/python3.10/site-packages/visidata/aggregators.py", line 169, in rank
    keys_sorted = sorted(((rowkey, i) for i, rowkey in enumerate(keys)), key=_key_progress(prog))
TypeError: '<' not supported between instances of 'float' and 'list'

What's the standard way to handle sorting mixed types for Visidata?

@saulpw
Copy link
Owner

saulpw commented Jun 6, 2024

What's the standard way to handle sorting mixed types for Visidata?

The standard way is to convert the column into a known type, and then anything that can't be converted (errors and nulls) become TypedWrappers which are sortable with any type. Does that work acceptably here too?

@midichef
Copy link
Contributor Author

Yes, that seems like it should work. Should the rank aggregator pick the known type, and if so, which one? Or is it the user who should convert the column?

@saulpw
Copy link
Owner

saulpw commented Jul 1, 2024

Since it's not obvious which type to pick, the user can convert the column.

Copy link
Owner

@saulpw saulpw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love what this is adding, and I think with a few tweaks it would be even more powerful!

@@ -142,11 +143,39 @@ def __init__(self, pct, helpstr=''):
def aggregate(self, col, rows):
return _percentile(sorted(col.getValues(rows)), self.pct/100, key=float)

class RankAggregator(Aggregator):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pass-through __init__ is unnecessary and will happen by default.

with Progress(gerund='grouping', total=sheet.nRows) as prog:
keys_sorted = sorted(((rowkey, i) for i, rowkey in enumerate(keys)), key=_key_progress(prog))
# group elements by rowkey
with Progress(gerund='ranking', total=sheet.nRows) as prog:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using these Progress objects separately in serial will reset the progress meter. You only need one of them, with a total=3*sheet.nRows (since there are 3 steps).

If you want to keep the gerunds to indicate the various steps, you can have the first one be 'outermost', and then then other Progress() within that scope with different gerunds (and no total), and only addProgress on the outer one.


def aggregate(self, col, rows):
if not col.sheet.keyCols:
vd.error('ranking requires one or more key columns')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually true? I could see row number being used if there are no key columns. If we remove this check, does that just work?

Sheet.addCommand('+', 'aggregate-col', 'addAggregators([cursorCol], chooseAggregators())', 'Add aggregator to current column')
Sheet.addCommand('z+', 'memo-aggregate', 'cursorCol.memo_aggregate(chooseAggregators(), selectedRows or rows)', 'memo result of aggregator over values in selected rows for current column')
ColumnsSheet.addCommand('g+', 'aggregate-cols', 'addAggregators(selectedRows or source[0].nonKeyVisibleCols, chooseAggregators())', 'add aggregators to selected source columns')
Sheet.addCommand('', 'addcol-rank', 'addcol_list_aggr(cursorCol, "rank")', 'create new column ranking rows by their key columns')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this specific to "rank"? What happens if we apply a different list aggregator? Should this take an input? (the default value could be "rank" to make it the easiest option).

Also what happens with non-list aggregators? This could be an instant SUM(curcol) GROUP BY keycols which I think would be a major hit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants