[aggr-] allow ranking rows by key column #2417

midichef · 2024-05-31T06:54:34Z

This PR adds a rank aggregator that returns a list, and a command addcol-rank, which adds a new column with the rank of each row. Ranks are calculated by comparing key columns.

It also fixes a bug in memo-aggregate where long output takes an extremely long time to show up in the statusbar.
To reproduce: run seq 1222333 | vd -, then press z+ and choose list. After the list is calculated, VisiData gets stuck for many seconds showing processing…, because running format() on a long sequence is very slow.
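
One way to bound the cost of formatting a long result for a status message is to render only the first few items. The sketch below is plain Python and is not the change made in this PR; the function name preview and the max_items parameter are hypothetical.

```python
from itertools import islice

def preview(values, max_items=10):
    """Format at most max_items elements of values (hypothetical helper)."""
    # Pull one extra item so we can tell whether anything was left out,
    # without ever walking the whole sequence.
    head = list(islice(values, max_items + 1))
    shown = ', '.join(str(v) for v in head[:max_items])
    suffix = ', ...' if len(head) > max_items else ''
    return '[' + shown + suffix + ']'

short_text = preview([1, 2])
long_text = preview(range(1, 1222334), max_items=3)
```

The work done is proportional to max_items rather than to the length of the sequence, so a list of a million ranks costs the same to preview as a list of eleven.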

I think it's worth having an aggregator for rank, and the need for a simpler solution than the current method has come up before. On the other hand, I know part of VisiData's philosophy is that it's not a spreadsheet. How do people feel about having a rank aggregator?

Also, in its current form, the rank aggregator will give errors when comparing key columns with different types across 2 rows:

File "/home/midichef/.local/lib/python3.10/site-packages/visidata/aggregators.py", line 169, in rank
    keys_sorted = sorted(((rowkey, i) for i, rowkey in enumerate(keys)), key=_key_progress(prog))
TypeError: '<' not supported between instances of 'float' and 'list'

What's the standard way to handle sorting mixed types for Visidata?

saulpw · 2024-06-06T00:04:06Z

What's the standard way to handle sorting mixed types for Visidata?

The standard way is to convert the column into a known type; anything that can't be converted (errors and nulls) then becomes a TypedWrapper, which is sortable with any type. Does that work acceptably here too?
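
The general idea (convert to one known type, and make failures comparable with everything) can be sketched in plain Python. This is only an illustration and does not use VisiData's TypedWrapper classes; the function name convert_or_wrap is hypothetical, and placing unconvertible values last is an arbitrary choice made for this sketch.

```python
def convert_or_wrap(value, type_fn=float):
    """Return a sort key that never compares mismatched types.

    Converted values get the key (0, converted); anything that cannot be
    converted gets (1, ''), so all failures compare equal and sort last.
    """
    try:
        return (0, type_fn(value))
    except (TypeError, ValueError):
        return (1, '')

# The same mix of types that triggered the TypeError: floats, a list, a null.
values = [3.5, [1, 2], None, '2', 10]
ordered = sorted(values, key=convert_or_wrap)
```

Because the first tuple element decides the comparison whenever conversion failed, Python never has to compare a float with a list.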

midichef · 2024-06-22T08:17:37Z

Yes, that seems like it should work. Should the rank aggregator pick the known type, and if so, which one? Or is it the user who should convert the column?

saulpw · 2024-07-01T05:58:51Z

Since it's not obvious which type to pick, the user can convert the column.

saulpw

I love what this is adding, and I think with a few tweaks it would be even more powerful!

saulpw · 2024-07-01T06:06:37Z

visidata/aggregators.py

@@ -142,11 +143,39 @@ def __init__(self, pct, helpstr=''):
    def aggregate(self, col, rows):
        return _percentile(sorted(col.getValues(rows)), self.pct/100, key=float)

+class RankAggregator(Aggregator):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)


A pass-through __init__ is unnecessary and will happen by default.
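
A small self-contained illustration of the point, using stand-in classes rather than VisiData's Aggregator (the names Base, WithPassThrough and WithoutInit are hypothetical):

```python
class Base:
    def __init__(self, name, helpstr=''):
        self.name = name
        self.helpstr = helpstr

class WithPassThrough(Base):
    # Forwards everything unchanged, so it adds nothing.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

class WithoutInit(Base):
    # No __init__ defined: Python looks it up on Base automatically.
    pass

a = WithPassThrough('rank', helpstr='rank rows')
b = WithoutInit('rank', helpstr='rank rows')
```

Both instances end up with identical attributes, which is why the override can simply be deleted.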

saulpw · 2024-07-01T06:13:59Z

visidata/aggregators.py

+    with Progress(gerund='grouping', total=sheet.nRows) as prog:
+        keys_sorted = sorted(((rowkey, i) for i, rowkey in enumerate(keys)), key=_key_progress(prog))
+    # group elements by rowkey
+    with Progress(gerund='ranking', total=sheet.nRows) as prog:


Using these Progress objects separately in serial will reset the progress meter. You only need one of them, with a total=3*sheet.nRows (since there are 3 steps).

If you want to keep the gerunds to indicate the various steps, you can make the first one the outermost, then create the other Progress() objects within that scope with different gerunds (and no total), and call addProgress only on the outer one.
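
The single-meter suggestion can be sketched as follows. To keep the example runnable without VisiData, it uses a stand-in class, FakeProgress, that only counts; it is not VisiData's Progress and imitates just the total and addProgress parts mentioned above. The three steps and the dense ranking are also assumptions made for illustration, not the PR's actual algorithm.

```python
class FakeProgress:
    """Stand-in for a progress meter (hypothetical, counts only)."""
    def __init__(self, gerund='', total=0):
        self.gerund = gerund
        self.total = total
        self.made = 0
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def addProgress(self, n):
        self.made += n

def rank_with_single_meter(keys):
    n = len(keys)
    # One meter covering all three steps, so it never resets to zero.
    with FakeProgress(gerund='ranking', total=3 * n) as prog:
        # Step 1: sort the keys.
        sorted_keys = sorted(keys)
        prog.addProgress(n)
        # Step 2: give each distinct key the next rank (dense ranking).
        rank_of = {}
        for key in sorted_keys:
            rank_of.setdefault(key, len(rank_of) + 1)
            prog.addProgress(1)
        # Step 3: emit ranks in the original row order.
        ranks = []
        for key in keys:
            ranks.append(rank_of[key])
            prog.addProgress(1)
    return ranks, prog

ranks, meter = rank_with_single_meter(['b', 'a', 'b', 'c'])
```

By the end of the function the meter has advanced exactly total units, once per row per step.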

saulpw · 2024-07-01T06:16:55Z

visidata/aggregators.py

+
+    def aggregate(self, col, rows):
+        if not col.sheet.keyCols:
+            vd.error('ranking requires one or more key columns')


Is this actually true? I could see row number being used if there are no key columns. If we remove this check, does that just work?
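
A sketch of what a row-number fallback could look like, in plain Python. Whether removing the check in the PR actually behaves this way is exactly the open question above; the function dense_ranks, its parameters, and the choice of dense ranking are hypothetical.

```python
def dense_ranks(rows, key_getters):
    """Rank rows by key; fall back to row position when there are no keys."""
    if key_getters:
        keys = [tuple(get(row) for get in key_getters) for row in rows]
    else:
        # No key columns: every row is its own group, ordered by position.
        keys = [(i,) for i in range(len(rows))]
    rank_of = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[key] for key in keys]

rows = [{'city': 'b'}, {'city': 'a'}, {'city': 'b'}]
by_city = dense_ranks(rows, [lambda row: row['city']])
by_position = dense_ranks(rows, [])
```

With no key getters the result is simply 1, 2, 3, ..., which matches the row-number reading of the suggestion.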

saulpw · 2024-07-01T06:24:09Z

visidata/aggregators.py

 Sheet.addCommand('+', 'aggregate-col', 'addAggregators([cursorCol], chooseAggregators())', 'Add aggregator to current column')
 Sheet.addCommand('z+', 'memo-aggregate', 'cursorCol.memo_aggregate(chooseAggregators(), selectedRows or rows)', 'memo result of aggregator over values in selected rows for current column')
 ColumnsSheet.addCommand('g+', 'aggregate-cols', 'addAggregators(selectedRows or source[0].nonKeyVisibleCols, chooseAggregators())', 'add aggregators to selected source columns')
+Sheet.addCommand('', 'addcol-rank', 'addcol_list_aggr(cursorCol, "rank")', 'create new column ranking rows by their key columns')


Is this specific to "rank"? What happens if we apply a different list aggregator? Should this take an input? (the default value could be "rank" to make it the easiest option).

Also what happens with non-list aggregators? This could be an instant SUM(curcol) GROUP BY keycols which I think would be a major hit!
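
The SUM(curcol) GROUP BY keycols idea amounts to computing one aggregate per key group and then writing that value back onto every row of the group. A minimal sketch in plain Python, independent of VisiData's API (group_sum_per_row, key_fn and value_fn are hypothetical names):

```python
from collections import defaultdict

def group_sum_per_row(rows, key_fn, value_fn):
    """Return, for each row, the sum of value_fn over rows sharing its key."""
    totals = defaultdict(float)
    # First pass: accumulate one total per key.
    for row in rows:
        totals[key_fn(row)] += value_fn(row)
    # Second pass: look up each row's group total, preserving row order.
    return [totals[key_fn(row)] for row in rows]

rows = [('a', 1), ('b', 2), ('a', 3)]
sums = group_sum_per_row(rows, key_fn=lambda r: r[0], value_fn=lambda r: r[1])
```

A list aggregator such as rank would instead return one value per row within each group, so a generalized command would need to handle both shapes of result.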

midichef added 3 commits May 30, 2024 23:39

[aggr-] add aggregator to rank rows by key column value

1cce746

[aggr-] create addcol-rank command

0009867

[aggr-] cap runtime when formatting memo status

1988303

saulpw requested changes Jul 1, 2024
