-
-
Notifications
You must be signed in to change notification settings - Fork 274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[aggr-] allow ranking rows by key column #2417
base: develop
Are you sure you want to change the base?
Conversation
The standard way is to convert the column into a known type, and then anything that can't be converted (errors and nulls) become |
Yes, that seems like it should work. Should the rank aggregator pick the known type, and if so, which one? Or is it the user who should convert the column? |
Since it's not obvious which type to pick, the user can convert the column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love what this is adding, and I think with a few tweaks it would be even more powerful!
@@ -142,11 +143,39 @@ def __init__(self, pct, helpstr=''): | |||
def aggregate(self, col, rows): | |||
return _percentile(sorted(col.getValues(rows)), self.pct/100, key=float) | |||
|
|||
class RankAggregator(Aggregator): | |||
def __init__(self, *args, **kwargs): | |||
super().__init__(*args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A pass-through __init__
is unnecessary and will happen by default.
with Progress(gerund='grouping', total=sheet.nRows) as prog: | ||
keys_sorted = sorted(((rowkey, i) for i, rowkey in enumerate(keys)), key=_key_progress(prog)) | ||
# group elements by rowkey | ||
with Progress(gerund='ranking', total=sheet.nRows) as prog: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using these Progress objects separately in serial will reset the progress meter. You only need one of them, with a total=3*sheet.nRows
(since there are 3 steps).
If you want to keep the gerunds to indicate the various steps, you can have the first one be 'outermost', and then then other Progress() within that scope with different gerunds (and no total), and only addProgress on the outer one.
|
||
def aggregate(self, col, rows): | ||
if not col.sheet.keyCols: | ||
vd.error('ranking requires one or more key columns') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually true? I could see row number being used if there are no key columns. If we remove this check, does that just work?
Sheet.addCommand('+', 'aggregate-col', 'addAggregators([cursorCol], chooseAggregators())', 'Add aggregator to current column') | ||
Sheet.addCommand('z+', 'memo-aggregate', 'cursorCol.memo_aggregate(chooseAggregators(), selectedRows or rows)', 'memo result of aggregator over values in selected rows for current column') | ||
ColumnsSheet.addCommand('g+', 'aggregate-cols', 'addAggregators(selectedRows or source[0].nonKeyVisibleCols, chooseAggregators())', 'add aggregators to selected source columns') | ||
Sheet.addCommand('', 'addcol-rank', 'addcol_list_aggr(cursorCol, "rank")', 'create new column ranking rows by their key columns') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this specific to "rank"? What happens if we apply a different list aggregator? Should this take an input? (the default value could be "rank" to make it the easiest option).
Also what happens with non-list aggregators? This could be an instant SUM(curcol) GROUP BY keycols
which I think would be a major hit!
This PR adds a
rank
aggregator that returns a list, and a commandaddcol-rank
, which adds a new column with the rank of each row. Ranks are calculated by comparing key columns.It also fixes a bug in
memo-aggregate
where long output takes an extremely long time to show up in the statusbar.For example:
seq 1222333 |vd -
, thenz+
list
. After the list is calculated, visidata will get stuck for many seconds showingprocessing…
, because it's very slow to runformat()
on a long sequence.I think it's worth having an aggregator for rank, and the need for a simpler solution than the current method has come up before. On the other hand, I know part of Visidata philosophy is that it's not a spreadsheet. How do people feel about having a rank aggregator?
Also, in its current form, the rank aggregator will give errors when comparing key columns with different types across 2 rows:
What's the standard way to handle sorting mixed types for Visidata?