Feature request: "aggregate" command #86

d33tah · 2017-07-09T11:55:19Z

Let's say I have a sorted CSV file with one column and I'd like to get an aggregate view like uniq -c, but with proper CSV as an output. Do you consider this worth implementing?

The text was updated successfully, but these errors were encountered:

BurntSushi · 2017-07-09T12:45:44Z

Could you please provide more detail on what exactly you want? "Output like uniq -c" isn't really descriptive enough, since it doesn't tell me what you're looking for. For example, what do you want that isn't already provided by xsv stats or xsv frequency?

d33tah · 2017-07-09T13:01:07Z

@BurntSushi thanks for drawing my attention to xsv frequency! It's pretty close to what I'm looking for, but having a hash table is not really an option - in my case I can't really keep a whole column in memory. If there was a way to add an optional assumption that the file is already sorted, that would do though...

BurntSushi · 2017-07-09T13:20:40Z

@d33tah Did you actually try xsv frequency? It does not store the entire column in memory.

If you only care about one column and the data is already sorted, then just do xsv select the-columns data.csv | uniq -c.

d33tah · 2017-07-09T13:23:03Z

uniq -c doesn't generate CSV and there are edge cases where it breaks output. I did try xsv frequency and with -l 0 I ran out of memory while trying to process a 6GiB file with multiple columns.

I'm new to Rust - I based the observation that it stores everything in a hashmap on this file:

https://github.com/BurntSushi/rust-stats/blob/master/src/frequency.rs

(and the fact that I ran out of RAM)

BurntSushi · 2017-07-09T13:46:10Z

The documentation for xsv frequency states:

Since this computes an exact frequency table, memory proportional to the
cardinality of each column is required.

So it's not the entire column, but rather, all unique values in the column.

By default, xsv frequency computes a frequency table for every single column. You might have better luck computing the table for a single column. You can do that with the -s/--select flag.

More generally, adding another command that assumes the input is sorted does seem useful. But there needs to be a lot more work done on specifying what the exact UX of that is. "Something like uniq -c" is unfortunately not good enough since CSV data has multiple columns. That is, since frequency tables are computed for each column, the sort must apply to a single particular column, which is a bit at odds with how the current implementation favors looking at multiple columns. This in turn probably implies that a flag stating that the data is sorted must also require specifying a single column from which to compute the frequency table.

I'm not sure when I would work on this, but the quickest way to get someone else to do it is to put in the work to provide a proper specification. Roughly, this means specifying exactly which flags get added to xsv frequency and what their semantics are.

d33tah · 2017-07-09T19:11:02Z

To be honest, I have no idea how the UI should work here. :/

micrenda · 2018-07-06T10:15:47Z

Hello,

I think the aggregate function should work something like this. Let suppose we have this table:

Country  AccentCity       Population
es       Barañáin         22264
es       Puerto Real      36946
at       Moosburg         4602

xsv aggregate Country --sum Population --min Population --max Population --count

would produce something like:

Country   Population(sum) Population(min) Population(max) count
es              59210           22264           36946         2
at               4602            4602            4602         1

I don't know if it is worth to implement it. The main issue here I think it is the memory usage.

Interesting options would be:
--sum
--min
--max
--avg
--stdev
--stderr
--count

d33tah · 2018-07-06T10:19:51Z

@micrenda I like the proposal, let's see what @BurntSushi thinks about it. BTW:

The main issue here I think it is the memory usage.

Why do you think so? Most (if not all) of the items you listed have constant memory complexity.

micrenda · 2018-07-06T10:31:06Z

Because you have to keep in memory the group-by keys (in my example: Country, but if could be many more keys).

Personally, it would be useful to me. But I know it is very important to use the philosophy "kiss" (keep it simple stupid).

There is another way to get the same result, using a different program (I am reporting here if someone has my same problem):
q - Text as Data - https://github.com/harelba/q

In this case the solution is something like this:
q "select c2, sum(c3) from ./data.csv group by c2"

BurntSushi · 2018-07-06T13:08:41Z

@micrenda Could you please explain why xsv stats is insufficient for your use case? It literally provides exactly the same information as shown in your example for xsv aggregate.

micrenda · 2018-07-06T13:55:14Z

First of all, I want to make clear I am not promoting this feature. I just tried to explain the intention of the first author of this thread. Personally, I think any tool must be kept as simple as possible: do just a job and do it efficiently. Implementing all possible user-cases is just impossible and could destroy the design of an elegant application.

Returning to our example, the "aggregate" would come from the SQL world. If we have a CSV file in input, it often needs to have in output another CSV with grouped values.

I give a more detailed example. Let suppose we have this input CSV

country,    region,     city,   people,     income
IT,         MI,         Milan,   10000,  50000000.
IT,         MI,         Pavia,    2000,   1000000.
IT,         RM,         Rome,     5000,   8000000.
IT,         PA,         Palermo,  5000,    200000.
FR,         PA,         Paris,   10000,  50000000.
FR,         LY,         Lyon,     6000,   2000000.
UK,         LN,         London,  34000,  60000000.

Let suppose I want to know how many inhabitants there are by country:
xsv aggregate country --sum people

country,  	 sum(people)
IT, 		  22000
FR, 		  10000
FR, 		   6000
UK, 		  34000

Let suppose we want to know the max income by country:
xsv aggregate country --max income

country,  	 max(income)
IT, 		  50000000
FR, 		  50000000
FR, 		   2000000
UK, 		  60000000

Now lets count how many city per region:
xsv aggregate country region --count

country,    region,  count
IT,         MI,         2
IT,         RM,         1
IT,         PA,         1
FR,         PA,         1
FR,         LY,         1
UK,         LN,         1

And now something which is a no-sense, but is still valid:

xsv aggregate region --count

region,  count
MI,         2
RM,         1
PA,         2
LY,         1
LN,         1

As I said before, it is not a simple task and I think it requires a memory usage proportional to the number of distinct grouping keys. So, I am not very sure it is ok to implement it.

BurntSushi · 2018-07-06T14:12:02Z

@micrenda Sure... But all of those things can be done today with xsv stats and xsv frequency...

BurntSushi · 2018-07-06T14:13:31Z

Basically, it's not at all clear to me what @d33tah is asking for here, and how exactly it differs from what's in xsv already.

micrenda · 2018-07-06T14:36:51Z

I gave a look to xsv stats and xsv frequency and I am not very sure they are able to do it (or, at least, in a simple way). For my problem I solved using another program.

Anyway, if you don't need something like this, I don't think you should implement it.
It is possible not many users need it.

BurntSushi · 2018-07-06T14:45:01Z

OK, there has to be a communication problem here. This is why I've been trying to ask y'all to tell me what the difference is between what you want and what xsv stats and xsv frequency already provides. But you just keep describing what you want instead of the diff. From what I can tell, all you want is the ability to compute various statistics over the data. xsv stats does that:

[andrew@Cheetah xsv-86] xsv table data.csv
country  region  city     people  income
IT       MI      Milan    10000   50000000
IT       MI      Pavia    2000    1000000
IT       RM      Rome     5000    8000000
IT       PA      Palermo  5000    200000
FR       PA      Paris    10000   50000000
FR       LY      Lyon     6000    2000000
UK       LN      London   34000   60000000
[andrew@Cheetah xsv-86] xsv stats data.csv | xsv table
field    type     sum        min     max       min_length  max_length  mean                stddev
country  Unicode             FR      UK        2           2
region   Unicode             LN      RM        2           2
city     Unicode             London  Rome      4           7
people   Integer  72000      2000    34000     4

and you also want frequency analysis, that works too:

[andrew@Cheetah xsv-86] xsv frequency data.csv | xsv table
field    value     count
country  IT        4
country  FR        2
country  UK        1
region   PA        2
region   MI        2
region   LY        1
region   LN        1
region   RM        1
city     Paris     1
city     Pavia     1
city     Rome      1
city     Lyon      1
city     Palermo   1
city     Milan     1
city     London    1
people   10000     2
people   5000      2
people   2000      1
people   34000     1
people   6000      1
income   50000000  2
income   60000000  1
income   2000000   1
income   8000000   1
income   1000000   1
income   200000    1

The only thing that I can see that is different is the output format. Is that what y'all are asking for? The same computation but in a different output format?

micrenda · 2018-07-06T15:14:37Z

Hello,

I think I am biased by many years of SQL :)
What I was thinking about was a "GROUP BY" like, with aggregate functions like "SUM, COUNT, MIN, MAX, etc.". If is not just the format, but really a way to consolidate table (it is similar to Excel Pivot tables).

Indeed the frequency and stats commands can perform a subset of these function (count, min, max) but they are not as versatile as the GROUP BY.

I suggest to leave this thread open: maybe other users will arrive here and explain their user-case. Unfortunately, I am not a rust programmer (I am C++ and Java programmer) so I can not contribute with a patch.

BurntSushi · 2018-07-06T15:16:01Z

@d33tah Could you please elaborate on your initial request? If this is really a request for something as elaborate as a general GROUP BY, then this will be easy for me to close, because I don't think xsv will have that. It is not a goal for xsv to become as versatile as SQL.

micrenda · 2018-07-06T15:20:28Z

Ok, let's do so. I can not work on it right now, but next week I will have some spare time to try to implement it. I wanted to learn Rust, so it is a good opportunity for me.

I will implement a patch and then I will send you. If you like, you can integrate it else no problem.

jonathanvx · 2018-11-29T01:48:19Z

There is another tool I just ran into that seems to do this.
https://www.gnu.org/software/datamash/examples/#example_grouping

elifiner · 2019-03-12T02:34:56Z

Definitely need something like this. My particular use case is a long list of transactions (charges and payments) with an invoice number, a person's name and the transaction amount in each row (the amount can be positive or negative). I'd like to compute the total of each invoice, so something like @micrenda's suggestion above would work (slightly modified syntax to use the standard --select option):

xsv aggregate --select Name,InvoiceNo --sum Amount

twmb · 2019-09-20T06:52:47Z

FWIW I think that sqlite3 can suffice here with .import <file.csv> table and aggregating based off that. The output of xsv can be fed into sqlite3. As well, the output of sqlite3 can be fed into xsv.

jackdbd · 2021-08-30T21:04:49Z

Definitely need something like this. My particular use case is a long list of transactions (charges and payments) with an invoice number, a person's name and the transaction amount in each row (the amount can be positive or negative). I'd like to compute the total of each invoice, so something like @micrenda's suggestion above would work (slightly modified syntax to use the standard --select option):
xsv aggregate --select Name,InvoiceNo --sum Amount

I am using Miller for this.

mlr --csv stats1 -a sum -f 'Name,InvoiceNo' -g 'VAT_number' invoices.csv > invoices-grouped-by-vat-number.csv

gavinwahl · 2024-09-06T00:12:28Z

@BurntSushi

OK, there has to be a communication problem here. This is why I've been trying to ask y'all to tell me what the difference is between what you want and what xsv stats and xsv frequency already provides.

The difference would be the ability to group by multiple columns. On your example data, I want the total population for each country,region pair. I would expect an incantation like this: xsv aggregate country,region --sum people to do that.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: "aggregate" command #86

Feature request: "aggregate" command #86

d33tah commented Jul 9, 2017

BurntSushi commented Jul 9, 2017

d33tah commented Jul 9, 2017

BurntSushi commented Jul 9, 2017

d33tah commented Jul 9, 2017 •

edited

Loading

BurntSushi commented Jul 9, 2017

d33tah commented Jul 9, 2017

micrenda commented Jul 6, 2018 •

edited

Loading

d33tah commented Jul 6, 2018 •

edited

Loading

micrenda commented Jul 6, 2018 •

edited

Loading

BurntSushi commented Jul 6, 2018

micrenda commented Jul 6, 2018 •

edited

Loading

BurntSushi commented Jul 6, 2018

BurntSushi commented Jul 6, 2018

micrenda commented Jul 6, 2018

BurntSushi commented Jul 6, 2018

micrenda commented Jul 6, 2018

BurntSushi commented Jul 6, 2018 •

edited

Loading

micrenda commented Jul 6, 2018

jonathanvx commented Nov 29, 2018

elifiner commented Mar 12, 2019

twmb commented Sep 20, 2019

jackdbd commented Aug 30, 2021

gavinwahl commented Sep 6, 2024

Feature request: "aggregate" command #86

Feature request: "aggregate" command #86

Comments

d33tah commented Jul 9, 2017

BurntSushi commented Jul 9, 2017

d33tah commented Jul 9, 2017

BurntSushi commented Jul 9, 2017

d33tah commented Jul 9, 2017 • edited Loading

BurntSushi commented Jul 9, 2017

d33tah commented Jul 9, 2017

micrenda commented Jul 6, 2018 • edited Loading

d33tah commented Jul 6, 2018 • edited Loading

micrenda commented Jul 6, 2018 • edited Loading

BurntSushi commented Jul 6, 2018

micrenda commented Jul 6, 2018 • edited Loading

BurntSushi commented Jul 6, 2018

BurntSushi commented Jul 6, 2018

micrenda commented Jul 6, 2018

BurntSushi commented Jul 6, 2018

micrenda commented Jul 6, 2018

BurntSushi commented Jul 6, 2018 • edited Loading

micrenda commented Jul 6, 2018

jonathanvx commented Nov 29, 2018

elifiner commented Mar 12, 2019

twmb commented Sep 20, 2019

jackdbd commented Aug 30, 2021

gavinwahl commented Sep 6, 2024

d33tah commented Jul 9, 2017 •

edited

Loading

micrenda commented Jul 6, 2018 •

edited

Loading

d33tah commented Jul 6, 2018 •

edited

Loading

micrenda commented Jul 6, 2018 •

edited

Loading

micrenda commented Jul 6, 2018 •

edited

Loading

BurntSushi commented Jul 6, 2018 •

edited

Loading