Feature request: "aggregate" command #86
Comments
Could you please provide more detail on what exactly you want? "Output like |
@BurntSushi thanks for drawing my attention to |
@d33tah Did you actually try If you only care about one column and the data is already sorted, then just do |
I'm new to Rust - I based the observation that it stores everything in a hashmap on this file: https://github.com/BurntSushi/rust-stats/blob/master/src/frequency.rs (and the fact that I ran out of RAM) |
The documentation for
So it's not the entire column, but rather, all unique values in the column. By default, More generally, adding another command that assumes the input is sorted does seem useful. But there needs to be a lot more work done on specifying what the exact UX of that is. "Something like I'm not sure when I would work on this, but the quickest way to get someone else to do it is to put in the work to provide a proper specification. Roughly, this means specifying exactly which flags get added to |
To be honest, I have no idea how the UI should work here. :/ |
Hello, I think the aggregate function should work something like this. Let's suppose we have this table:
would produce something like:
I don't know if it is worth implementing. The main issue here, I think, is the memory usage. Interesting options would be: |
@micrenda I like the proposal, let's see what @BurntSushi thinks about it. BTW:
Why do you think so? Most (if not all) of the items you listed have constant memory complexity. |
Because you have to keep the group-by keys in memory (in my example: Country, but it could be many more keys). Personally, it would be useful to me. But I know it is very important to follow the "KISS" philosophy (keep it simple, stupid). There is another way to get the same result, using a different program (I am reporting it here in case someone has the same problem): In this case the solution is something like this: |
@micrenda Could you please explain why |
First of all, I want to make clear I am not promoting this feature. I just tried to explain the intention of the first author of this thread. Personally, I think any tool must be kept as simple as possible: do just one job and do it efficiently. Implementing all possible use cases is just impossible and could destroy the design of an elegant application. Returning to our example, the "aggregate" would come from the SQL world. Given a CSV file as input, one often needs another CSV with grouped values as output. Here is a more detailed example. Let's suppose we have this input CSV
Let's suppose I want to know how many inhabitants there are by country:
Let's suppose we want to know the max income by country:
Now let's count how many cities there are per region:
And now something that is nonsensical, but still valid:
As I said before, it is not a simple task, and I think it requires memory proportional to the number of distinct grouping keys. So I am not sure it is OK to implement it. |
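The memory point above can be illustrated with a minimal Python sketch of a hash-based group-by. It is not how xsv works and not a proposed implementation; the column names (Country, City, Inhabitants) and the sample rows are hypothetical placeholders, since the tables from this thread are not shown. The sketch keeps one small state record per distinct key, so memory grows with the number of distinct keys rather than with the number of rows.

```python
import csv
import io


def aggregate(rows, key_col, value_col):
    """Return {key: {"count", "sum", "max"}}, holding one state dict per distinct key."""
    groups = {}
    for row in rows:
        key = row[key_col]
        value = int(row[value_col])
        state = groups.get(key)
        if state is None:
            # First time this key is seen: allocate its running state.
            groups[key] = {"count": 1, "sum": value, "max": value}
        else:
            # Key already known: update in place, no per-row storage.
            state["count"] += 1
            state["sum"] += value
            state["max"] = max(state["max"], value)
    return groups


# Hypothetical sample data, unsorted on purpose.
sample = io.StringIO(
    "Country,City,Inhabitants\n"
    "A,A1,100\n"
    "B,B1,50\n"
    "A,A2,300\n"
    "B,B2,70\n"
    "A,A3,200\n"
)
result = aggregate(csv.DictReader(sample), "Country", "Inhabitants")
for country, state in result.items():
    print(country, state["count"], state["sum"], state["max"])
```

Count, sum, min and max all fit this pattern because each needs only a fixed-size running value per key. The input does not have to be sorted, which is the trade-off against a streaming approach.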
@micrenda Sure... But all of those things can be done today with |
Basically, it's not at all clear to me what @d33tah is asking for here, and how exactly it differs from what's in |
I gave a look to Anyway, if you don't need something like this, I don't think you should implement it. |
OK, there has to be a communication problem here. This is why I've been trying to ask y'all to tell me what the difference is between what you want and what
and you also want frequency analysis, that works too:
The only thing that I can see that is different is the output format. Is that what y'all are asking for? The same computation but in a different output format? |
Hello, I think I am biased by many years of SQL :) Indeed, the frequency and stats commands can perform a subset of these functions (count, min, max), but they are not as versatile as GROUP BY. I suggest leaving this thread open: maybe other users will arrive here and explain their use case. Unfortunately, I am not a Rust programmer (I am a C++ and Java programmer), so I cannot contribute a patch. |
@d33tah Could you please elaborate on your initial request? If this is really a request for something as elaborate as a general |
OK, let's do so. I cannot work on it right now, but next week I will have some spare time to try to implement it. I wanted to learn Rust, so it is a good opportunity for me. I will implement a patch and send it to you. If you like it, you can integrate it; if not, no problem. |
There is another tool I just ran into that seems to do this. |
Definitely need something like this. My particular use case is a long list of transactions (charges and payments) with an invoice number, a person's name and the transaction amount in each row (the amount can be positive or negative). I'd like to compute the total of each invoice, so something like @micrenda's suggestion above would work (slightly modified syntax to use the standard
|
FWIW I think that sqlite3 can suffice here with |
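The exact sqlite3 invocation meant in the comment above is not shown, so the following is only a sketch of the general idea, using Python's standard-library sqlite3 module instead of the sqlite3 command-line shell. The column names (invoice, name, amount) and the sample rows are hypothetical, modeled on the invoice use case described earlier in the thread. It loads the CSV into an in-memory table and lets SQL do the GROUP BY.

```python
import csv
import io
import sqlite3


def invoice_totals(csv_file):
    """Load CSV rows into an in-memory SQLite table and sum `amount` per invoice."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tx (invoice TEXT, name TEXT, amount INTEGER)")
    # DictReader yields one mapping per row; named placeholders pick the columns.
    conn.executemany(
        "INSERT INTO tx VALUES (:invoice, :name, :amount)",
        csv.DictReader(csv_file),
    )
    rows = conn.execute(
        "SELECT invoice, SUM(amount) FROM tx GROUP BY invoice ORDER BY invoice"
    ).fetchall()
    conn.close()
    return rows


# Hypothetical transactions: charges are positive, payments negative.
sample = io.StringIO(
    "invoice,name,amount\n"
    "1001,Alice,120\n"
    "1002,Bob,75\n"
    "1001,Alice,-20\n"
)
totals = invoice_totals(sample)
for invoice, total in totals:
    print(invoice, total)
```

Amounts are kept as integers here to avoid floating-point rounding; real monetary data might be stored in cents for the same reason.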
I am using Miller for this. mlr --csv stats1 -a sum -f 'Name,InvoiceNo' -g 'VAT_number' invoices.csv > invoices-grouped-by-vat-number.csv |
The difference would be the ability to group by multiple columns. On your example data, I want the total population for each country,region pair. I would expect an incantation like this: |
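Grouping by several columns, as requested above, can be sketched in Python by using a tuple of the key columns as the dictionary key. This is an illustration only: the function names, the column names (Country, Region, City, Population), the output column name and the sample rows are all assumptions, since the example data and the proposed command line are not shown here.

```python
import csv
import io


def group_sum(rows, key_cols, value_col):
    """Sum `value_col` for each distinct combination of the `key_cols` values."""
    totals = {}
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        totals[key] = totals.get(key, 0) + int(row[value_col])
    return totals


def write_grouped(totals, key_cols, value_col, dst):
    """Write one CSV row per group: the key columns followed by the sum."""
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow([*key_cols, f"sum_{value_col}"])
    for key, total in totals.items():
        writer.writerow([*key, total])


# Hypothetical sample data.
sample = io.StringIO(
    "Country,Region,City,Population\n"
    "A,North,A1,100\n"
    "A,South,A2,300\n"
    "A,North,A3,200\n"
    "B,North,B1,50\n"
)
keys = ["Country", "Region"]
totals = group_sum(csv.DictReader(sample), keys, "Population")
out = io.StringIO()
write_grouped(totals, keys, "Population", out)
print(out.getvalue(), end="")
```

Groups are emitted in the order their keys first appear in the input, because Python dictionaries preserve insertion order.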
Let's say I have a sorted CSV file with one column and I'd like to get an aggregate view like uniq -c, but with proper CSV as an output. Do you consider this worth implementing?
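For illustration, the request can be sketched in a few lines of Python. This is not an xsv feature and not a proposed implementation; the function names and the sample data are hypothetical. Because the input is assumed to be sorted, equal values are adjacent, so itertools.groupby can count each run while holding only the current value in memory.

```python
import csv
import io
from itertools import groupby


def count_runs(rows, column=0):
    """Yield (value, count) for each run of equal adjacent values in `column`."""
    for value, run in groupby(rows, key=lambda row: row[column]):
        yield value, sum(1 for _ in run)


def uniq_c_csv(src, dst, column=0):
    """Read CSV from `src` and write value,count rows to `dst`, one run at a time."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to write
    writer.writerow([header[column], "count"])
    writer.writerows(count_runs(reader, column))


# Hypothetical sorted, single-column input.
sample = io.StringIO("word\napple\napple\nbanana\ncherry\ncherry\ncherry\n")
out = io.StringIO()
uniq_c_csv(sample, out)
print(out.getvalue(), end="")
```

As with uniq -c, unsorted input is not an error: a value that appears in two separate runs is simply reported twice.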