Add count of empty/non empty values in stats #48
Comments
Isn't there already a count of null values? I know at least the |
Unless I missed something,
|
Unless I missed something
As an input to a data science workflow, I'm working with EXIF metadata from 70k+ photos. The CSV source file is about 1.1GB and very sparse. I would like to select only the columns that are mostly used.
This is the output of
$ head -5 metadata_frequency.csv
field,value,count
SourceFile,../data/images/6928025/6928025_3fff3925431f80e0dfc83d48e7229a10.jpg,1
SourceFile,../data/images/7031649/7031649_dce502452d13649aa6682060126e2bfb.jpg,1
SourceFile,../data/images/7198981/7198981_6b7e5ef7587dab908ff8864b7c2c4b55.jpg,1
SourceFile,../data/images/7224251/7224251_b68dec5933ec49d3be9213513c7a8e43.jpg,1
This is the output of
$ head -5 metadata_count.csv
field,value,count
SourceFile,../data/images/6969761/6969761_9b6c7e93ee82a10a406a34c44d900c60.jpg,1
About,uuid:73abd4a8-2f89-11df-9d3c-a53e4fead10b,1
AccelerationVector,-0.9773887673 -0.006948695972 -0.2358681423,1
Accentuation,0.0,73
It just took the first occurrence of "SourceFile" but did not "groupby sum" on it. |
@mratsim Could you please create a new issue and describe the problem you're trying to solve? Your problem seems completely distinct from this issue. In particular, I don't understand what you expect to see and why you think "groupby sum" isn't happening. |
Sorry, I wasn't clear enough. What I'm looking for is a count of the empty or non-empty values for each column, like the OP. Basically
The next paragraph is an explanation of groupby sum and may be skipped.
In database query terms, we would do a
Those are equivalent: count if working on the original csv, groupby sum if in the
From your example in the README, it's turning:
into
897327 being 238985+215938+176546+141986+123872.
Another way to look at it, in terms of functional programming/Rust, is that we are folding (reducing) by summing the values associated with "Country, City, AccentCity" respectively.
Given the format this should probably appear in |
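The "groupby sum" described above can be sketched in a few lines of Python. This is only an illustration of the fold, not how xsv does or would implement it. It assumes a table with the field,value,count header shown earlier in the thread; the function name `sum_counts_by_field` and the value labels in the sample are made up for the example, and only the five counts come from the comment above. The total matches the number of non-empty values in a column only if the table lists every distinct value and handles empty cells the way you expect, which is an assumption here.

```python
import csv
import io
from collections import defaultdict


def sum_counts_by_field(frequency_csv):
    """Fold a field,value,count table into one summed count per field."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(frequency_csv)):
        # "groupby field, sum count": accumulate each row into its field's total.
        totals[row["field"]] += int(row["count"])
    return dict(totals)


# The counts are the five quoted in the comment above.
# The value labels (value_1 ... value_5) are placeholders, not real data.
sample = """field,value,count
Country,value_1,238985
Country,value_2,215938
Country,value_3,176546
Country,value_4,141986
Country,value_5,123872
"""

print(sum_counts_by_field(sample))
```

With the sample above, the single "Country" entry sums to 897327, the figure given in the comment.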
Ah, gotya, thanks, that's clearer. I can think of a couple of work-arounds, although I suppose it would be nice to provide it by default. For example, you could do |
👍 it would be very natural for |
@BurntSushi @seamusabshere I would be happy to implement this and submit a PR that's at least as tidy as what I submitted for
|
Awesome, thank you! Note that there is a related but orthogonal issue: #38. I defer to you on whether you want to also try and solve that problem, but I just wanted to make sure you were aware of it.
I think my only real opinion here is that it shouldn't contain spaces.
If there are, we can just release a new minor version, but I don't think so. In particular, I think CSV data with headers can be expanded with additional columns (and probably even re-orderings of those columns too). |
A |
Unfortunately, this is currently about 647 items down my TODO list, at least until the next time I need it for something at work. :-( If somebody else wants to tackle this, please don't wait for me! |
xsv stats gives a lot of useful information, and I use it a lot.
I'm currently missing one thing that may not be too much of a hassle to add.
Would you consider adding the count of empty/non empty values for each field?
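Until something like this exists in xsv stats, the requested numbers can be computed outside xsv. The following is a minimal Python sketch, assuming a CSV with a header row and treating a cell as empty only when it is the empty string. The function name `count_empty_values` and the sample file names are hypothetical; the column names are borrowed from the EXIF example in the comments.

```python
import csv
import io


def count_empty_values(csv_text):
    """Return {column: {"empty": n, "non_empty": m}} for a CSV with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    counts = {name: {"empty": 0, "non_empty": 0} for name in header}
    for row in reader:
        # zip stops at the shorter sequence, so cells missing from a short
        # row are not counted as either empty or non-empty.
        for name, cell in zip(header, row):
            key = "empty" if cell == "" else "non_empty"
            counts[name][key] += 1
    return counts


# Hypothetical three-row sample; a.jpg, b.jpg and c.jpg are invented names.
sample = "SourceFile,Accentuation\na.jpg,0.0\nb.jpg,\nc.jpg,\n"

for column, tally in count_empty_values(sample).items():
    print(column, tally["empty"], tally["non_empty"])
```

For a file of the size mentioned in the comments (about 1.1GB), the same loop can read from an open file handle instead of a string so the data is streamed row by row.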