Feature request: xsv sort --unique #130

Open
kjedamzik opened this issue Jul 24, 2018 · 8 comments

@kjedamzik

would be nice to have a --unique for xsv sort
for ex.:

echo -e 'foo,last_name\nB,kofi\nA,elmo\nC,elmo' |xsv sort -u last_name
foo,last_name
A,elmo
B,kofi
@BurntSushi
Owner

See also #82.

@Yomguithereal
Contributor

Yomguithereal commented Oct 28, 2019

I would also vouch for this. I find it better suited to Unix workflows, and it kind of mimics the sort -u idiom. I am not fond of #82's uniq command.

I would also filter duplicate rows on strict equality of the column selection. This means that in some cases one row would be arbitrarily chosen over another, but we can't be too clever about it anyway, unless we add a flag forcing equality to be checked on the whole line or on another selection of fields.

I can probably open a PR about this if required.
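
To make the proposed semantics concrete, here is a minimal Python sketch, not xsv's implementation: it sorts on one column and keeps the first row seen for each key. The function name `sort_unique` is hypothetical, and relying on a stable sort to decide which duplicate survives is an assumption of this sketch rather than something the thread specifies.

```python
import csv
import io


def sort_unique(csv_text, column):
    """Sort rows by one column and keep the first row seen for each key."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    key_index = header.index(column)

    # sorted() is stable, so rows with equal keys keep their input order.
    rows = sorted(reader, key=lambda row: row[key_index])

    seen = set()
    kept = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            # Equality is checked on the selected column only, so which
            # duplicate survives depends purely on the input order.
            seen.add(key)
            kept.append(row)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(kept)
    return out.getvalue()


# Input taken from the example at the top of this issue.
print(sort_unique("foo,last_name\nB,kofi\nA,elmo\nC,elmo\n", "last_name"), end="")
```

With the input from the original request, the row `C,elmo` is dropped because `A,elmo` appears earlier among the rows sharing the key `elmo`.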

@Yomguithereal
Contributor

Yomguithereal commented Mar 8, 2020

I opened #238 regarding this issue.

@ThomasG77

ThomasG77 commented Apr 12, 2022

I would prefer a built-in approach within xsv. Otherwise, if we have to adopt other tools, there are too many choices, since numerous tools can run SQL syntax on CSV to do the same. It could be something like https://github.com/harelba/q/

An excerpt to illustrate:

wget --content-disposition "https://data.education.gouv.fr/explore/dataset/fr-en-carte-scolaire-colleges-publics/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B"
# from input file
q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM fr-en-carte-scolaire-colleges-publics.csv ORDER BY code_departement,code_insee" -d ';'
# If using stdin as input
cat fr-en-carte-scolaire-colleges-publics.csv | q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM - ORDER BY code_departement,code_insee" -d ';'

You can use the -O option on the q command line if you want headers in the output.

@Radiergummi

Well, what I do (also just stumbled upon this):

xsv fmt file.csv | xsv sort | uniq -u

...which also keeps the header line intact, since it only occurs once.

@malthejorgensen

@Radiergummi Nice – that will work for many use-cases.

Just note that it doesn't work when the individual records contain newlines – e.g. user-generated content like posts in a StackOverflow or Reddit data dump (those are generally unique, but you get the point).

@Radiergummi

@malthejorgensen Wouldn't those line breaks be escaped in the output from xsv sort?

@malthejorgensen

They are escaped by putting " around the value, so the raw newlines are still present in the CSV, meaning that uniq will not detect duplicate records that contain newlines:

# sample.csv
ID,value,date
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_2,"Today,
I stayed inside",2022-07-02
> xsv fmt sample.csv | xsv sort | uniq -u
# Outputs `sample.csv` verbatim
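
For comparison, a record-aware deduplication can be sketched in Python with the standard `csv` module, which parses a quoted field containing a newline as a single value. This is only an illustration of the difference between comparing lines and comparing records; the function name `dedupe_records` is hypothetical and this is not how xsv works internally.

```python
import csv
import io

# Same content as sample.csv above, with raw newlines inside quoted fields.
SAMPLE = (
    'ID,value,date\n'
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_2,"Today,\nI stayed inside",2022-07-02\n'
)


def dedupe_records(csv_text):
    """Return (header, rows) with exact duplicate records removed."""
    # newline="" leaves line endings untouched so the csv module can
    # handle newlines embedded in quoted fields itself.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    seen = set()
    kept = []
    for row in reader:
        record = tuple(row)  # compare whole parsed records, not text lines
        if record not in seen:
            seen.add(record)
            kept.append(row)
    return header, kept


header, rows = dedupe_records(SAMPLE)
print(len(SAMPLE.splitlines()), "physical lines,", len(rows), "unique records")
```

Because the comparison happens after parsing, the two `comment_1` records are recognized as identical even though each one spans two physical lines.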
