Feature request: xsv sort --unique #130

kjedamzik · 2018-07-24T14:16:59Z

It would be nice to have a --unique flag for xsv sort.
For example:

echo -e 'foo,last_name\nB,kofi\nA,elmo\nC,elmo' |xsv sort -u last_name
foo,last_name
A,elmo
B,kofi

BurntSushi · 2018-07-24T14:34:49Z

See also #82.

Yomguithereal · 2019-10-28T16:38:02Z

I would also vouch for this. I find it better suited to Unix workflows and it kind of mimics the sort -u gimmick. I am not fond of #82's uniq command.

I would also filter duplicate rows on strict equality of the column selection. This means that in some cases one row would be arbitrarily chosen over another, but we can't be too clever about it anyway, unless we add a flag forcing the equality check to be done on the whole line or on another selection of fields.

I can probably open a PR about this if required.
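
The column-selection semantics described above can be sketched in a few lines of Python. This is only an illustration of the proposed behaviour, not xsv code: `sort_unique` is a hypothetical helper, and keeping the first row of each group after a stable sort is an assumption about which duplicate would win.

```python
import csv
import io


def sort_unique(text, column):
    """Sort CSV rows by one column and keep the first row per distinct key.

    Hypothetical helper for illustration; not part of xsv.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    idx = header.index(column)
    # sorted() is stable, so rows with equal keys keep their input order.
    rows = sorted(reader, key=lambda row: row[idx])
    out = [header]
    previous_key = object()  # sentinel that never equals a real key
    for row in rows:
        if row[idx] != previous_key:
            out.append(row)
            previous_key = row[idx]
    return out


# Same input as the example at the top of the issue.
data = "foo,last_name\nB,kofi\nA,elmo\nC,elmo\n"
for row in sort_unique(data, "last_name"):
    print(",".join(row))
```

With the input from the original request, this keeps `A,elmo` and drops `C,elmo`, because equality is checked only on `last_name`.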

Yomguithereal · 2020-03-08T18:16:16Z

I opened #238 regarding this issue.

ThomasG77 · 2022-04-12T19:59:12Z

I would prefer a built-in approach within xsv. Otherwise, if we are restricted to adopting other tools, there are too many choices, as numerous tools can run SQL syntax on CSV to do the same thing. It could be something like https://github.com/harelba/q/

An excerpt to illustrate:

wget --content-disposition "https://data.education.gouv.fr/explore/dataset/fr-en-carte-scolaire-colleges-publics/download/?format=csv&timezone=Europe/Berlin&lang=fr&use_labels_for_header=true&csv_separator=%3B"
# from input file
q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM fr-en-carte-scolaire-colleges-publics.csv ORDER BY code_departement,code_insee" -d ';'
# If using stdin as input
cat fr-en-carte-scolaire-colleges-publics.csv | q -H "SELECT DISTINCT code_region,libelle_region,code_academie,libelle_academie,code_departement,libelle_departement,code_insee,libelle_commune,Code_RNE FROM - ORDER BY code_departement,code_insee" -d ';'

You can use the -O option on the q command line if you want headers in the output.
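
For readers without q installed, the same SELECT DISTINCT ... ORDER BY idea can be roughly reproduced with Python's standard library. This is a minimal sketch, not how q is invoked: `select_distinct_sorted` is a hypothetical helper, the table name `t` is arbitrary, and the sample rows are invented placeholders that only borrow column names from the query above.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote a column name so it can be used as an SQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def select_distinct_sorted(text, columns, order_by, delimiter=";"):
    """Load delimited text into an in-memory table and deduplicate with SQL."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (%s)" % ", ".join(quote_ident(h) for h in header))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany("INSERT INTO t VALUES (%s)" % placeholders, reader)
    query = "SELECT DISTINCT %s FROM t ORDER BY %s" % (
        ", ".join(quote_ident(c) for c in columns),
        ", ".join(quote_ident(c) for c in order_by),
    )
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


# Invented sample rows; only the column names come from the query above.
sample = (
    "code_departement;code_insee;libelle_commune\n"
    "02;02002;Commune B\n"
    "01;01001;Commune A\n"
    "01;01001;Commune A\n"
)
columns = ["code_departement", "code_insee", "libelle_commune"]
for row in select_distinct_sorted(sample, columns, ["code_departement", "code_insee"]):
    print(";".join(row))
```

Because the rows pass through a CSV parser before reaching SQL, quoted fields are compared as whole values rather than as raw lines.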

Radiergummi · 2022-06-13T13:58:28Z

Well, what I do (also just stumbled upon this):

xsv fmt file.csv | xsv sort | uniq -u

...which also keeps the header line intact, since it only occurs once.
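
One detail worth checking before relying on this pipeline: per POSIX, `uniq -u` prints only the lines that are not repeated, so every copy of a duplicated line is dropped, whereas plain `uniq` keeps one copy. The sketch below emulates both behaviours in Python to make the difference visible. The function names and the sample lines are made up for illustration.

```python
from itertools import groupby


def uniq(lines):
    """Like plain `uniq`: keep one line from each run of adjacent equal lines."""
    return [line for line, _ in groupby(lines)]


def uniq_u(lines):
    """Like `uniq -u`: keep only lines whose run has length one."""
    return [line for line, run in groupby(lines) if len(list(run)) == 1]


# Invented, already-sorted lines with one duplicated record.
sorted_lines = ["foo,last_name", "A,elmo", "A,elmo", "B,kofi"]
print(uniq(sorted_lines))    # one copy of A,elmo survives
print(uniq_u(sorted_lines))  # A,elmo disappears entirely
```

If the goal is deduplication rather than "rows that occur exactly once", plain `uniq` is probably the closer match.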

malthejorgensen · 2022-07-01T10:28:41Z

@Radiergummi Nice – that will work for many use-cases.

Just note that it doesn't work when the individual records contain newlines – e.g. user-generated content like posts in a StackOverflow or Reddit data dump (those are generally unique, but you get the point).

Radiergummi · 2022-07-01T11:40:04Z

@malthejorgensen Wouldn't those line breaks be escaped in the output from xsv sort?

malthejorgensen · 2022-07-01T22:53:42Z

They are escaped by wrapping the value in double quotes, so the raw newlines are still present in the CSV. This means that uniq, which compares line by line, will not detect duplicate records that contain newlines.

# sample.csv
ID,value,date
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_1,"Yesterday,
I went for a long walk",2022-07-01
comment_2,"Today,
I stayed inside",2022-07-02

> xsv fmt sample.csv | xsv sort | uniq -u
# Outputs `sample.csv` verbatim
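
A record-aware workaround is to deduplicate after parsing, so that a quoted newline stays inside its field. The following is a minimal sketch using Python's csv module on the same sample. `dedup_records` is a hypothetical helper, and comparing whole records (rather than a column selection) is an assumption made to mirror the pipeline above.

```python
import csv
import io

# Same content as sample.csv above.
SAMPLE = (
    "ID,value,date\n"
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_1,"Yesterday,\nI went for a long walk",2022-07-01\n'
    'comment_2,"Today,\nI stayed inside",2022-07-02\n'
)


def dedup_records(text):
    """Parse CSV, sort the records, and drop records seen before."""
    # newline="" leaves embedded line breaks for the csv module to handle.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    seen = set()
    unique = [header]
    for record in sorted(reader):
        key = tuple(record)  # whole-record equality
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = dedup_records(SAMPLE)
print(len(records) - 1)  # number of data records left after deduplication
```

Here the two `comment_1` records compare equal even though each spans two physical lines, which is exactly the case the line-based `uniq` misses.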
