Feature Request: deduplicate columns/extract unique columns #84

peterjc · 2021-10-14T12:12:45Z

We can use qsv dedup or the Unix command line tools sort and uniq to remove duplicate rows in a plain text table, but I find myself wanting to do something similar with duplicated columns.

For example, after doing qsv join ... there will be at least one pair of duplicated columns (the values used for the join).

I am hoping for something like a column based version of the row based qsv dedup command (see #26).

I suspect I could work around this via the qsv transpose command (see #3).

jqnatividad · 2021-10-14T15:58:12Z

Another workaround is select.

But while I was looking into this, I saw this pending PR to add a --merge option to join.

eddy-geek/xsv@643683c

I'm now looking to adapt it to qsv as well. :)

peterjc · 2021-10-14T16:01:21Z

I had spotted BurntSushi/xsv#114 too, and the --merge idea in the join command would help.

My real use case is merging several "join" tables from another tool, which all share the first dozen columns (and values).

jqnatividad · 2021-10-14T16:18:37Z

Got it. You're not just deduping duplicate headers.
And like you said, doing a transpose, then a dedup, then another transpose should do the trick. Have you tried that?

peterjc · 2021-10-14T16:29:14Z

Not tested yet - I got sidetracked by conda-forge (see #85), but could try your pre-compiled binaries instead.

peterjc · 2021-10-14T16:37:32Z

The transpose/dedup/transpose workaround isn't quite what I wanted as it has also sorted the columns (and I wanted to preserve the order keeping the first occurrence only). I wonder how often people would want a row-based dedup which preserves order?

Using select would probably be best although I may have to construct the desired column list by hand, which will be tedious.
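The order-preserving behaviour described above (keep only the first occurrence of each duplicated column, leave the remaining columns in their original order) can be sketched in plain awk. This is not a qsv feature; dedup_columns is a hypothetical helper name. The sketch assumes a rectangular table of simple delimited text with no quoted or embedded delimiters, and it holds the whole table in memory.

```shell
#!/bin/sh
# dedup_columns (hypothetical helper, not part of qsv or xsv):
# drop every column whose header AND values all repeat an earlier column,
# keeping the first occurrence and the original column order.
# Assumptions: rectangular input, no quoted/embedded delimiters,
# table small enough to hold in memory.
# Usage: dedup_columns [delimiter] < input > output   (delimiter defaults to ",")
dedup_columns() {
    awk -F "${1:-,}" -v OFS="${1:-,}" '
        {
            # Remember every cell and append it to its column signature.
            for (c = 1; c <= NF; c++) {
                cell[NR, c] = $c
                sig[c] = sig[c] SUBSEP $c
            }
            if (NF > ncols) ncols = NF
        }
        END {
            # Keep a column only the first time its signature is seen.
            for (c = 1; c <= ncols; c++)
                if (!(sig[c] in seen)) { seen[sig[c]] = 1; keep[++nkeep] = c }
            # Re-emit each row using only the kept columns, in order.
            for (r = 1; r <= NR; r++) {
                line = ""
                for (k = 1; k <= nkeep; k++)
                    line = line (k > 1 ? OFS : "") cell[r, keep[k]]
                print line
            }
        }'
}

# Example:
#   printf 'id,name,id,score,name\n1,a,1,9,a\n2,b,2,7,b\n' | dedup_columns
# prints:
#   id,name,score
#   1,a,9
#   2,b,7
```

Columns that share a header but differ in any value are both kept, which matches the "same columns and values" use case rather than a header-only dedup.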

jqnatividad · 2021-10-14T17:46:33Z

And don't forget that select allows you to differentiate between identically named columns with the [] selector, e.g.

qsv select 'Foo[2]'

to select the second column named 'Foo'.

Also, maybe you can use the headers command to extract the column names, eliminate the identically-named columns, and then use that for select?
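One way the headers-then-select idea could look, as a rough sketch: turn a list of column names into the 1-based positions of each name's first occurrence, then hand that list to select. The helper name first_occurrence_indices and the file data.csv are hypothetical, and the qsv invocation in the comment assumes that qsv headers --just-names prints one name per line and that qsv select accepts a comma-separated list of 1-based indices.

```shell
#!/bin/sh
# first_occurrence_indices (hypothetical helper):
# read one column name per line on stdin and print a comma-separated list
# of the 1-based positions where each distinct name first appears.
first_occurrence_indices() {
    awk '!seen[$0]++ { out = out (out == "" ? "" : ",") NR }
         END { print out }'
}

# Assumed usage with qsv (data.csv is a placeholder file name):
#   qsv select "$(qsv headers --just-names data.csv | first_occurrence_indices)" data.csv
#
# Example without qsv:
#   printf 'id\nname\nid\nscore\nname\n' | first_occurrence_indices
# prints:
#   1,2,4
```

Note that this compares names only, so two identically named columns with different values would be collapsed to the first one.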

Regardless, if you come up with a useful recipe, please do share it in the Cookbook.

eddy-geek · 2021-10-14T17:50:14Z

I can redo the PR here if it helps.
(if it does not conflict too much, my rust is... rusty)

jqnatividad · 2021-10-14T18:05:57Z

@eddy-geek Please do!
I'm using qsv myself to rust up... 😉

jqnatividad · 2021-10-19T17:29:58Z

Hi @eddy-geek , just wanted to give you a heads-up that I modified join to have left-semi and left-anti joins...

As is, they only take columns from the left relation, so it shouldn't affect your PR for deduping column names...

peterjc · 2021-11-11T13:30:28Z

Looking at #89 and #90, while --merge was briefly merged (to drop the duplicated columns which the join produces), it was reverted due to a performance regression.

If that was working, it would solve my use case fairly well. Here I merge on column 3 of base_fields.tsv and column 2 of source_*.tsv (which becomes column 1 after the cut operation to discard all the other repeated columns):

cp base_fields.tsv working.tsv
for TSV in source_*.tsv; do
   xsv join --right 3 working.tsv 1 <(cut -f 2,36- $TSV) | xsv fmt -t "\t" > new.tsv
   mv new.tsv working.tsv
done

That looks to be working nicely, other than the duplication of the join column.

(My original request of a column deduplication command would make this even easier)
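Until something like --merge is available, the repeated join column could be removed after each join by position. The following is only a sketch: drop_column is a hypothetical helper, it assumes tab-separated text without quoted or embedded delimiters, and the commented usage assumes the join output lists all columns of the left table first, so the repeated join column sits immediately after them.

```shell
#!/bin/sh
# drop_column N [delimiter] (hypothetical helper):
# remove the N-th (1-based) field from every line of stdin.
# The delimiter defaults to a tab; quoted/embedded delimiters are not handled.
drop_column() {
    awk -F "${2:-$(printf '\t')}" -v OFS="${2:-$(printf '\t')}" -v n="$1" '
        {
            line = ""; first = 1
            for (c = 1; c <= NF; c++) {
                if (c == n) continue          # skip the unwanted field
                line = line (first ? "" : OFS) $c
                first = 0
            }
            print line
        }'
}

# Assumed usage inside the loop above (not tested against xsv output):
#   n=$(head -n 1 working.tsv | awk -F '\t' '{ print NF }')
#   ... | xsv fmt -t "\t" | drop_column "$((n + 1))" > new.tsv
#
# Example:
#   printf 'a\tb\tc\n1\t2\t3\n' | drop_column 2
# prints the two lines "a<TAB>c" and "1<TAB>3".
```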

jqnatividad · 2021-11-11T17:46:01Z

@peterjc , can you add that to the Cookbook?

Hopefully, @eddy-geek can redo his old PR and we can get the --merge option.

BTW, the performance regression may have been a false positive... I just installed WSL at the time and I have since uninstalled it. I ran the benchmarks on WSL and it was giving some bad numbers which I may have unnecessarily attributed to the PR.

peterjc · 2021-11-11T20:15:12Z

Ah, the wiki page https://github.com/jqnatividad/qsv/wiki/Cookbook#cookbook - I could do that. Maybe a simpler version with CSV files only.

peterjc · 2021-11-12T11:50:13Z

I don't see that I can edit the wiki (likely restricted to collaborators which is fine), so suggested text:

Multi-table join avoiding repeated columns

This example was inspired by having to combine multiple tables exported from another system, which were themselves the result of multiple database joins. Suppose you have several tables (table_*.csv) which have the same first 10 columns, and then a varying number of additional columns. The column we want to join on is column 2, and for simplicity assume the rows all match perfectly (otherwise you would explore the left and right join options).

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
    qsv join --merge 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
    mv new.csv combined.csv
done

We use a loop to perform multiple joins. Each time we use qsv select to pull out the index (join column 2) and the columns unique to that file (11 onwards), which could also be done with cut -d "," -f 2,11- $NEXT if preferred (as long as no field contains a quoted comma). The join column becomes column 1 of the intermediate file.

The --merge option stops duplication of the join column.

jqnatividad · 2021-11-12T14:50:02Z

@peterjc , this is awesome. Thanks!

I just opened up the wiki, do you mind adding the article yourself?

I really want the wiki to be a community resource, and being one of the early qsv adopters, I'd really appreciate it if you make the first community contribution to it!

peterjc · 2021-11-12T15:17:43Z

Done.

github-actions · 2022-01-12T10:33:13Z

Stale issue message

jqnatividad · 2023-09-27T18:46:01Z

Should somebody stumble into this - the polars powered joinp command does not have this problem.

github-actions bot added the no-issue-activity label Jan 12, 2022

github-actions bot closed this as completed Jan 20, 2022

jqnatividad removed the no-issue-activity label Jun 20, 2023

jqnatividad mentioned this issue Sep 27, 2023

Another attempt to package qsv written in rust conda-forge/staged-recipes#24081
