Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: deduplicate columns/extract unique columns #84

Closed
peterjc opened this issue Oct 14, 2021 · 17 comments
Closed

Feature Request: deduplicate columns/extract unique columns #84

peterjc opened this issue Oct 14, 2021 · 17 comments

Comments

@peterjc
Copy link

peterjc commented Oct 14, 2021

Cross reference BurntSushi/xsv#283

We can use qsv dedup or the Unix command line tools sort and uniq to remove duplicate rows in plain text table, but I find myself wanting to do something similar with duplicated columns.

For example, after doing qsv join ... there will be at least one pair of duplicated columns (the values used for the join).

I am hoping for something like a column based version of the row based qsv dedup command (see #26).

I suspect I could workaround this via the qsv transpose command (see #3).

@jqnatividad
Copy link
Collaborator

Another workaround is select.

But while I was looking into this, I saw this pending PR to add a --merge option to join.

eddy-geek/xsv@643683c

I'm now looking to adapt it to qsv as well. :)

@peterjc
Copy link
Author

peterjc commented Oct 14, 2021

I had spotted BurntSushi/xsv#114 too, and the --merge idea in the join command would help.

My real use case is merging several "join" tables from another tool, which all share the first dozen columns (and values).

@jqnatividad
Copy link
Collaborator

Got it. You're not just deduping duplicate headers.
And like you said, doing a transpose, then a dedup, then another transpose should do the trick. Have you tried that?

@peterjc
Copy link
Author

peterjc commented Oct 14, 2021

Not tested yet - I got sidetracked by conda-forge (see #85), but could try your pre-compiled binaries instead.

@peterjc
Copy link
Author

peterjc commented Oct 14, 2021

The transpose/dedup/transpose workaround isn't quite what I wanted as it has also sorted the columns (and I wanted to preserve the order keeping the first occurrence only). I wonder how often people would want a row-based dedup which preserves order?

Using select would probably be best although I may have to construct the desired column list by hand, which will be tedious.

@jqnatividad
Copy link
Collaborator

And don't forget that select allows you to differentiate between identically named columns with the [] selector, e.g.

qsv select 'Foo[2]'

to select the second column named 'Foo'.

Also, maybe you can use the headers command to extract the column names, eliminate the identically-named columns, and then use that for select?

Regardless, if you come up with a useful recipe, please do share it in the Cookbook.

@eddy-geek
Copy link
Contributor

I can redo the PR here if it helps.
(if it does not conflict too much, my rust is... rusty)

@jqnatividad
Copy link
Collaborator

@eddy-geek Please do!
I'm using qsv myself to rust up... 😉

@jqnatividad
Copy link
Collaborator

Hi @eddy-geek , just wanted to give you a heads-up that I modified join to have left-semi and left-anti joins...

As is, they only take columns from the left relation, so it shouldn't affect your PR for deduping column names...

@peterjc
Copy link
Author

peterjc commented Nov 11, 2021

Looking at #89 and #90, while --merge was briefly merged (to drop the duplicated columns which the join produces), it was reverted to to a performance regression.

If that was working, it would solve my use case fairly well. Here I merge on column 3 of base_fields.tsv and column 2 of source_*.tsv (which becomes column 1 after the cut operation to discard all the other repeated columns):

cp base_fields.tsv working.tsv
for TSV in source_*.tsv; do
   xsv join --right 3 working.tsv 1 <(cut -f 2,36- $TSV) | xsv fmt -t "\t" > new.tsv
   mv new.tsv working.tsv
done

That looks to be working nicely, other than the duplication of the join column.

(My original request of a column deduplication command would make this even easier)

@jqnatividad
Copy link
Collaborator

@peterjc , can you add that to the Cookbook?

Hopefully, @eddy-geek can redo his old PR and we can get the --merge option.

BTW, the performance regression may have been a false positive... I just installed WSL at the time and I have since uninstalled it. I ran the benchmarks on WSL and it was giving some bad numbers which I may have unnecessarily attributed to the PR.

@peterjc
Copy link
Author

peterjc commented Nov 11, 2021

Ah, the wiki page https://github.com/jqnatividad/qsv/wiki/Cookbook#cookbook - I could do that. Maybe a simpler version with CSV files only.

@peterjc
Copy link
Author

peterjc commented Nov 12, 2021

I don't see that I can edit the wiki (likely restricted to collaborators which is fine), so suggested text:

Multi-table join avoiding repeated columns

This example was inspired by having to combine multiple tables exported from another system, which were themselves from multiple database joins. Suppose you have have several tables (table_*.csv) which have the same first 10 columns, and then a varying number of additional columns. The column we want to join on is column 2, and for simplicity assume the rows all match perfectly (otherwise you would explore the left and right join options).

cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
    qsv join --merge 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
    mv new.csv combined.csv
done

We use a loop to perform multiple joins. Each time we use xsv select to pull out the index (join column 2) and the columns unique to that file (11 onwards), which could also be done with cut -s "," -f 1,11- $NEXT if preferred. The join column becomes column 1 of the intermediate file.

The --merge option stops duplication of the join column.

@jqnatividad
Copy link
Collaborator

@peterjc , this is awesome. Thanks!

I just opened up the wiki, do you mind adding the article yourself?

I really want the wiki to be a community resource, and being one of the early qsv adopters, I'd really appreciate it if you make the first community contribution to it!

@peterjc
Copy link
Author

peterjc commented Nov 12, 2021

Done.

@github-actions
Copy link

Stale issue message

@jqnatividad
Copy link
Collaborator

Should somebody stumble into this - the polars powered joinp command does not have this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants