New Tool: hts_ExtractUMI #257

bnjenner · 2024-04-12T23:29:50Z

Hey there,

This PR adds a tool called hts_ExtractUMI. As the name implies, it is meant to be more or less a drop-in replacement for umi_tools extract, with some different functionality so that it fits the HTStream philosophy (mainly streaming) and some single-cell pipelines a bit better. It is based on a Python script Matt wrote, so I had a good idea of where to start. I tried my best to preserve the style and organization of HTStream and also wrote a bunch of test functions. As of now, all the tests pass for all the functions and the output from other programs is unaffected, so I am pretty sure nothing got messed up even though I added some functions to read.h and utils.h.

I imagine you guys will have some questions and comments, and will request some changes (particularly to the name, lol), but I figured now is a good time to get some other eyes on the code and bring attention to this PR. I don't want to dump a bunch of code into the repo all at once.

joe-angell

See the couple minor comments but other than that it looks good.

common/src/read.h

joe-angell · 2024-04-13T00:54:07Z

hts_ExtractUMI/src/hts_ExtractUMI.h

+    ExtractUMI() {
+        program_name = "hts_ExtractUMI";
+        app_description =
+            "The hts_ExtractUMI application trims a set number of bases from the 5'\n";


Is the 5 here a typo?

The 5' is short for the 5 prime end of the read. I know you guys use left and right to describe things in HTStream; would you prefer that?

Oh yeah that would be the left end correct? I'm not a bio expert ;P.

Lol yes the left end. I will add that to the description. :)
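For readers following the 5'/left-end exchange above, here is a minimal sketch of what "trimming a set number of bases from the left end" could look like. It is not the code from this PR: `SimpleRead`, `UmiSplit`, and `split_left_umi` are hypothetical names used only for illustration, and the real tool works on HTStream's own read classes in read.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a read; NOT HTStream's Read class from read.h.
struct SimpleRead {
    std::string seq;   // bases, left (5') end first
    std::string qual;  // one quality character per base
};

// Hypothetical result type: the UMI plus the read with the UMI removed.
struct UmiSplit {
    std::string umi;
    SimpleRead trimmed;
};

// Take the first `umi_length` bases from the left (5') end as the UMI.
// If the read is shorter than `umi_length`, the whole read becomes the UMI.
inline UmiSplit split_left_umi(const SimpleRead& r, std::size_t umi_length) {
    assert(r.seq.size() == r.qual.size());  // sequence and qualities must line up
    const std::size_t n = std::min(umi_length, r.seq.size());
    UmiSplit out;
    out.umi = r.seq.substr(0, n);
    out.trimmed.seq = r.seq.substr(n);
    out.trimmed.qual = r.qual.substr(n);
    return out;
}
```

With a read of `ACGTTTGGCC` and a UMI length of 4, this sketch would yield the UMI `ACGT` and leave `TTGGCC` as the remaining sequence.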

samhunter · 2024-04-13T17:19:07Z

hts_ExtractUMI/src/hts_ExtractUMI.h

+        }
+
+        if (!umi.discard) {
+            r.set_id_first(r.get_id_first() + "_" + umi.seq);


@bnjenner, in discussing this with @dstreett, he pointed out the use of "_" to append the UMI. I'm not sure if there is an industry standard for this, but for Illumina's DRAGEN platform, the approach is to use a ":" like the rest of the fields in the read, e.g.

Read name—The UMI sequence is located in the eighth colon-delimited field of the read name (QNAME). For example, NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA
From: https://support-docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/UMIs.htm

For UMI-tools it looks like they use an "_" (https://umi-tools.readthedocs.io/en/latest/QUICK_START.html).

Maybe including a parameter that lets the user set this would be a good idea? Perhaps with a default of "_"?
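To make the suggestion concrete, a user-settable delimiter could be sketched roughly as follows. This is only an illustration under assumptions: `append_umi` is a hypothetical helper that does not exist in HTStream, and the diff above builds the ID inline with `r.set_id_first(...)` instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of HTStream: join a read ID and a UMI
// with a caller-chosen delimiter. The default "_" matches the current
// behaviour shown in the diff above.
inline std::string append_umi(const std::string& read_id,
                              const std::string& umi,
                              char delimiter = '_') {
    assert(!umi.empty());  // appending an empty UMI would leave a dangling delimiter
    return read_id + delimiter + umi;
}
```

Calling `append_umi(id, umi)` keeps the UMI-tools style `id_UMI`, while `append_umi(id, umi, ':')` gives the colon-delimited style described in the DRAGEN documentation.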

I also was thinking about this a bit as I was trying to add umi support for superdeduper...

Seems this might be a standard now. UCD has seen a lot of Aviti sequencing lately (sorry Sam) and it seems like those headers follow the same format. I will add this to the to-do list.

Sounds good, thanks @bnjenner!

Ok so as of now, I have added two new parameters, --delimiter and --DRAGEN, that I am ready to create a new PR for. --delimiter does what you'd expect. --DRAGEN, on the other hand, enforces the DRAGEN formatting on single- and paired-end reads according to the link you sent me, @samhunter.

Of course, hts_ExtractUMI is still designed to append the UMI to the end of the read ID, as I kinda assumed the 8th ":"-delimited column would always be the end of the read ID. Is this a safe assumption to make? Or should I be more explicit about where the UMI goes when the DRAGEN parameter is given? I can't find much info about this new read ID format.

I think that's a fairly safe assumption. If "--DRAGEN" is meant to make the input compatible with the DRAGEN consensus generator, then it should be the 8th column, which should always be the end of the read ID (unless you get your reads from SRA or something, and then all bets are off).
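The "8th column" assumption can be checked cheaply before appending. The sketch below is an illustration only, not the --DRAGEN implementation: `count_fields`, `umi_lands_in_eighth_field`, and `dragen_style_id` are hypothetical names, and joining paired-end UMIs with "+" is assumed from the Illumina example quoted earlier in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Count the colon-delimited fields in a read ID (QNAME).
inline std::size_t count_fields(const std::string& id) {
    return static_cast<std::size_t>(std::count(id.begin(), id.end(), ':')) + 1;
}

// True when appending ":" + UMI would place the UMI in the 8th field,
// i.e. the ID currently has exactly 7 colon-delimited fields.
inline bool umi_lands_in_eighth_field(const std::string& id) {
    return count_fields(id) == 7;
}

// Hypothetical helper: build a DRAGEN-style ID. For paired-end UMIs the
// two sequences are joined with "+", as in the quoted Illumina example.
inline std::string dragen_style_id(const std::string& id,
                                   const std::string& umi1,
                                   const std::string& umi2 = "") {
    assert(!umi1.empty());  // at least one UMI is required
    const std::string umi = umi2.empty() ? umi1 : umi1 + "+" + umi2;
    return id + ":" + umi;
}
```

A read ID that fails `umi_lands_in_eighth_field` (for example one that has been renamed by an archive) is the "all bets are off" case, and a tool could warn or fall back to plain appending there.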

bnjenner added 16 commits March 30, 2024 09:35

ExtractUMI init

aae7bed

Read number handler

f3c0fa0

extract_umi function cleanup

6613b57

hts_ExtractUMI help update

267028d

UMI Quality Threshold

6a58bef

update str compare and help page

248936b

hts_ExtractUMI restructure + N and Homopolymer filter

583699e

hts_ExtractUMI help typo fix

d32e706

bad cast fix, oops lol

265fcf0

write_options cast fix, char

79c652f

added support for PE UMIs and added UMI struct

ec5217e

check_char_range() to check_values(), more general use

6d10fb7

regression tests

f9e1d88

test json was not using default parameters

b61f223

hts_ExtractUMI test json fix... again

b4935bd

hts_ExtractUMI test json fix... again

4b2689e

joe-angell approved these changes Apr 13, 2024

copy fix (read.h) and description update for hts_ExtractUMI

ee393b9

bnjenner merged commit 41c1c6f into s4hts:master Apr 13, 2024
2 checks passed

samhunter reviewed Apr 13, 2024
