Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Tool: hts_ExtractUMI #257

Merged
merged 17 commits into from
Apr 13, 2024
Merged

New Tool: hts_ExtractUMI #257

merged 17 commits into from
Apr 13, 2024

Conversation

bnjenner
Copy link
Collaborator

Hey there,

This PR adds a tool called hts_ExtractUMI. As the name implies, it is meant to sort of be a drop in replacement for umi_tools extract with some different functionality so that it fits the HTStream philosophy, mainly streaming, and some single cell pipelines a bit better. It is based on a Python script Matt wrote, so I had a good idea of where to start. I tried my best to preserve the style and organization of HTStream and also wrote a bunch of test functions. As of now, all the tests pass for all the functions and the output from other programs is unaffected, so I am pretty sure nothing got messed up even though I added some functions to read.h and utils.h.

I imagine you guys will have some questions, comments, and request some changes (particularly to the name, lol), but I figured now is a good time to get some other eyes on the code and bring attention to this PR. I don't want to dump a bunch of code into the repo all at once.

Copy link
Collaborator

@joe-angell joe-angell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the couple minor comments but other than that it looks good.

common/src/read.h Outdated Show resolved Hide resolved
ExtractUMI() {
program_name = "hts_ExtractUMI";
app_description =
"The hts_ExtractUMI application trims a set number of bases from the 5'\n";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the 5 here a typo?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 5' is short for 5 prime end of the read. I know you guys use left and right to describe things in HTStream, would you guys prefer that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah that would be the left end correct? I'm not a bio expert ;P.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol yes the left end. I will add that to the description. :)

@bnjenner bnjenner merged commit 41c1c6f into s4hts:master Apr 13, 2024
2 checks passed
}

if (!umi.discard) {
r.set_id_first(r.get_id_first() + "_" + umi.seq);
Copy link
Collaborator

@samhunter samhunter Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnjenner, in discussing this with @dstreett, he pointed out the use of "_" to append the UMI. I'm not sure if there is an industry standard for this, but for Illumina's DRAGEN platform, the approach is to use a ":" like the rest of the fields in the read, e.g.

Read name—The UMI sequence is located in the eighth colon-delimited field of the read name (QNAME). For example, NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA
From: https://support-docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/UMIs.htm

For UMI-tools it looks like they use an "_" (https://umi-tools.readthedocs.io/en/latest/QUICK_START.html).

Maybe setting including a parameter that makes it possible for the user to set this would be a good idea? Perhaps with a default of "_" ?

Copy link
Collaborator Author

@bnjenner bnjenner Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also was thinking about this a bit as I was trying to add umi support for superdeduper...

Seems this might be a standard now. UCD has seen a lot of Aviti sequencing lately (sorry Sam) and it seems like those headers follow the same format. I will add this to the to-do list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks @bnjenner!

Copy link
Collaborator Author

@bnjenner bnjenner Apr 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so as of now, I have added two new parameters, --delimiter and --DRAGEN, that I am ready to create a new PR for. Delimiter does what you'd expect, DRAGEN on the other hand, I use to enforce the DRAGEN formatting on single and paired end reads according to the link you sent me @samhunter

Of course, hts_ExtractUMI is still designed to append the UMI to the end of the read ID, as I kinda assumed the 8th ":" delimited column would always be the end of the read ID. Is this a safe assumption to make? Or should I be more explicit in where the UMI goes when the DRAGEN parameter is given. I can't find much info about this new read ID format.

Copy link
Collaborator

@samhunter samhunter Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a fairly safe assumption. If "--DRAGEN" is meant to make the input compatible for the DRAGEN consensus generator, then it should be 8th column, which should always be the end of the read ID (unless you get your reads from SRA or something, and then all bets are off).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants