-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Tool: hts_ExtractUMI #257
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the couple minor comments but other than that it looks good.
ExtractUMI() { | ||
program_name = "hts_ExtractUMI"; | ||
app_description = | ||
"The hts_ExtractUMI application trims a set number of bases from the 5'\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the 5
here a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 5' is short for 5 prime end of the read. I know you guys use left and right to describe things in HTStream, would you guys prefer that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah that would be the left end correct? I'm not a bio expert ;P.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol yes the left end. I will add that to the description. :)
} | ||
|
||
if (!umi.discard) { | ||
r.set_id_first(r.get_id_first() + "_" + umi.seq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bnjenner, in discussing this with @dstreett, he pointed out the use of "_" to append the UMI. I'm not sure if there is an industry standard for this, but for Illumina's DRAGEN platform, the approach is to use a ":" like the rest of the fields in the read, e.g.
Read name—The UMI sequence is located in the eighth colon-delimited field of the read name (QNAME). For example, NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA
From: https://support-docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/UMIs.htm
For UMI-tools it looks like they use an "_" (https://umi-tools.readthedocs.io/en/latest/QUICK_START.html).
Maybe setting including a parameter that makes it possible for the user to set this would be a good idea? Perhaps with a default of "_" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also was thinking about this a bit as I was trying to add umi support for superdeduper...
Seems this might be a standard now. UCD has seen a lot of Aviti sequencing lately (sorry Sam) and it seems like those headers follow the same format. I will add this to the to-do list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks @bnjenner!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so as of now, I have added two new parameters, --delimiter and --DRAGEN, that I am ready to create a new PR for. Delimiter does what you'd expect, DRAGEN on the other hand, I use to enforce the DRAGEN formatting on single and paired end reads according to the link you sent me @samhunter
Of course, hts_ExtractUMI is still designed to append the UMI to the end of the read ID, as I kinda assumed the 8th ":" delimited column would always be the end of the read ID. Is this a safe assumption to make? Or should I be more explicit in where the UMI goes when the DRAGEN parameter is given. I can't find much info about this new read ID format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a fairly safe assumption. If "--DRAGEN" is meant to make the input compatible for the DRAGEN consensus generator, then it should be 8th column, which should always be the end of the read ID (unless you get your reads from SRA or something, and then all bets are off).
Hey there,
This PR adds a tool called hts_ExtractUMI. As the name implies, it is meant to sort of be a drop in replacement for umi_tools extract with some different functionality so that it fits the HTStream philosophy, mainly streaming, and some single cell pipelines a bit better. It is based on a Python script Matt wrote, so I had a good idea of where to start. I tried my best to preserve the style and organization of HTStream and also wrote a bunch of test functions. As of now, all the tests pass for all the functions and the output from other programs is unaffected, so I am pretty sure nothing got messed up even though I added some functions to read.h and utils.h.
I imagine you guys will have some questions, comments, and request some changes (particularly to the name, lol), but I figured now is a good time to get some other eyes on the code and bring attention to this PR. I don't want to dump a bunch of code into the repo all at once.