Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UMI Support for hts_SuperDeduper! #261

Merged
merged 20 commits into from
Apr 28, 2024
Merged

UMI Support for hts_SuperDeduper! #261

merged 20 commits into from
Apr 28, 2024

Conversation

bnjenner
Copy link
Collaborator

Hello again,

I have added support for UMI-based PCR deduplication. It functions by just extracting the UMI from the read ID, converting it to bits, and appending it onto the sequence key used for PCR duplicate removal, essentially just extending the sequence. I also added an extra test function in hts_SuperDeduper_test that was useful during development. Additionally, I have ran this new version of the tool on a variety of datasets and it appears to work as intended. I originally wanted to provide some data for this PR, but ultimately I realized that testing fell more in line with bench marking then validating the algorithm. That being said, if you would like to see some of these results, I would be happy to add what I have found. Although, it is worth noting that the effectiveness of this method on single end reads, particularly TAGseq experiments with lower complexity, while comparable to umi_tools, is ultimately worse and for some reason very sensitive to whether or not you use hts_CutTrim before it. I am currently working on finding the best parameter settings for its application to TAGseq.

Anyways, let me know what you guys think, I am excited to finally have some eyes on this one.

@bnjenner
Copy link
Collaborator Author

In hindsight, umitools considers the edit distance when considering the umi in deduplication... let me know if you guys would prefer the implementation include this.

Copy link
Collaborator

@joe-angell joe-angell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, couple minor issues.

@@ -35,6 +35,19 @@ std::string ReadBase::bit_to_str(const BitSet &bits) {
return out;
}

boost::optional<BitSet> ReadBase::bitjoin(const boost::optional<BitSet> &bit1, const boost::optional<BitSet> &bit2, const char& del) {
if (del == '\0') {
return bit2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why return bit2 in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit1 is the umi sequence key, bit2 is the read sequence key. If the delimiter is unset (the umi method is not enabled) it will just return the sequence key and continue with the normal SuperDeduper algorithm. This was ultimately just how I chose to add the umi mode, specifically choosing to only include one parameter to enable it instead of a switch and an option for the delimiter. I also had mixed feelings about this, lol.

Copy link
Collaborator

@joe-angell joe-angell Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would change this to just a bit join that doesn't do any special logic related to super d. Expanding more on this point, the del variable is not necessary for the join, it's only used as a special logic related to superd. This is the programming principle of separation of concerns.

const std::string get_umi(const char& del) {
size_t idx = id.rfind(del);
if (idx == std::string::npos) {
throw HtsRuntimeException("Did not detected extracted UMI. Be sure hts_ExtractUMI is run prior to hts_Superdeduper");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is too specific, maybe "Did not detected extracted UMI. Be sure hts_ExtractUMI is run prior to the current operation"

if (result.size() < 7) {
throw HtsRuntimeException("Read ID misformated. Does not have appropriate number of \":\" delimited columns for DRAGEN UMI format");
}
umi = result[7];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result.size() < 7 should that be < 8 ? you are getting the 8th element here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the read ID can actually only have 7 fields, the 8th field does not exist on a lot reads and is optionally for the UMI, from my understanding. That is why we I check if there are less than 7, which I understood to be the minimum.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is you will get undefined behavior reading result[7] if the vector is size 7, so you need to check the size here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH, you're totally right, I don't know why I thought the check would be < 7, it is definitely < 8 here. Illumina headers would definitely need 8 fields in this situation. Making the changes now.


if (del != '\0') {
umi_seq = "";
for (const auto &r : i -> get_reads_non_const()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use get_reads() when you are not modifying them and using const reference

//check for existance, store or compare quality and replace:
if ( tmpAvg < discard_qual ){ // averge qual must be less than discard_qual, ignored
counters.increment_ignored();
} else if (auto key=i->get_key(start, length)) { // check for duplicate
} else if (auto key=i->bitjoin(umi_bit, (i -> get_key(start, length)), del)) { // check for duplicate
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, I thought the UMI was a way to dedup directly? Do we want to use both the key and the umi to dedup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, so umi_tools uses the mapping position as well as the umi for this deduplication. I figured that using UMI's in addition to the sequence key would be a good proxy for this and add to SuperDeduper's algorithm instead of replacing it. But, if there is a better method that we could implement, I will be happy to change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense. As for edit distance, i'm not sure there is an easy way to integrate that into superd, do you know how they do that in umi_tools?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would definitely require changing the code a decent bit. I haven't looked at umi_tool's algorithm so I don't know exactly how they do it, and they have a few different methods. But here is what the documentation says:

"All methods start by identifying the reads with the same mapping position.

The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to say if that feature is wanted you could add it on in a new pr.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I am making the changes you suggested and will push soon. Are we good to merge or should we wait for more feedback?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@bnjenner bnjenner merged commit 337eda0 into s4hts:master Apr 28, 2024
2 checks passed
@msettles
Copy link
Member

msettles commented Jun 6, 2024

somewhere however, now incorrect QUAL characters are getting introduced. Brad have you seen this anywhere else on this branch? In hts_Overlapper

@bnjenner
Copy link
Collaborator Author

bnjenner commented Jun 6, 2024 via email

@msettles
Copy link
Member

msettles commented Jun 6, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants