3 Prime untemplated versus Mismatches #11

Open
mshadbolt opened this issue Mar 27, 2018 · 5 comments

Comments

@mshadbolt

Hi, it's me again

I have a question about how the isomiRs/seqbuster pipeline is annotating isomiRs. For example I have these two isomiRs that have been categorised as having untemplated additions:

hsa-miR-22-3p.iso.t5:0.t3:tgt.ad:GGT.mm:0 
hsa-miR-25-3p.iso.t5:0.t3:tga.ad:GGA.mm:0

But I realised they could equally be categorised as having a mismatch at the 3rd base in from the three prime end. Is there a particular reason behind favouring one annotation over another?
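For anyone comparing these labels programmatically, here is a minimal sketch of how the label could be split into its fields. The function name `parse_isomir_name` is hypothetical and not part of isomiRs or seqbuster, and it assumes the label layout seen in the two examples above (`<miRNA>.iso.<key>:<value>...`).

```python
def parse_isomir_name(name):
    """Split an isomiR label into the miRNA name and its annotation fields.

    Assumes the layout '<miRNA>.iso.t5:<v>.t3:<v>.ad:<v>.mm:<v>' seen in the
    examples in this thread; other layouts are not handled.
    """
    mirna, sep, tags = name.strip().partition(".iso.")
    fields = {"mirna": mirna}
    if not sep:
        # No '.iso.' part found: return just the name (assumed unmodified miRNA).
        return fields
    for tag in tags.split("."):
        key, _, value = tag.partition(":")
        fields[key] = value
    return fields


for label in ("hsa-miR-22-3p.iso.t5:0.t3:tgt.ad:GGT.mm:0",
              "hsa-miR-25-3p.iso.t5:0.t3:tga.ad:GGA.mm:0"):
    print(parse_isomir_name(label))
```

In both labels the `t3` and `ad` fields have the same length and differ at a single base, which is why they can also be read as one mismatch.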

Also, if I had set the argument canonicalAdd to its default, TRUE, when importing files with IsomirDataSeqFromFiles, would it instead find a mismatch at that position, or would it not be separated out? Or perhaps it would depend on the allele frequency of the mismatch? Or are mismatches effectively not called in the last three positions of the read?

Thanks!

@lpantano
Owner

Hi,

that is a very valid question, and this is the story:

No enzyme described in the literature supports the addition of nucleotides other than A/U; additions of A/U are what I call canonical additions. So those sequences will go away when canonicalAdd = TRUE. Normally, what happens is that those sequences map to another place in the genome, but I am working on showing that in a formal way. The safest option is to remove them. It is true that you can lose real cases, but many of them are cross-mapping events with other genome locations.
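As a rough illustration of the A/U criterion only (this is not the isomiRs implementation of `canonicalAdd`), a check could look like the sketch below. The names `CANONICAL_BASES` and `is_canonical_addition` are hypothetical, and treating `0` as "no addition" is an assumption based on how the seqbuster output reports an empty addition.

```python
# U in the RNA appears as T in sequencing reads, so both are accepted here.
CANONICAL_BASES = set("AUT")


def is_canonical_addition(added):
    """Return True when a 3' addition is empty or made only of A/U(T)."""
    added = added.upper()
    if added in ("", "0"):  # assumed encoding of "no addition"
        return True
    return set(added) <= CANONICAL_BASES


for added in ("GGT", "GGA", "A", "TT", "0"):
    print(added, is_canonical_addition(added))
```

Under this criterion the `GGT` and `GGA` additions from the question would be dropped, while a single added `A` or `T` would be kept.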

The reason they are called additions is that there is more than 1 mismatch at the end, and that is easier to explain biologically as an addition than as mismatches. But it is true that this is an arbitrary decision and it will probably be wrong sometimes. More research is needed to have better rules to decide this, which is actually what I am trying to do together with other researchers.

I hope that helps.

Cheers

@mshadbolt
Author

Hi,
Thanks for the clarification, it makes sense. However, you say that 'there are more than 1 mismatch at the end', but in the cases I mention above there is only one mismatch to the reference. When there are no other 3p changes, the change gets called as an untemplated addition, whereas if there are, we get a mismatch at position 20. For example, from my raw results out of the seqbuster .counts file:

seq                      name                freq    mir            start  end  mism  add  t5  t3   s5        s3        DB     precursor   ambiguity
AAGCTGCCAGTTGAAGAACTGT   seq_453217_x323508  323508  hsa-miR-22-3p  53     74   0     0    0   0    GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACGGT   seq_433474_x29200   29200   hsa-miR-22-3p  53     71   0     GGT  0   tgt  GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACNGT   seq_278454_x18742   18742   hsa-miR-22-3p  53     71   0     NGT  0   tgt  GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACAGT   seq_22437_x5664     5664    hsa-miR-22-3p  53     71   0     AGT  0   tgt  GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACCGT   seq_92605_x3747     3747    hsa-miR-22-3p  53     71   0     CGT  0   tgt  GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACGGTA  seq_609383_x546     546     hsa-miR-22-3p  53     74   20GT  A    0   0    GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACNGTA  seq_147029_x409     409     hsa-miR-22-3p  53     74   20NT  A    0   0    GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACGGTT  seq_219176_x200     200     hsa-miR-22-3p  53     75   20GT  0    0   T    GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACNGTT  seq_234952_x130     130     hsa-miR-22-3p  53     75   20NT  0    0   T    GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1
AAGCTGCCAGTTGAAGAACAGTA  seq_63995_x117      117     hsa-miR-22-3p  53     74   20AT  A    0   0    GCTAAAGC  CTGTTGCC  miRNA  hsa-mir-22  1

So I guess any single true mismatch in the last three basepairs won't be called as a mismatch if there isn't any other change on the 3 prime end. In this case I don't think it is a real mismatch, and if I use your recommended method of removing 'non-canonical' additions and having a vaf cut-off of 0.2 then these wouldn't make it into my final set of isomiRs anyway but I wanted to point it out as it might not be the behaviour that everyone would expect. I totally understand that it isn't always easy to come up with rules that cover every case and is definitely still an open research question.
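To make the equivalence concrete, here is a minimal sketch that re-expresses an equal-length 3' trim plus addition (the `t3` and `add` columns) as substitutions. The function name `addition_as_mismatches` is hypothetical, and the output format (1-based position, read base, reference base, as in `20GT`) is inferred from the `mism` column of the .counts rows in this comment rather than from seqbuster documentation.

```python
def addition_as_mismatches(ref_len, t3, ad):
    """Re-express a 3' trim + addition of equal length as substitutions.

    Returns strings such as '20GT' (1-based position, read base, reference
    base), or None when the trim and the addition differ in length and so
    cannot be read as plain substitutions.
    """
    if len(t3) != len(ad):
        return None
    start = ref_len - len(t3)  # 0-based index of the first trimmed base
    return [
        f"{start + i + 1}{read_base.upper()}{ref_base.upper()}"
        for i, (ref_base, read_base) in enumerate(zip(t3, ad))
        if ref_base.upper() != read_base.upper()
    ]


reference = "AAGCTGCCAGTTGAAGAACTGT"  # hsa-miR-22-3p reference read from the table
for ad in ("GGT", "NGT", "AGT", "CGT"):
    print(ad, addition_as_mismatches(len(reference), "tgt", ad))
```

For `t3 = tgt` and `add = GGT` this gives a single substitution at position 20, the same position that is reported as a mismatch once the read carries any other 3' change.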

@lpantano
Owner

Hi,

Thanks for looking into this. Actually, I totally agree with you. The rule is: if there is any mismatch in the last 3 nucleotides, call it an untemplated addition. That will be wrong many times, but it is difficult to know what really happened. It would probably be better to apply this rule only when there are 2 mismatches, not just one.
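A simplified sketch of that rule, with the threshold exposed as a parameter so the current behaviour (1) and the suggested one (2) can be compared. This is not the seqbuster code: the function name `classify_3p_change` and its parameters are hypothetical, and it only handles reads of the same length as the reference.

```python
def classify_3p_change(reference, read, window=3, min_mismatches=1):
    """Label a read by how many mismatches fall in its last `window` bases.

    Returns 'reference', 'addition' or 'mismatch'. Only reads with the same
    length as the reference are handled in this sketch.
    """
    if len(read) != len(reference):
        raise ValueError("sketch only handles reads as long as the reference")
    diffs = [i for i, (a, b) in enumerate(zip(reference, read)) if a != b]
    if not diffs:
        return "reference"
    in_tail = sum(1 for i in diffs if i >= len(reference) - window)
    return "addition" if in_tail >= min_mismatches else "mismatch"


ref = "AAGCTGCCAGTTGAAGAACTGT"   # hsa-miR-22-3p
read = "AAGCTGCCAGTTGAAGAACGGT"  # single change, 3rd base from the 3' end
print(classify_3p_change(ref, read))                    # current rule
print(classify_3p_change(ref, read, min_mismatches=2))  # suggested rule
```

With `min_mismatches=1` the read is labelled as an addition; with `min_mismatches=2` the same read is labelled as a mismatch.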

This is something we can implement easily in the mirtop project, whose output would actually come from bcbio and can be converted into the miRNA files needed by the isomiRs package. Hopefully, we'll improve this calling a lot during the next months, when we get to compare the right data and come up with the best conclusion.

If you are interested in participating in that part of the project, let me know, and I would be happy to add you.

Thanks for all the feedback you add here.

Cheers

@mshadbolt
Author

Yes it is always tricky trying to figure out the best way to call things when there could be many ways of getting to the sequence we detect. Have you also thought about integrating dbSNP annotation to identify common SNPs that might be causing mismatches? I have found in my data there are a few mismatches that coincide with SNPs but it isn't so easy to track since the annotation output doesn't have genomic coordinates. It would be a great addition but could be kind of complicated to implement, particularly for the miRNAs that can come from multiple regions of the genome.

No worries at all. Thanks for all the development you do for the tools in the miRNA and smallncRNA field! I had a look at the mirtop project and I'd be happy to help contribute if there's anything I can do. I will keep an eye on the issues there and see if I can help with anything.

Cheers,
Marion

@lpantano
Owner

lpantano commented Apr 18, 2018 via email

2 participants