3 Prime untemplated versus Mismatches #11

mshadbolt · 2018-03-27T21:31:07Z

Hi, it's me again

I have a question about how the isomiRs/seqbuster pipeline is annotating isomiRs. For example I have these two isomiRs that have been categorised as having untemplated additions:

hsa-miR-22-3p.iso.t5:0.t3:tgt.ad:GGT.mm:0 
hsa-miR-25-3p.iso.t5:0.t3:tga.ad:GGA.mm:0

But I realised they could equally be categorised as having a mismatch at the 3rd base in from the three prime end. Is there a particular reason behind favouring one annotation over another?
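For readers following along, here is a minimal sketch of how such a label can be split into its tags. The helper name `parse_isomir_name` is hypothetical and not part of isomiRs or seqbuster; it only assumes the `name.iso.tag:value` layout visible in the two examples above.

```python
def parse_isomir_name(name):
    """Split a label such as 'hsa-miR-22-3p.iso.t5:0.t3:tgt.ad:GGT.mm:0' into its tags."""
    # Everything before '.iso.' is the miRNA name, everything after is tag:value pairs.
    mirna, _, tags = name.strip().partition(".iso.")
    fields = {"mirna": mirna}
    for tag in tags.split("."):
        key, _, value = tag.partition(":")
        fields[key] = value
    return fields


parsed = parse_isomir_name("hsa-miR-22-3p.iso.t5:0.t3:tgt.ad:GGT.mm:0")
print(parsed["t3"], parsed["ad"], parsed["mm"])
```

Here `t3` holds the nucleotides trimmed from the 3' end, `ad` the nucleotides added, and `mm` is `0` because no mismatch was called.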

Also, if I had changed the argument canonicalAdd to the default TRUE when importing files with IsomirDataSeqFromFiles, would it instead find a mismatch at that position, or would it not be separated out? Or perhaps it would depend on the allele frequency of the mismatch? Or are mismatches effectively not called in the last three positions of the read?

Thanks!

lpantano · 2018-03-30T16:04:23Z

Hi,

that is a very valid question, and this is the story:

No enzyme described in the literature supports the addition of nucleotides other than A/U; this is what I call canonical additions. So those sequences will go away when canonicalAdd = TRUE. Normally, what happens is that those sequences map to another place in the genome, but I am working on showing that in a formal way. The safest option is to remove them. It is true that you can lose real cases, but many of them are cross-mapping events with other genome locations.
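As a rough illustration of that idea (this is not the isomiRs implementation of `canonicalAdd`, and the function name `keep_under_canonical_filter` is made up for the example), a filter on the added nucleotides could look like this. It assumes reads are written in DNA letters, so T stands in for U, and that `0` or an empty string means no addition.

```python
def keep_under_canonical_filter(added):
    """Return True when a sequence would survive an A/U-only addition filter."""
    # '0' (or an empty string) is taken to mean "no addition": nothing to filter on.
    if added in ("", "0"):
        return True
    # Keep only additions made exclusively of A and U (T in DNA-space reads).
    return set(added.upper()) <= {"A", "T", "U"}


for added in ["0", "A", "GGT", "GGA"]:
    print(added, keep_under_canonical_filter(added))
```

Under this sketch both examples from the question (`ad:GGT` and `ad:GGA`) would be dropped, because each contains a G.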

The reason they are called additions is that there is more than 1 mismatch at the end, and it is easier to explain that biologically as an addition than as mismatches. But it is true that this is an arbitrary decision and it will probably be wrong sometimes. More research is needed to have better rules to decide this, which is actually what I am trying to do together with other researchers.

I hope that helps.

Cheers

mshadbolt · 2018-04-10T00:21:48Z

Hi,
Thanks for the clarification, it makes sense. But regarding 'there are more than 1 mismatch at the end': in the cases I mention above there is only one mismatch to the reference. When there are no other 3p changes, the change gets called as an untemplated addition, whereas if there are, we get a mismatch at position 20, e.g. from my raw results out of the seqbuster .counts file:

seq	name	freq	mir	start	end	mism	add	t3	s5	s3	DB	precursor	ambiguity
AAGCTGCCAGTTGAAGAACTGT	seq_453217_x323508	323508	hsa-miR-22-3p	53	74	0	0	0	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACGGT	seq_433474_x29200	29200	hsa-miR-22-3p	53	71	0	GGT	tgt	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACNGT	seq_278454_x18742	18742	hsa-miR-22-3p	53	71	0	NGT	tgt	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACAGT	seq_22437_x5664	5664	hsa-miR-22-3p	53	71	0	AGT	tgt	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACCGT	seq_92605_x3747	3747	hsa-miR-22-3p	53	71	0	CGT	tgt	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACGGTA	seq_609383_x546	546	hsa-miR-22-3p	53	74	20GT	A	0	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACNGTA	seq_147029_x409	409	hsa-miR-22-3p	53	74	20NT	A	0	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACGGTT	seq_219176_x200	200	hsa-miR-22-3p	53	75	20GT	0	T	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACNGTT	seq_234952_x130	130	hsa-miR-22-3p	53	75	20NT	0	T	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1
AAGCTGCCAGTTGAAGAACAGTA	seq_63995_x117	117	hsa-miR-22-3p	53	74	20AT	A	0	GCTAAAGC	CTGTTGCC	miRNA	hsa-mir-22	1

So I guess any single true mismatch in the last three basepairs won't be called as a mismatch if there isn't any other change on the 3 prime end. In this case I don't think it is a real mismatch, and if I use your recommended method of removing 'non-canonical' additions and having a vaf cut-off of 0.2 then these wouldn't make it into my final set of isomiRs anyway but I wanted to point it out as it might not be the behaviour that everyone would expect. I totally understand that it isn't always easy to come up with rules that cover every case and is definitely still an open research question.
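To make the ambiguity concrete, here is a small sketch using the first two rows of the table above. Both helper names are hypothetical; the only assumptions are that the row with no changes is the reference and that `t3:tgt` plus `ad:GGT` means "trim `tgt` from the 3' end, then add `GGT`".

```python
REFERENCE = "AAGCTGCCAGTTGAAGAACTGT"  # hsa-miR-22-3p row with mism=0, add=0, t3=0
READ = "AAGCTGCCAGTTGAAGAACGGT"       # row annotated with add=GGT, t3=tgt


def mismatches(read, reference):
    """List (1-based position, read nt, reference nt) over the shared length."""
    return [
        (i + 1, r, ref)
        for i, (r, ref) in enumerate(zip(read, reference))
        if r != ref
    ]


def trim_plus_addition(reference, t3, ad):
    """Rebuild the read implied by trimming `t3` from the 3' end and adding `ad`."""
    if not reference.lower().endswith(t3.lower()):
        raise ValueError("t3 does not match the 3' end of the reference")
    return reference[: len(reference) - len(t3)] + ad


# Both descriptions lead to the same read.
print(mismatches(READ, REFERENCE))
print(trim_plus_addition(REFERENCE, "tgt", "GGT") == READ)
```

The single difference sits at position 20 (G in the read, T in the reference), which is the same change that the longer reads in the table report as `20GT`.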

lpantano · 2018-04-17T15:16:29Z

Hi,

Thanks for looking into this. And actually, I totally agree with you. The rule is: if there are any mismatches in the last 3 nucleotides, call it an untemplated addition. This will be wrong many times, but it is difficult to know what really happened. It would probably be better to do this only if there are 2 mismatches, and not just one.
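A sketch of that rule, with the number of required mismatches exposed as a parameter so the current behaviour (1) and the suggested one (2) can be compared. This is an illustration only, not the seqbuster code; the function name and the `window` and `min_mismatches` parameters are assumptions, and it only handles reads of the same length as the reference.

```python
def classify_3p_change(read, reference, window=3, min_mismatches=1):
    """Label differences in the last `window` nt as addition or mismatch."""
    if len(read) != len(reference):
        raise ValueError("this sketch only handles equal-length sequences")
    # Count positions that differ within the 3' window.
    n = sum(a != b for a, b in zip(read[-window:], reference[-window:]))
    if n == 0:
        return "no change"
    return "untemplated addition" if n >= min_mismatches else "mismatch"


reference = "AAGCTGCCAGTTGAAGAACTGT"
read = "AAGCTGCCAGTTGAAGAACGGT"
print(classify_3p_change(read, reference, min_mismatches=1))
print(classify_3p_change(read, reference, min_mismatches=2))
```

With `min_mismatches=1` the miR-22-3p read from the table is labelled an untemplated addition; with `min_mismatches=2` the same read is labelled a mismatch.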

This is something we can implement easily in the mirtop project, which would actually be an output of bcbio and can be converted into the miRNA files needed by the isomiRs package. Hopefully, we'll improve all this calling a lot during the next months, when we get to compare the right data to come up with the best conclusion.

If you are interested in participating in that part of the project, let me know, and I would be happy to add you.

Thanks for all the feedback you add here.

Cheers

mshadbolt · 2018-04-17T16:11:25Z

Yes it is always tricky trying to figure out the best way to call things when there could be many ways of getting to the sequence we detect. Have you also thought about integrating dbSNP annotation to identify common SNPs that might be causing mismatches? I have found in my data there are a few mismatches that coincide with SNPs but it isn't so easy to track since the annotation output doesn't have genomic coordinates. It would be a great addition but could be kind of complicated to implement, particularly for the miRNAs that can come from multiple regions of the genome.

No worries at all. Thanks for all the development you do for the tools in the miRNA and smallncRNA field! I had a look at the mirtop project and I'd be happy to help contribute if there's anything I can do. I will keep an eye on the issues there and see if I can help with anything.

Cheers,
Marion

lpantano · 2018-04-18T14:19:53Z

Yes, actually this is something that would be good to have now that mirtop is centralizing the format. The code is there actually, and ideally by the next BOSC CodeFest we can have this quite close to being a reality. I'll add this to the list of issues on GitHub! Thanks for keeping an eye on mirtop, I am sure you can help in some way. Cheers
