Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ispalindromic with respect to reverse or reverse_complement? #303

Closed
Seelengrab opened this issue Mar 22, 2024 · 6 comments · Fixed by #304
Closed

ispalindromic with respect to reverse or reverse_complement? #303

Seelengrab opened this issue Mar 22, 2024 · 6 comments · Fixed by #304

Comments

@Seelengrab
Copy link

Seelengrab commented Mar 22, 2024

I've been trying to fuzz the parsers of BioSequences a bit using Supposition.jl, making use of the very nice interfaces you have (I haven't found any actual bugs so far, and performance is absolutely stellar even for humongous inputs, good job everyone! :D) and came across this:

julia> ispalindromic(LongDNA{4}("AA"))
false

julia> ispalindromic(LongDNA{4}("AT"))
true

Now, I'm not a Bioinformatics guy so this may just be some insider knowledge I don't have, and the docstring of ispalindromic doesn't say, but which of reverse and reverse_complement is ispalindromic referring to when it says that the sequence is palindromic? The current implementation is consistent with reverse_complement, but not with reverse.

If this is just missing from the docs, something like this should clear it up:

diff --git a/src/biosequence/predicates.jl b/src/biosequence/predicates.jl
index 37c2e9d..672680b 100644
--- a/src/biosequence/predicates.jl
+++ b/src/biosequence/predicates.jl
@@ -61,7 +61,18 @@ end
 """
     ispalindromic(seq::BioSequence)
 
-Return `true` if `seq` is a palindromic sequence; otherwise return `false`.
+Return `true` if `seq` is a palindromic sequence with respect to `reverse_complement`;
+otherwise return `false`.
+
+```
+# with respect to `reverse`
+julia> ispalindromic(LongDNA{4}("CGGC"))
+false
+
+# with respect to `reverse_complement`
+julia> ispalindromic(LongDNA{4}("CGCG"))
+true
+```
 """
 function ispalindromic(seq::BioSequence{<:NucleicAcidAlphabet})
     for i in 1:cld(length(seq), 2)

This was found using the following fuzzing setup:

julia> using Supposition, BioSequences

julia> dna_string = Data.Text(Data.SampledFrom(('A','C','T','G')));

julia> palindromic_dna = map(dna_string) do s
           b = LongDNA{4}(s)
           b*reverse(b)
       end;

julia> @check ispalindromic(palindromic_dna);
┌ Error: Property doesn't hold!
│   Description = "ispalindromic"
│   Example = (AA,)
└ @ Supposition ~/.julia/packages/Supposition/KpGkN/src/testset.jl:292
Test Summary: | Fail  Total  Time
ispalindromic |    1      1  0.0s

the complementary construction with reverse_complement passes successfully (it tries 10_000 sequences by default, so that's why it takes almost a second to run through):

julia> palindromic_compl_dna = map(dna_string) do s
           b = LongDNA{4}(s)
           b*reverse_complement(b)
       end;

julia> @check ispalindromic(palindromic_compl_dna);
Test Summary: | Pass  Total  Time
ispalindromic |    1      1  0.8s
@Seelengrab
Copy link
Author

Seelengrab commented Mar 22, 2024

I've noticed that the fuzzer I've been using above is a bit too friendly, only producing inputs with even length. Adjusted for that, ispalindromic claims that inputs with odd length aren't palindromic either, no matter with respect to which reverse they are constructed:

julia> using Supposition, BioSequences

julia> dna_string = Data.Text(Data.SampledFrom(('A','C','T','G')));

julia> @check (s=dna_string, m=Data.Just("") | Data.SampledFrom(('A','C','T','G'))) -> begin
           b = LongDNA{4}(s)
           c = b*LongDNA{4}(m)*reverse_complement(b)
           event!("Complete sequence", c)
           ispalindromic(c)
       end;
Events occured: 1
    Complete sequence
        A
┌ Error: Property doesn't hold!
│   Description = "##SuppositionAnon#560"
│   Example = (s = "", m = 'A')
└ @ Supposition ~/.julia/packages/Supposition/KpGkN/src/testset.jl:292
Test Summary:         | Fail  Total  Time
##SuppositionAnon#560 |    1      1  0.0s

I'm not a Bioinformatician so I don't really know what the exact semantics of ispalindromic are and considering the current implementation is consistent with reverse_complement, this may be intentional. If I'm understanding this correctly, there are no self-complementary nucleotides, so with respect to reverse_complement, no DNA sequence of odd length can ever be considered palindromic, right? If that's true, it would probably be good to mention that in the docstring too. Something like this maybe:

!!! note "Odd length sequences"
     Since `ispalindromic` is defined with respect to `reverse_complement` and there are no self-complementary
     nucleotides, sequences with odd `length` are never palindromic.

Could also be an opportunity for some optimization, since the length is cached as far as I can tell.

@jakobnissen
Copy link
Member

You're right - palindromic sequences are defined in terms of reverse-complementation since this is more relevant biologically (nucleotide reversion doesn't occur in nature, but reverse-comeplementation happens all the time). It's also true that odd-length sequences can never be palindromic. I recall vaguely that this is leveraged in genome assemblers, that only use odd-length kmers, which simplifies the implementation of strand-specific assembly de Bruijn graphs.

A documentation update is in order. The current docstring is not very.. explanatory.

@jakobnissen
Copy link
Member

Ah, the tests has a counterexample with a palindromic odd-length sequence: ANT. This is because N is a degenerate base meaning "unknown" base.

@jakobnissen jakobnissen linked a pull request Mar 22, 2024 that will close this issue
@Seelengrab
Copy link
Author

Ah, the tests has a counterexample with a palindromic odd-length sequence: ANT. This is because N is a degenerate base meaning "unknown" base.

A case that my very naive fuzzing attempt with "ACTG" obviously doesn't hit, makes sense! Glad my naive fuzzing can help clarify that for either other newbies to Bioinformatics or forgetful people :D

@kescobo
Copy link
Member

kescobo commented Mar 22, 2024

good job everyone!

Let's be honest - good job @jakobnissen and @bicycle1885 :-P

@Seelengrab
Copy link
Author

Yes indeed - thanks too for the quick turnaround here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants