Fix csq consequence prediction: sref offset, OOB codon, stop codon, splice #2527

Open
sirus20x6 wants to merge 3 commits into samtools:develop from sirus20x6:fix/csq-consequence-prediction

@sirus20x6

Summary

  • Fix the trailing N_REF_PAD copy offset in the spliced reference: it copied from the beginning of the last CDS exon instead of from just after it
  • Clamp the codon index to 0-63 in the dna2aa/cdna2aa macros: non-ACGT bases (N) produced index 84, reading past the 64-char gencode string
  • Fix the stop codon search, which compared the found index against tseq.l instead of tseq_stop.l (2 occurrences)
  • Fix test_splice, which checked the hardcoded allele[1] instead of the loop variable allele[i]
  • Fix the missing pos assignment for reverse-strand HAP_SSS consequences
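
The offset fix in the first bullet can be pictured with a small standalone sketch. It is not the code from csq.c: `PAD`, `exon_t` and `splice_ref` are hypothetical names, and the layout (a reference string with at least `PAD` bases on each side of the exons, at least one exon) is an assumption made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define PAD 3   /* hypothetical stand-in for N_REF_PAD */

/* Hypothetical exon record: start offset into the padded reference, and length. */
typedef struct { size_t beg, len; } exon_t;

/*
 * Build PAD leading bases + concatenated exons + PAD trailing bases.
 * Assumes ncds >= 1, that ref holds at least PAD bases before the first
 * exon and after the last one, and that out has room for
 * 2*PAD + sum(len) + 1 bytes. Returns the length of the spliced string.
 */
size_t splice_ref(const char *ref, const exon_t *cds, size_t ncds, char *out)
{
    size_t n = 0, i;
    const exon_t *last = &cds[ncds - 1];

    /* Leading pad: the PAD bases just before the first exon. */
    memcpy(out + n, ref + cds[0].beg - PAD, PAD);
    n += PAD;

    /* Exon bodies, concatenated in order. */
    for (i = 0; i < ncds; i++) {
        memcpy(out + n, ref + cds[i].beg, cds[i].len);
        n += cds[i].len;
    }

    /* Trailing pad starts after the last exon: beg + len, not beg. */
    memcpy(out + n, ref + last->beg + last->len, PAD);
    n += PAD;

    out[n] = '\0';
    return n;
}
```

With the reference "nnnAAAAggggCCCCttt" and exons at offsets 3 and 11 (length 4 each), this yields "nnnAAAACCCCttt". Copying the trailing pad from `last->beg` instead would yield "nnnAAAACCCCCCC", repeating the start of the last exon.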

Fixes #2478. Relates to #2393.

Test plan

  • Existing test suite passes (1920/1920)
  • Verify csq with transcripts containing N bases in reference
  • Verify csq --force no longer crashes on sanity_check_ref assertion

When a reverse-strand compound variant has ibeg forced to HAP_SSS
(due to frameshift + start_lost), csq->pos was set from ref_node
(=iend) but csq_push was called with ibeg's record. Since they are
at different genomic positions, the vbuf lookup at iend's position
could not find ibeg's record, triggering:

  "This should not happen.. <chr>:<pos> <variant>"

Fix by setting csq->pos to ibeg's position in the SSS branch so the
position and record are consistent for the vbuf lookup.

1. tscript_splice_ref: Fix trailing pad copy offset to start after the
   last CDS exon (was incorrectly copying from the beginning of it).

2. dna2aa/cdna2aa/dna2stop/cdna2stop macros: Guard against out-of-bounds
   access when non-ACGT bases (e.g. 'N') produce codon indices > 63.
   Return 'X' (unknown amino acid) or 0 (not a stop) instead.

3. hap_add_csq and test_cds_local: After searching tseq_stop for '*',
   compare the found index against tseq_stop.l (not tseq.l), matching
   the string that was actually iterated.

4. test_splice: Use loop variable rec->d.allele[i] instead of hardcoded
   rec->d.allele[1] when checking for symbolic/missing alleles.
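
Items 2 and 3 can be illustrated together in a minimal standalone sketch. It is not the macro code from csq.c: `GENCODE`, `base_code`, `codon_to_aa`, `translate` and `find_stop` are hypothetical names, and the table is the standard genetic code indexed as `b0<<4 | b1<<2 | b2` with A=0, C=1, G=2, T=3. It assumes non-ACGT bases map to 4, which is consistent with the index 84 quoted above (4<<4 | 4<<2 | 4 = 84).

```c
#include <stddef.h>
#include <string.h>

/* Standard genetic code, 64 entries, indexed by b0<<4 | b1<<2 | b2 (A,C,G,T = 0..3). */
static const char GENCODE[] =
    "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF";

/* Map a base to 0..3, or to 4 for anything that is not A, C, G or T. */
static int base_code(char c)
{
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return 4;
    }
}

/* Translate one codon; return 'X' (unknown) if any base is not ACGT. */
char codon_to_aa(const char *c)
{
    int b0 = base_code(c[0]), b1 = base_code(c[1]), b2 = base_code(c[2]);
    if (b0 > 3 || b1 > 3 || b2 > 3) return 'X';   /* never index past entry 63 */
    return GENCODE[b0 << 4 | b1 << 2 | b2];
}

/* Translate whole codons of dna into aa (NUL-terminated); trailing bases are ignored.
   aa must hold strlen(dna)/3 + 1 bytes. Returns the number of amino acids written. */
size_t translate(const char *dna, char *aa)
{
    size_t n = strlen(dna) / 3, i;
    for (i = 0; i < n; i++) aa[i] = codon_to_aa(dna + 3 * i);
    aa[n] = '\0';
    return n;
}

/* Index of the first '*' in seq[0..len), or len if there is none. */
size_t find_stop(const char *seq, size_t len)
{
    size_t i = 0;
    while (i < len && seq[i] != '*') i++;
    return i;
}
```

A caller decides whether a stop was found with `find_stop(seq, len) < len`, using the length of the same string that was scanned; comparing against the length of a different string is the mismatch item 3 describes. Checking each base separately is a choice of this sketch: it also covers a single N in the second or third position, where the combined index would stay below 64.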

Development

Successfully merging this pull request may close these issues.

bcftools csq throw sanity_check_ref error and could not pass by --force
