Improve CRAM reader input validation#1991

Merged
daviesrob merged 10 commits into samtools:develop from daviesrob:cram-work on Mar 18, 2026
Conversation


@daviesrob daviesrob commented Mar 16, 2026

Fixes multiple cases where the CRAM reader could misbehave when reading invalid input files.

daviesrob and others added 10 commits March 12, 2026 12:01
The old version triggered one byte too late.  However, care is
needed: it is permissible to completely fill the output buffer,
as long as the next input byte is the stop byte.
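As a hedged illustration of that boundary condition (function name and signature are invented here, not htslib's), a stop-byte terminated copy can legitimately fill the whole output buffer provided the terminator follows immediately:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a stop-byte terminated copy, in the spirit of CRAM's
 * BYTE_ARRAY_STOP codec.  Completely filling the output buffer is
 * allowed, but only when the very next input byte is the stop byte;
 * detecting the overflow one byte later would permit a one-byte
 * overrun.  Returns bytes written, or -1 on overflow or truncation. */
static ptrdiff_t copy_until_stop(const unsigned char *in, size_t in_len,
                                 unsigned char stop,
                                 unsigned char *out, size_t out_len) {
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == stop)
            return (ptrdiff_t) n;   /* terminator arrived in time */
        if (n >= out_len)
            return -1;              /* next byte would overflow out */
        out[n++] = in[i];
    }
    return -1;                      /* input ended before the stop byte */
}
```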
These codecs do not currently support BYTE_ARRAY type
outputs, so they must be rejected if a file attempts to
specify them in a context where array data is expected.
* Ensure ref_id is valid when reading CRAM RI data series

In multi-ref containers, the reference IDs are stored in the RI
data series.  These need to be checked to ensure they are in
the expected range.
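A minimal sketch of such a range check (hypothetical helper, assuming -1 denotes an unmapped/absent reference as in SAM/CRAM):

```c
#include <assert.h>

/* A reference ID decoded from the RI data series of a multi-ref
 * container must be either -1 (unmapped) or a valid index into the
 * nref references declared in the file header.  Anything else
 * indicates a corrupt or malicious file and must be rejected. */
static int ref_id_valid(int ref_id, int nref) {
    return ref_id == -1 || (ref_id >= 0 && ref_id < nref);
}
```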

* Add check to ensure mate ref id is valid when decoding CRAM

  This error can be caught early when reading the AP data series.
  Records that span beyond the end of the reference are dealt
  with later.
* Fix out-by-one in check for CRAM BB series data

* Fix out-by-one in check for CRAM BA series data

* Reject cram reads that end past slice reference extents

  Copies a check added in efe502d and applies it to the case
  where matching bases are added from the last feature up to
  the reported read length, and the ending reference position
  is beyond the reference length listed in the header.

  The check needs to deal with:

   * Records where no seq / qual was stored (which have cr->len == 0)
   * Features which write to seq / qual, which cannot be placed
     after the end of the read
   * Features like deletions and ref skips that do not write to
     seq or qual, and may be placed immediately after the end of
     the read
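The three cases above can be sketched as a single validity test (hypothetical names; CRAM feature positions are 1-based relative to the read):

```c
#include <assert.h>

/* A feature that writes into seq/qual must land strictly within the
 * read (pos in 1..read_len).  Features that write nothing, such as
 * deletions and ref skips, may also sit immediately after the last
 * base (pos == read_len + 1).  Records stored with no sequence have
 * read_len == 0, so only a non-writing feature at pos 1 is legal. */
static int feature_pos_ok(int pos, int read_len, int writes_seq) {
    if (writes_seq)
        return pos >= 1 && pos <= read_len;
    return pos >= 1 && pos <= read_len + 1;
}
```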
In SAM, these have sequence "*".  In CRAM, this is indicated
by setting cram_record::len to 0, although strangely some
dummy sequence data needs to be stored in the file for these
records so that feature lengths can be returned (e.g. for CIGAR
matches).  In such cases, the data read needs to be discarded,
and nothing should be written to the seq or qual strings.
Quality data probably shouldn't be present either for these,
but if it is, it also needs to be discarded. This commit fixes
some cases where this was not being handled correctly.

 * The code for copying over unchanged parts of the reference
   needs to skip the actual copy when seq is not stored.

 * The soft clip handler has fall-backs for when the IN or SC
   data stream is missing.  These need to be prevented from
   writing to seq if it isn't present.

 * When processing the QS data stream, suppress storing
   the quality values when cr->len == 0.

 * If CRAM_FLAG_PRESERVE_QUAL_SCORES is false, don't try
   to add default quality values if cr->len == 0, and also
   don't try to check that the first quality value is 255
   (which is used for the case where seq was present but
   the quality values were not).
Setting out to NULL indicates that the data should be
discarded.  This is mainly used in the case of records
where the sequence or quality values were not present,
but some dummy data is needed in order to reconstruct
the CIGAR and MD records.

* Make const and xrle char decoders work with out == NULL

* Add missed case for NULL out passed to xpack char decoder
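As a hedged sketch of the out == NULL convention these commits apply (invented function, not the htslib decoders themselves), a constant-value char decoder still accounts for the requested symbols but writes nothing when the caller asks for the data to be discarded:

```c
#include <assert.h>
#include <string.h>

/* Emits n copies of a constant value, as a const-style char decoder
 * might.  When out == NULL the decoded data is deliberately
 * discarded, e.g. dummy seq/qual bytes stored for records whose
 * sequence is absent (cr->len == 0).  Returns 0 on success. */
static int const_decode(char value, char *out, int n) {
    if (n < 0)
        return -1;
    if (out)
        memset(out, value, (size_t) n);  /* normal path: fill output */
    /* out == NULL: consume the request but store nothing */
    return 0;
}
```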
@daviesrob daviesrob changed the title from "Placeholder for cram work" to "Improve CRAM reader input validation" on Mar 18, 2026
@daviesrob daviesrob marked this pull request as ready for review March 18, 2026 17:04
@daviesrob daviesrob merged commit 654a04a into samtools:develop Mar 18, 2026
9 checks passed
@daviesrob daviesrob deleted the cram-work branch March 19, 2026 09:58