Improve CRAM reader input validation#1991

Merged
daviesrob merged 10 commits into samtools:develop from daviesrob:cram-work on Mar 18, 2026
Conversation


@daviesrob daviesrob commented Mar 16, 2026

Fixes multiple cases where the CRAM reader could misbehave when reading invalid input files.

daviesrob and others added 10 commits March 12, 2026 12:01
The old version triggered one byte too late.  However, care is
needed: it is permissible to completely fill the output buffer,
as long as the next input byte is the stop byte.
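As a hedged illustration of that boundary condition (function name and signature are invented here, not htslib's), a stop-byte terminated copy can legitimately fill the whole output buffer provided the terminator follows immediately:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a stop-byte terminated copy, in the spirit of CRAM's
 * BYTE_ARRAY_STOP codec.  Completely filling the output buffer is
 * allowed, but only when the very next input byte is the stop byte;
 * detecting the overflow one byte later would permit a one-byte
 * overrun.  Returns bytes written, or -1 on overflow or truncation. */
static ptrdiff_t copy_until_stop(const unsigned char *in, size_t in_len,
                                 unsigned char stop,
                                 unsigned char *out, size_t out_len) {
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] == stop)
            return (ptrdiff_t) n;   /* terminator arrived in time */
        if (n >= out_len)
            return -1;              /* next byte would overflow out */
        out[n++] = in[i];
    }
    return -1;                      /* input ended before the stop byte */
}
```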
These codecs do not currently support BYTE_ARRAY type
outputs, so they must be rejected if a file attempts to
specify them in a context where array data is expected.
* Ensure ref_id is valid when reading CRAM RI data series

In multi-ref containers, the reference IDs are stored in the RI
data series.  These need to be checked to ensure they are in
the expected range.
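A minimal sketch of such a range check (hypothetical helper, assuming -1 denotes an unmapped/absent reference as in SAM/CRAM):

```c
#include <assert.h>

/* A reference ID decoded from the RI data series of a multi-ref
 * container must be either -1 (unmapped) or a valid index into the
 * nref references declared in the file header.  Anything else
 * indicates a corrupt or malicious file and must be rejected. */
static int ref_id_valid(int ref_id, int nref) {
    return ref_id == -1 || (ref_id >= 0 && ref_id < nref);
}
```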

* Add check to ensure mate ref id is valid when decoding CRAM

  This error can be caught early when reading the AP data series.
  Records that span beyond the end of the reference are dealt
  with later.
* Fix out-by-one in check for CRAM BB series data

* Fix out-by-one in check for CRAM BA series data

* Reject cram reads that end past slice reference extents

  Copies a check added in efe502d and applies it to the case
  where matching bases are added from the last feature up to
  the reported read length, and the ending reference position
  is beyond the reference length listed in the header.

  The check needs to deal with:

   * Records where no seq / qual was stored (which have cr->len == 0)
   * Features which write to seq / qual, which cannot be placed
     after the end of the read
   * Features like deletions and ref skips that do not write to
     seq or qual, and may be placed immediately after the end of
     the read
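The three cases above can be sketched as a single validity test (hypothetical names; CRAM feature positions are 1-based relative to the read):

```c
#include <assert.h>

/* A feature that writes into seq/qual must land strictly within the
 * read (pos in 1..read_len).  Features that write nothing, such as
 * deletions and ref skips, may also sit immediately after the last
 * base (pos == read_len + 1).  Records stored with no sequence have
 * read_len == 0, so only a non-writing feature at pos 1 is legal. */
static int feature_pos_ok(int pos, int read_len, int writes_seq) {
    if (writes_seq)
        return pos >= 1 && pos <= read_len;
    return pos >= 1 && pos <= read_len + 1;
}
```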
In SAM, these have sequence "*".  In CRAM, this is indicated
by setting cram_record::len to 0, although strangely some
dummy sequence data needs to be stored in the file for these
records so that feature lengths can be returned (e.g. for CIGAR
matches).  In such cases, the data read needs to be discarded,
and nothing should be written to the seq or qual strings.
Quality data probably shouldn't be present either for these,
but if it is, it also needs to be discarded. This commit fixes
some cases where this was not being handled correctly.

 * The code for copying over unchanged parts of the reference
   needs to skip the actual copy when seq is not stored.

 * The soft clip handler has fall-backs for when the IN or SC
   data stream is missing.  These need to be prevented from
   writing to seq if it isn't present.

 * When processing the QS data stream, suppress storing
   the quality values when cr->len == 0.

 * If CRAM_FLAG_PRESERVE_QUAL_SCORES is false, don't try
   to add default quality values if cr->len == 0, and also
   don't try to check that the first quality value is 255
   (which is used for the case where seq was present but
   the quality values were not).
Setting out to NULL indicates that the data should be
discarded.  This is mainly used in the case of records
where the sequence or quality values were not present,
but some dummy data is needed in order to reconstruct
the CIGAR and MD records.

* Make const and xrle char decoders work with out == NULL

* Add missed case for NULL out passed to xpack char decoder
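As a hedged sketch of the out == NULL convention these commits apply (invented function, not the htslib decoders themselves), a constant-value char decoder still accounts for the requested symbols but writes nothing when the caller asks for the data to be discarded:

```c
#include <assert.h>
#include <string.h>

/* Emits n copies of a constant value, as a const-style char decoder
 * might.  When out == NULL the decoded data is deliberately
 * discarded, e.g. dummy seq/qual bytes stored for records whose
 * sequence is absent (cr->len == 0).  Returns 0 on success. */
static int const_decode(char value, char *out, int n) {
    if (n < 0)
        return -1;
    if (out)
        memset(out, value, (size_t) n);  /* normal path: fill output */
    /* out == NULL: consume the request but store nothing */
    return 0;
}
```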
@daviesrob daviesrob changed the title from "Placeholder for cram work" to "Improve CRAM reader input validation" on Mar 18, 2026
@daviesrob daviesrob marked this pull request as ready for review March 18, 2026 17:04
@daviesrob daviesrob merged commit 654a04a into samtools:develop Mar 18, 2026
9 checks passed
@daviesrob daviesrob deleted the cram-work branch March 19, 2026 09:58