Improve CRAM reader input validation#1991
Merged
daviesrob merged 10 commits intosamtools:developfrom Mar 18, 2026
Merged
Conversation
The old version triggered one byte too late. However nuance is needed as it's permissible to completely fill the output buffer as long as the next input character is the stop byte.
These codecs do not currently support BYTE_ARRAY type outputs. It's necessary to reject them if a file attempts to specify them in a context where array data is expected.
* Ensure ref_id is valid when reading CRAM RI data series In multi-ref containers, the reference IDs are stored in the RI data series. These need to be checked to ensure they are in the expected range. * Add check to ensure mate ref id is valid when decoding cram
This error can be caught early when reading the AP data series. Records that span beyond the end of the reference are dealt with later.
* Fix out-by-one in check for CRAM BB series data * Fix out-by-one in check for CRAM BA series data * Reject cram reads that end past slice reference extents Copies a check added in efe502d and applies it to the case where matching bases are added from the last feature up to the reported read length, and the ending reference position is beyond the reference length listed in the header.
The check needs to deal with: * Records where no seq / qual was stored (which have cr->len == 0) * Features which write to seq / qual, which cannot be placed after the end of the read * Features like deletions and ref skips that do not write to seq or qual, and may be placed immediately after the end of the read
In SAM, these have sequence "*". In CRAM, this is indicated by setting cram_record::len to 0, although strangely some dummy sequence data needs to be stored in the file for these records so that feature lengths can be returned (e.g. for CIGAR matches). In such cases, the data read needs to be discarded, and nothing should be written to the seq or qual strings. Quality data probably shouldn't be present either for these, but if it is, it also needs to be discarded. This commit fixes some cases where this was not being handled correctly. * The code for copying over unchanged parts of the reference needs to skip the actual copy when seq is not stored. * The soft clip handler has fall-backs for when the IN or SC data stream is missing. These need to be prevented from writing to seq if it isn't present. * When processing the QS data stream, suppress storing the quality values when cr->len == 0. * If CRAM_FLAG_PRESERVE_QUAL_SCORES is false, don't try to add default quality values if cr->len == 0, and also don't try to check that the first quality value is 255 (which is used for the case where seq was present but the quality values were not).
Setting out to NULL indicates that the data should be discarded. This is mainly used in the case of records where the sequence or quality values were not present, but some dummy data is needed in order to reconstruct the CIGAR and MD records. * Make const and xrle char decoders work with out == NULL * Add missed case for NULL out passed to xpack char decoder
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes multiple cases where the CRAM reader could misbehave on reading invalid input files.