Add API functions to parse and format SEQ and QUAL fields by jmarshall · Pull Request #1974 · samtools/htslib

jmarshall · 2025-11-26T03:32:25Z

This adds functions to the public API to pack and unpack the SEQ and QUAL fields individually, enabling third-party code to take advantage of the optimised and SIMD-optimized implementations of this functionality that HTSlib provides.

The format ones were the motivation for this; in particular they will be immediately useful for pysam. The parse ones are perhaps of less widespread use (at least in their current form) as usually if writing to a bam1_t there’ll need to be some memory reallocating going on too. But I think pysam would benefit from accessing HTSlib’s implementations of these too.

The optimised COPY_MINUS_N() implementation declares its own locals. Make the simple COPY_MINUS_N() declare its own local i rather than shadowing a local of that name from its caller.

daviesrob · 2025-12-18T14:11:33Z

This mostly looks OK to me. It might be good to mention in the documentation exactly how big the destination buffer needs to be, especially for bam_format_seq() and bam_format_qual() where it's a bit less obvious. Also exactly what's going on in sam_format_qual() and sam_parse_qual() - that 33 is added to / subtracted from the values to convert them to / from printable ASCII; and that the sam_parse_qual() error state means that conversion failed because one of the input values was too small.

* State that bam_format_seq() and bam_format_qual() destination buffers must be at least b->core.l_qseq bytes long * Note that ASCII values follow the SAM QUAL format * Explain how sam_parse_qual() can fail * sam_parse_seq() always succeeds, so won't return a negative value on error.

These both take `size_t len` and on success return the number of bytes written, so the return type needs to be able to handle a similar magnitude. As sam_parse_seq() cannot fail, it can return size_t. As sam_parse_qual() may need to report an error, it is updated to return ssize_t. In addition, change the second COPY_MINUS_N() implementation to use size_t for its loop variable (the first one already used it).

daviesrob · 2026-03-27T16:13:31Z

Sorry to be a while getting back to this. I've taken the liberty of adding commits with some documentation updates, and to change the return types of sam_parse_seq() and sam_parse_qual() as int looked a bit small compared to the size_t len parameter.
Please let me know if they look OK to you.

I've added documentation to bam_format_seq() and bam_format_qual() emphasising how big the destination buffer should be, but what would you think to adding a len parameter in snprintf() style? There may be benefits to making the length available explicit rather then implicit.

jkbonfield · 2026-04-02T11:16:23Z

I don't particularly like the function names as they're somewhat ambiguous.

We have a mix of things used currently. bam_get_cigar queries a bam structure and returned the cigar field. Here "bam" is the source. bam_parse_cigar turns a cigar string into the format used in bam, so "bam" here is the destination. (Poorly named and I think could be bam_format_cigar, but too late!) sam_format_aux1 takes BAM encoded aux fields and converts to SAM format, so SAM is the destination again.

Typically it looks like the first part of our function names is where we're writing to if we're encoding data and reading from if we're simply returning it or processing (eg pileup). We typically don't have the name of the other encoding type anywhere in the function. So sam_format_aux1 is essentially BAM2SAM conversion, but only the SAM bit is listed.

This works for sam_format_seq which is turning BAM nibble encoded data into SAM string form.

However bam_format_seq is not the reverse and does not create nibble data. It causes an ambiguity as now "bam" is the type of data we're coming from rather than the type of data we're writing to, and as mentioned above mostly when we're modifying records it's the destination type which is the first part of the function name.

Maybe we just need to rename both functions to be explicit. Eg sam_format_seq_uint8 and sam_format_seq_bam1? I'm not convinced I liked it much more, but I can't think of anything better currently.

jkbonfield · 2026-04-02T11:25:37Z

The other thought is maybe it's the sam_ prefix which is wrong here. If we had bam_unpack_seq and bam_unpack_qual then it fits the standard model which is as we're not modifying anything then the function is named in a similar way to an accessor function or a processing function, where the data format is the function name prefix. If so, what's the naked version of this without a bam1_t pointer? It's not really SAM as the data isn't in SAM textual format and it's still nibble packed or qual-33. Maybe a generic hts_ prefix instead?

Really what I'm getting at is simply having both sam_ and bam_ functions where neither the source format is differing nor the destination format is differing is confusing.

jmarshall added 3 commits December 11, 2025 18:56

Make nibble2base() et al's source parameter a pointer to const

5c81990

Reduce local variable shadowing and fix second COPY_MINUS_N function

2d2d5aa

The optimised COPY_MINUS_N() implementation declares its own locals. Make the simple COPY_MINUS_N() declare its own local i rather than shadowing a local of that name from its caller.

Add API functions to parse and format SEQ and QUAL fields

9de8e4f

jmarshall force-pushed the seq-qual-accessors branch from e413e21 to 9de8e4f Compare December 11, 2025 08:57

jmarshall marked this pull request as ready for review December 11, 2025 08:58

daviesrob self-assigned this Dec 18, 2025

daviesrob added 2 commits March 27, 2026 15:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add API functions to parse and format SEQ and QUAL fields#1974

Add API functions to parse and format SEQ and QUAL fields#1974
jmarshall wants to merge 5 commits intosamtools:developfrom
jmarshall:seq-qual-accessors

jmarshall commented Nov 26, 2025 •

edited

Loading

Uh oh!

daviesrob commented Dec 18, 2025

Uh oh!

daviesrob commented Mar 27, 2026

Uh oh!

jkbonfield commented Apr 2, 2026 •

edited

Loading

Uh oh!

jkbonfield commented Apr 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jmarshall commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

daviesrob commented Dec 18, 2025

Uh oh!

daviesrob commented Mar 27, 2026

Uh oh!

jkbonfield commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkbonfield commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jmarshall commented Nov 26, 2025 •

edited

Loading

jkbonfield commented Apr 2, 2026 •

edited

Loading

jkbonfield commented Apr 2, 2026 •

edited

Loading