Skip to content

Add API functions to parse and format SEQ and QUAL fields#1974

Open
jmarshall wants to merge 5 commits intosamtools:developfrom
jmarshall:seq-qual-accessors
Open

Add API functions to parse and format SEQ and QUAL fields#1974
jmarshall wants to merge 5 commits intosamtools:developfrom
jmarshall:seq-qual-accessors

Conversation

@jmarshall
Copy link
Copy Markdown
Member

@jmarshall jmarshall commented Nov 26, 2025

This adds functions to the public API to pack and unpack the SEQ and QUAL fields individually, enabling third-party code to take advantage of the optimised and SIMD-optimized implementations of this functionality that HTSlib provides.

The format ones were the motivation for this; in particular they will be immediately useful for pysam. The parse ones are perhaps of less widespread use (at least in their current form) as usually if writing to a bam1_t there’ll need to be some memory reallocating going on too. But I think pysam would benefit from accessing HTSlib’s implementations of these too.

The optimised COPY_MINUS_N() implementation declares its own locals.
Make the simple COPY_MINUS_N() declare its own local i rather than
shadowing a local of that name from its caller.
@jmarshall jmarshall marked this pull request as ready for review December 11, 2025 08:58
@daviesrob daviesrob self-assigned this Dec 18, 2025
@daviesrob
Copy link
Copy Markdown
Member

This mostly looks OK to me. It might be good to mention in the documentation exactly how big the destination buffer needs to be, especially for bam_format_seq() and bam_format_qual() where it's a bit less obvious. Also exactly what's going on in sam_format_qual() and sam_parse_qual() - that 33 is added to / subtracted from the values to convert them to / from printable ASCII; and that the sam_parse_qual() error state means that conversion failed because one of the input values was too small.

* State that bam_format_seq() and bam_format_qual() destination
  buffers must be at least b->core.l_qseq bytes long

* Note that ASCII values follow the SAM QUAL format

* Explain how sam_parse_qual() can fail

* sam_parse_seq() always succeeds, so won't return a negative
  value on error.
These both take `size_t len` and on success return the number
of bytes written, so the return type needs to be able to handle
a similar magnitude.

As sam_parse_seq() cannot fail, it can return size_t.

As sam_parse_qual() may need to report an error, it is updated
to return ssize_t.

In addition, change the second COPY_MINUS_N() implementation to
use size_t for its loop variable (the first one already used it).
@daviesrob
Copy link
Copy Markdown
Member

Sorry to be a while getting back to this. I've taken the liberty of adding commits with some documentation updates, and to change the return types of sam_parse_seq() and sam_parse_qual() as int looked a bit small compared to the size_t len parameter.
Please let me know if they look OK to you.

I've added documentation to bam_format_seq() and bam_format_qual() emphasising how big the destination buffer should be, but what would you think to adding a len parameter in snprintf() style? There may be benefits to making the length available explicit rather then implicit.

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented Apr 2, 2026

I don't particularly like the function names as they're somewhat ambiguous.

We have a mix of things used currently. bam_get_cigar queries a bam structure and returned the cigar field. Here "bam" is the source. bam_parse_cigar turns a cigar string into the format used in bam, so "bam" here is the destination. (Poorly named and I think could be bam_format_cigar, but too late!) sam_format_aux1 takes BAM encoded aux fields and converts to SAM format, so SAM is the destination again.

Typically it looks like the first part of our function names is where we're writing to if we're encoding data and reading from if we're simply returning it or processing (eg pileup). We typically don't have the name of the other encoding type anywhere in the function. So sam_format_aux1 is essentially BAM2SAM conversion, but only the SAM bit is listed.

This works for sam_format_seq which is turning BAM nibble encoded data into SAM string form.

However bam_format_seq is not the reverse and does not create nibble data. It causes an ambiguity as now "bam" is the type of data we're coming from rather than the type of data we're writing to, and as mentioned above mostly when we're modifying records it's the destination type which is the first part of the function name.

Maybe we just need to rename both functions to be explicit. Eg sam_format_seq_uint8 and sam_format_seq_bam1? I'm not convinced I liked it much more, but I can't think of anything better currently.

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented Apr 2, 2026

The other thought is maybe it's the sam_ prefix which is wrong here. If we had bam_unpack_seq and bam_unpack_qual then it fits the standard model which is as we're not modifying anything then the function is named in a similar way to an accessor function or a processing function, where the data format is the function name prefix. If so, what's the naked version of this without a bam1_t pointer? It's not really SAM as the data isn't in SAM textual format and it's still nibble packed or qual-33. Maybe a generic hts_ prefix instead?

Really what I'm getting at is simply having both sam_ and bam_ functions where neither the source format is differing nor the destination format is differing is confusing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants