Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a function to extract the sort order from an AlignmentHeader #193

Open
msto opened this issue Nov 14, 2024 · 8 comments
Open

Add a function to extract the sort order from an AlignmentHeader #193

msto opened this issue Nov 14, 2024 · 8 comments

Comments

@msto
Copy link
Contributor

msto commented Nov 14, 2024

Requested by @yfarjoun

A new enum, SortOrder, should be added to enumerate the valid sort orders.

A new function, get_sort_order(header: pysam.AlignmentHeader) -> SortOrder | None should be implemented.

This function should extract the sort order from the header when possible. Specifically, the function should:

  1. Validate the presence of HD in the header. If it is absent, return None.
  2. Validate the presence of the key SO in the HD dictionary. If it is absent, return None.
  3. Attempt to parse the sort-order from the value associated with the SO key. If it is valid, the corresponding SortOrder member should be returned. If it is invalid, either raise a ValueError or return an SortOrder.UNKNOWN member, possibly with the raw value somehow associated.
@msto
Copy link
Contributor Author

msto commented Nov 14, 2024

@tfenne Any thoughts on how we should resolve (3) if an invalid/unknown sort-order value exists?

@yfarjoun
Copy link
Contributor

fgpyo has the enum already (fgpyo/sam/init.py:1015) and there's an Unknown (which should probably be returned instead of "None")

I'd leave special error throwing or raw-value-returning out of it, and keep it simple and predictable.

@yfarjoun
Copy link
Contributor

so maybe we can add a from_header method to the SamOrder enum that will take in a pysam header and spit out an Enum instance?

@msto
Copy link
Contributor Author

msto commented Nov 14, 2024

Oh nice, I didn't know we already had that enum.

In the course of resolving this issue we should also update that enum to improve its documentation (i.e. include the member attributes in the docstring so they show up in the rendered docs) and consider getting it inline with our usual enum conventions (all-caps member names).

We should also consider whether to include sort orders that aren't part of htslib. I'm thinking of template-coordinate sort order, in particular.

@msto
Copy link
Contributor Author

msto commented Nov 14, 2024

In an ideal world this would probably be a method on AlignmentHeader, but since we don't have such powers, adding a classmethod to the enum seems reasonable. (I'd probably write it as a standalone utility function, but will leave it up to Tim.)

@msto
Copy link
Contributor Author

msto commented Nov 14, 2024

Re: returning "UNKNOWN" - this is an edge case in the SAM spec I'm not actually familiar with. Is there a distinction between an "unknown" sort order and an invalid sort order? i.e. is "unknown" reserved for SAMs with no specified SO tag or an explicit "unknown" annotation?

If so, then our implementation should be consistent with that, and we should raise an error or otherwise distinguish the case where the SO tag contains an invalid value

@yfarjoun
Copy link
Contributor

So the relevant part of the spec is:
image

as you can see. both unknown and unsorted are allowed, with unknown being the default. I wouldn't be sad if an unexpected value raises an exception, but if missing anything is interpreted as "unknown"

@yfarjoun
Copy link
Contributor

I think that there's an unofficial duplicate order which might be used by fgbio.... we might want to get that into the spec if we're going to cross-reference it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants