
allow protein analyses#1958

Merged
jameshadfield merged 12 commits into master from amino-acid-augur
Apr 22, 2026

Conversation

@jameshadfield
Member

@jameshadfield jameshadfield commented Feb 9, 2026

This allows protein sequences to be analysed in a typical augur pipeline (align, tree, refine, ancestral, traits, export).

I've created this development workflow in seasonal-flu to run real-life AA-only analyses for testing purposes. I have a PR on docs.nextstrain.org with a guide on how to run such analyses.

See commit messages for details

This PR is on top of #1975

This will close #820 but needs to be paired with new docs and a new auspice version

To Do

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Mar 31, 2026
@jameshadfield jameshadfield changed the base branch from master to james/ancestral-bug-fixes April 1, 2026 03:08
Comment thread augur/validate_export.py Outdated
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 1, 2026
@jameshadfield jameshadfield changed the title [WIP] allow protein analyses allow protein analyses Apr 1, 2026
@codecov

codecov Bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.11%. Comparing base (5f616ee) to head (4280dc2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| augur/ancestral.py | 80.21% | 9 Missing and 9 partials ⚠️ |
| augur/filter/_run.py | 20.00% | 2 Missing and 2 partials ⚠️ |
| augur/refine.py | 76.92% | 2 Missing and 1 partial ⚠️ |
| augur/filter/validate_arguments.py | 0.00% | 1 Missing and 1 partial ⚠️ |
| augur/subsample.py | 86.66% | 1 Missing and 1 partial ⚠️ |
| augur/index.py | 95.65% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1958      +/-   ##
==========================================
+ Coverage   75.06%   75.11%   +0.05%     
==========================================
  Files          83       83              
  Lines        9708     9810     +102     
  Branches     1969     1998      +29     
==========================================
+ Hits         7287     7369      +82     
- Misses       2084     2096      +12     
- Partials      337      345       +8     


jameshadfield added a commit to nextstrain/auspice that referenced this pull request Apr 1, 2026
Follows the work done in Augur <nextstrain/augur#1958>
which relaxes the JSON schema to allow annotations without a `nuc` entry.
We create a placeholder one in Auspice which spans the observed CDS (nuc) ranges
so that the entropy rendering code can still operate.
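
A rough sketch of the placeholder idea described in this commit message, written in Python for consistency with Augur even though Auspice itself is JavaScript. The function name `placeholder_nuc`, the dictionary layout, and the gene names and coordinates are all assumptions made for illustration; this is not Auspice's actual implementation.

```python
# Hypothetical helper: illustrates the idea only, not Auspice's real code.
def placeholder_nuc(annotations):
    """Build a stand-in `nuc` entry covering every CDS range in `annotations`."""
    cds = [entry for name, entry in annotations.items() if name != "nuc"]
    if not cds:
        raise ValueError("need at least one CDS annotation")
    return {
        "start": min(entry["start"] for entry in cds),
        "end": max(entry["end"] for entry in cds),
        "strand": "+",
    }


# Made-up coordinates for two genes, with no `nuc` entry supplied.
annotations = {
    "HA1": {"start": 1, "end": 987, "strand": "+"},
    "HA2": {"start": 988, "end": 1650, "strand": "+"},
}
print(placeholder_nuc(annotations))
```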
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 2, 2026
Contributor

@joverlee521 joverlee521 left a comment


Left general design questions, but the only blocking comment is to verify single gene vs multiple genes defined in a single file.

Comment thread augur/data/schema-annotations.json
Comment thread augur/refine.py Outdated
Comment thread augur/ancestral.py Outdated
Comment thread augur/ancestral.py Outdated
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 7, 2026
@jameshadfield jameshadfield force-pushed the james/ancestral-bug-fixes branch from a0c8197 to d1d7522 Compare April 14, 2026 02:58
Base automatically changed from james/ancestral-bug-fixes to master April 14, 2026 03:19
jameshadfield added a commit that referenced this pull request Apr 14, 2026
Review feedback by @joverlee521
<#1958 (comment)>

Using an explicit argument when the root sequence represents proteins
improves both code clarity and (self) documentation.
Allows for protein-only analyses to be exportable. Note that Auspice PR
<nextstrain/auspice#2040> is needed for the entropy
panel to display for such datasets.
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 14, 2026
@jameshadfield
Member Author

Rebased onto master which now includes #1975 and #1962

Contributor

@joverlee521 joverlee521 left a comment


Thanks for going through the other augur commands (sorry I opened the rabbit hole)! My main concern is that auto-detecting sequence types in augur index could change the behavior of the sequence filters.

Comment thread augur/index.py Outdated
Comment thread tests/functional/filter/cram/filter-no-sequence-index-error.t Outdated
Comment thread augur/filter/__init__.py
sequence_filter_group.add_argument('--min-length', type=int, help=descriptions['min_length'])
sequence_filter_group.add_argument('--max-length', type=int, help=descriptions['max_length'])
sequence_filter_group.add_argument('--non-nucleotide', action='store_true', help=descriptions['non_nucleotide'])
sequence_filter_group.add_argument('--exclude-invalid', action='store_true', help=descriptions['exclude_invalid'])
Contributor


Related to my other comment for index: thoughts on adding a --seq-type option here to be able to override auto-detect?

If I somehow have a bad nucleotide sequence "AAAETCGG", I'd expect it to be filtered out by --exclude-invalid, but in this case it would be auto-detected as "aa" and pass the filter.
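
To make the concern concrete, here is a minimal sketch of how an alphabet-based auto-detect could let this sequence through. The alphabets and the functions `guess_seq_type` and `has_invalid_characters` are hypothetical stand-ins, not Augur's actual detection or filtering code.

```python
# Hypothetical alphabets and helpers; not Augur's real implementation.
NUC_ALPHABET = set("ACGTURYSWKMBDHVN-?")            # bases plus IUPAC ambiguity codes
AA_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYBZJXUO*-?")  # residues plus ambiguity codes


def guess_seq_type(sequence):
    """Guess 'nuc' only if every character is a valid nucleotide code."""
    return "nuc" if set(sequence.upper()) <= NUC_ALPHABET else "aa"


def has_invalid_characters(sequence, seq_type=None):
    """Check a sequence against the alphabet of the given (or guessed) type."""
    if seq_type is None:
        seq_type = guess_seq_type(sequence)
    alphabet = NUC_ALPHABET if seq_type == "nuc" else AA_ALPHABET
    return not set(sequence.upper()) <= alphabet


bad = "AAAETCGG"
print(guess_seq_type(bad))                 # the stray "E" tips the guess to "aa"
print(has_invalid_characters(bad))         # auto-detect: looks like a valid protein
print(has_invalid_characters(bad, "nuc"))  # explicit type: flagged as invalid
```

With an explicit sequence type the stray "E" is caught; with auto-detect it silently changes how the whole sequence is interpreted.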

Contributor


Pushed up 00d5223 to show example Cram test that passes on master, but not on this branch.

Member Author


If we go with --seq-type I'll change refine's proposed --aa flag to --seq-type aa, yeah?

Contributor


Oh huh, yeah probably best to match the options across commands. So I guess there are two choices here:

  1. Keep auto-detect for index/filter. All commands support --seq-type aa/nuc to override.
  2. No auto-detect at all. All commands default to nuc and support the --aa flag to switch to aa.

We could also add a new envvar AUGUR_SEQ_TYPE to set the seq type across multiple commands.

Member Author


As discussed in dev chat, moved to --seq-type throughout. We can add AUGUR_SEQ_TYPE later as desired.
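
For illustration, a minimal sketch of how an explicit `--seq-type` option with a possible `AUGUR_SEQ_TYPE` fallback might be wired using `argparse`. The parser name, `build_parser`, and `default_seq_type` are hypothetical, and the envvar was only floated in this thread as a later addition, so treat this as one possible shape rather than what Augur does.

```python
import argparse
import os

SEQ_TYPES = ("nuc", "aa")


def default_seq_type(env):
    """Read the (hypothetical) AUGUR_SEQ_TYPE envvar, defaulting to 'nuc'."""
    value = env.get("AUGUR_SEQ_TYPE", "nuc")
    # argparse does not check defaults against `choices`, so validate here.
    if value not in SEQ_TYPES:
        raise ValueError(f"AUGUR_SEQ_TYPE must be one of {SEQ_TYPES}, got {value!r}")
    return value


def build_parser(env=None):
    """Build an example parser; `env` is injectable to keep the sketch testable."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="example-command")
    parser.add_argument(
        "--seq-type",
        choices=SEQ_TYPES,
        default=default_seq_type(env),
        help="type of the input sequences (default: %(default)s)",
    )
    return parser


# The command-line value takes precedence over the environment.
args = build_parser(env={"AUGUR_SEQ_TYPE": "aa"}).parse_args(["--seq-type", "nuc"])
print(args.seq_type)
```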

Comment thread augur/filter/arguments.py Outdated
Comment thread augur/subsample.py Outdated
Comment thread augur/subsample.py Outdated
Comment thread CHANGES.md Outdated
TreeTime already supported (one!) AA model, so all that was required on
the Augur side was adding a "--seq-type" arg and changing the TreeTime
parameterisation accordingly.
Adds a fair bit of code complexity but one step closer to augur
pipelines which are protein-only. The actual reconstruction is
unchanged, as we were already doing independent nuc / aa
reconstructions.

There are some rough edges to be tidied up in subsequent commits:

1. The annotation parsing requires the file to define the nucleotide
   coordinates, which we don't need for a protein-only analysis.
   Allowing the nucleotide definition to be optional is somewhat
   complicated, and most annotations files will have them anyways.
   However if we're only reconstructing a single gene it'd be really
   nice to make the annotations file optional; we can make up dummy nuc
   coordinates for the gene (e.g. 1..3*aa_len) easily.
2. Allow a (AA) root-sequence to be provided. There are different ways
   to do this, but the simplest would be to allow them to be provided
   analogously to the ``--translations`` argument.
(if we are only reconstructing amino-acid sequences). For a typical
single-gene amino-acid analysis it would often be frustrating to need to
create a genemap when we can create dummy nucleotide coordinates to
represent the gene and the interpretation of the result will be just as
valid.
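
A minimal sketch of the dummy-coordinate idea from these commit messages, using the `1..3*aa_len` range mentioned above. The function name `dummy_nuc_coordinates` and the returned keys are illustrative assumptions, not the code in `augur/ancestral.py`.

```python
# Hypothetical helper; the key names are illustrative assumptions.
def dummy_nuc_coordinates(aa_length):
    """Return 1-based, inclusive placeholder coordinates for a single gene.

    Each amino acid is treated as one codon, so the gene spans 1..3*aa_length.
    """
    if aa_length < 1:
        raise ValueError("aa_length must be a positive integer")
    return {"start": 1, "end": 3 * aa_length, "strand": "+"}


print(dummy_nuc_coordinates(566))
```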
Review feedback by @joverlee521
<#1958 (comment)>

Using an explicit argument when the root sequence represents proteins
improves both code clarity and (self) documentation.
jameshadfield added a commit that referenced this pull request Apr 20, 2026
This was motivated by the desire to allow sequence-based filters for
`augur filter`, which will require the index to be run for AA sequences.
The choice to use an explicit `--seq-type` argument came from PR
discussion <#1958 (comment)>
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 20, 2026
jameshadfield and others added 4 commits April 20, 2026 14:11
This was motivated by the desire to allow sequence-based filters for
`augur filter`, which will require the index to be run for AA sequences.
The choice to use an explicit `--seq-type` argument came from PR
discussion <#1958 (comment)>
explicitly to check behaviour of whether non-A/T/G/C characters
count towards the length (they don't - good!)
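
As a small illustration of the behaviour this test checks, the sketch below counts only A/T/G/C towards the length of a nucleotide sequence. The function `acgt_length` is a hypothetical stand-in and is not taken from the Augur codebase.

```python
# Hypothetical helper mirroring the behaviour described in the commit message.
def acgt_length(sequence):
    """Count only unambiguous bases; N, gaps and other characters are ignored."""
    upper = sequence.upper()
    return sum(upper.count(base) for base in "ACGT")


print(acgt_length("ACGTNN--acgt"))
```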
The only change for nucleotide sequences (i.e. every existing usage)
is the argument change of '--non-nucleotide' to '--exclude-invalid'.
(I think the new argument is more self-explanatory!)
@jameshadfield jameshadfield merged commit d3e0489 into master Apr 22, 2026
35 checks passed
@jameshadfield jameshadfield deleted the amino-acid-augur branch April 22, 2026 01:05

Development

Successfully merging this pull request may close these issues.

Support protein (AA) analyses

3 participants