allow protein analyses by jameshadfield · Pull Request #1958 · nextstrain/augur

jameshadfield · 2026-02-09T02:03:04Z

This allows protein sequences to be analysed in a typical augur pipeline (align, tree, refine, ancestral, traits, export)

I've created this development workflow in seasonal-flu to run real-life AA-only analyses for testing purposes. I have a PR on docs.nextstrain.org with a guide on how to run such analyses.

See commit messages for details

This PR is on top of #1975

This will close #820 but needs to be paired with new docs and a new auspice version

To Do

Wait until neherlab/treetime#523 ("fill overhangs w X not N for aa seqs", #522) is merged and a new TreeTime release is made, then enforce a minimum TreeTime version for protein analyses (not actually needed: our ancestral reconstruction is being done with JC69(alphabet=aa))
changelog
coordinate release with docs PR & auspice PR
fix failing tests (failing test is unrelated broken link)

Augur PR <nextstrain/augur#1958> Auspice PR TODO XXX

codecov · 2026-04-01T03:56:46Z

Codecov Report

❌ Patch coverage is 83.33333% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.11%. Comparing base (5f616ee) to head (4280dc2).

Files with missing lines	Patch %	Lines
augur/ancestral.py	80.21%	9 Missing and 9 partials ⚠️
augur/filter/_run.py	20.00%	2 Missing and 2 partials ⚠️
augur/refine.py	76.92%	2 Missing and 1 partial ⚠️
augur/filter/validate_arguments.py	0.00%	1 Missing and 1 partial ⚠️
augur/subsample.py	86.66%	1 Missing and 1 partial ⚠️
augur/index.py	95.65%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1958      +/-   ##
==========================================
+ Coverage   75.06%   75.11%   +0.05%     
==========================================
  Files          83       83              
  Lines        9708     9810     +102     
  Branches     1969     1998      +29     
==========================================
+ Hits         7287     7369      +82     
- Misses       2084     2096      +12     
- Partials      337      345       +8

Follows the work done in Augur <nextstrain/augur#1958> which relaxes the JSON schema to allow annotations without a `nuc` entry. We create a placeholder one in Auspice which spans the observed CDS (nuc) ranges so that the entropy rendering code can still operate.

Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>

joverlee521

Left general design questions, but the only blocking comment is to verify single gene vs multiple genes defined in a single file.

@joverlee521

Review feedback by @joverlee521 <#1958 (comment)> Using an explicit argument when the root sequence represents proteins improves both code clarity and (self) documentation.

Allows for protein-only analyses to be exportable. Note that Auspice PR <nextstrain/auspice#2040> is needed for the entropy panel to display for such datasets.

jameshadfield · 2026-04-14T09:25:46Z

Rebased onto master which now includes #1975 and #1962

joverlee521

Thanks for going through the other augur commands (sorry I opened the rabbit hole)! My main concern is the auto-detect of sequence types in augur index can potentially change behavior with the sequence filters.

joverlee521 · 2026-04-14T19:29:31Z

    sequence_filter_group.add_argument('--min-length', type=int, help=descriptions['min_length'])
    sequence_filter_group.add_argument('--max-length', type=int, help=descriptions['max_length'])
-    sequence_filter_group.add_argument('--non-nucleotide', action='store_true', help=descriptions['non_nucleotide'])
+    sequence_filter_group.add_argument('--exclude-invalid', action='store_true', help=descriptions['exclude_invalid'])


Related to my other comment for index, thoughts on adding a --seq-type option here to be able to override auto-detect?

If I somehow have a bad nucleotide sequence "AAAETCGG", I'd expect it to be filtered out by --exclude-invalid, but in this case it would be auto-detected as "aa" and pass the filter.

Pushed up 00d5223 to show an example Cram test that passes on master, but not on this branch.
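
To make the concern above concrete, here is a minimal sketch of how alphabet-based auto-detection could let a corrupted nucleotide sequence through. The alphabets and the `guess_seq_type` / `is_valid` helpers are hypothetical illustrations, not Augur's actual implementation.

```python
# Hypothetical alphabets for illustration only (not taken from Augur).
NUC_ALPHABET = set("ACGTURYSWKMBDHVN-")              # IUPAC nucleotide codes plus gap
AA_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYBZJXUO*-")    # amino-acid codes plus stop and gap


def guess_seq_type(seq):
    """Guess 'nuc' if every character is a nucleotide code, otherwise 'aa'."""
    return "nuc" if set(seq.upper()) <= NUC_ALPHABET else "aa"


def is_valid(seq, seq_type=None):
    """Check a sequence against the alphabet for seq_type (auto-detected if None)."""
    seq_type = seq_type or guess_seq_type(seq)
    alphabet = NUC_ALPHABET if seq_type == "nuc" else AA_ALPHABET
    return set(seq.upper()) <= alphabet


# 'E' is not a nucleotide code, so auto-detection falls back to 'aa',
# where every character happens to be a valid amino acid.
print(is_valid("AAAETCGG"))          # auto-detect: treated as protein
print(is_valid("AAAETCGG", "nuc"))   # explicit type: flagged as invalid
```

With an explicit sequence type the bad sequence is rejected, which is the behaviour the `--exclude-invalid` filter would be expected to have for nucleotide data.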

If we go with --seq-type I'll change refine's proposed --aa flag to --seq-type aa, yeah?

Oh huh, yeah probably best to match the options across commands. So I guess there are two choices here:

1. Keep auto-detect for index/filter. All commands support --seq-type aa/nuc to override.

2. No auto-detect at all. All commands default to nuc and support the --aa flag to switch to aa.

We could also add a new envvar AUGUR_SEQ_TYPE to set the seq type across multiple commands.

As discussed in dev chat, moved to --seq-type throughout. We can add AUGUR_SEQ_TYPE later as desired.
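
A rough sketch of what a shared `--seq-type` option could look like with `argparse`. The helper name `add_seq_type_argument` is hypothetical, and the `AUGUR_SEQ_TYPE` fallback is only the idea floated above, not something this PR implements.

```python
import argparse
import os

SEQ_TYPES = ("nuc", "aa")


def add_seq_type_argument(parser, env=None):
    """Register --seq-type on a parser (hypothetical helper, for illustration).

    The default is 'nuc'. Reading AUGUR_SEQ_TYPE from the environment is an
    assumption about how the proposed envvar might work.
    """
    env = os.environ if env is None else env
    default = env.get("AUGUR_SEQ_TYPE", "nuc")
    # argparse does not check defaults against `choices`, so validate here.
    if default not in SEQ_TYPES:
        raise ValueError(f"AUGUR_SEQ_TYPE must be one of {SEQ_TYPES}, got {default!r}")
    parser.add_argument(
        "--seq-type",
        choices=SEQ_TYPES,
        default=default,
        help="type of the input sequences (default: %(default)s)",
    )
    return parser


parser = add_seq_type_argument(argparse.ArgumentParser(prog="example"), env={})
print(parser.parse_args([]).seq_type)
print(parser.parse_args(["--seq-type", "aa"]).seq_type)
```

Sharing one helper across commands would keep the option name, choices and default consistent, which is the point of matching the options across commands.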

TreeTime already supported (one!) AA model, so all that was required on the Augur side was adding an "--seq-type" arg and changing the TreeTime parameterisation accordingly.

Adds a fair bit of code complexity, but brings us one step closer to protein-only augur pipelines. The actual reconstruction is unchanged, as we were already doing independent nuc / aa reconstructions. There are some rough edges to be tidied up in subsequent commits:

1. The annotation parsing requires the file to define the nucleotide coordinates, which we don't need for a protein-only analysis. Allowing the nucleotide definition to be optional is somewhat complicated, and most annotation files will have them anyway. However, if we're only reconstructing a single gene it'd be really nice to make the annotations file optional; we can easily make up dummy nuc coordinates for the gene (e.g. 1..3*aa_len).

2. Allow an (AA) root sequence to be provided. There are different ways to do this, but the simplest would be to allow them to be provided analogously to the ``--translations`` argument.

(if we are only reconstructing amino-acid sequences). For a typical single-gene amino-acid analysis it would often be frustrating to need to create a genemap when we can create dummy nucleotide coordinates to represent the gene and the interpretation of the results will be just as valid.
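
As a rough illustration of the dummy-coordinate idea (1..3*aa_len), the sketch below builds placeholder annotations for a single gene. The function name and the exact dictionary layout are assumptions made for this example and may not match what Augur writes.

```python
def dummy_annotations(gene_name, aa_length):
    """Build placeholder annotations for a single-gene, protein-only analysis.

    Each amino acid corresponds to three nucleotides, so a gene of
    `aa_length` residues is given the one-based, inclusive nucleotide
    range 1..3*aa_length. The layout of the returned dict is assumed.
    """
    if aa_length < 1:
        raise ValueError("aa_length must be a positive integer")
    nuc_end = 3 * aa_length
    return {
        "nuc": {"start": 1, "end": nuc_end, "strand": "+"},
        gene_name: {"start": 1, "end": nuc_end, "strand": "+"},
    }


annotations = dummy_annotations("HA", 566)
print(annotations["HA"])
```

Because the coordinates are synthetic, they only serve to place amino-acid positions on an axis; they say nothing about the gene's real position in a genome.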

This was motivated by the desire to allow sequence-based filters for `augur filter`, which will require the index to be run for AA sequences. The choice to use an explicit `--seq-type` argument came from PR discussion <#1958 (comment)>

explicitly to check behaviour of whether non-A/T/G/C characters count towards the length (they don't - good!)
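
A small sketch of the behaviour those tests check for nucleotide sequences, assuming the length used by the filters is simply the number of A, C, G and T characters. The helper names are hypothetical, and how lengths are counted for AA sequences is not covered here.

```python
def acgt_length(seq):
    """Count only unambiguous bases; N, gaps and other codes are ignored."""
    seq = seq.upper()
    return sum(seq.count(base) for base in "ACGT")


def passes_min_length(seq, min_length):
    """Return True if the sequence has at least `min_length` A/C/G/T characters."""
    return acgt_length(seq) >= min_length


print(acgt_length("ACGTNN--"))           # only the four unambiguous bases count
print(passes_min_length("ACGTNN--", 5))  # fails despite being 8 characters long
```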

The only change for nucleotide sequences (i.e. every existing usage) is the argument change of '--non-nucleotide' to '--exclude-invalid'. (I think the new argument is more self-explanatory!)

Deprecates the 'non_nucleotide' key (see parent commit)

This was referenced Mar 22, 2026

Support protein (AA) analyses #820

Closed

Ancestral reconstruction bug fixes #1975

Merged

jameshadfield force-pushed the amino-acid-augur branch from 4ca0213 to 4786af7 Compare March 27, 2026 02:24

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Mar 31, 2026

WIP Guide for AA-only workflows

c16af91

Augur PR <nextstrain/augur#1958> Auspice PR TODO XXX

jameshadfield force-pushed the amino-acid-augur branch from 4786af7 to 242ced3 Compare April 1, 2026 03:03

jameshadfield changed the base branch from master to james/ancestral-bug-fixes April 1, 2026 03:08

jameshadfield commented Apr 1, 2026

Comment thread augur/validate_export.py Outdated

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 1, 2026

WIP Guide for AA-only workflows

1833cd6

Augur PR <nextstrain/augur#1958> Auspice PR TODO XXX

jameshadfield mentioned this pull request Apr 1, 2026

Guide for AA-only workflows nextstrain/docs.nextstrain.org#270

Open

jameshadfield changed the title ~~[WIP] allow protein analyses~~ allow protein analyses Apr 1, 2026

jameshadfield force-pushed the amino-acid-augur branch from 242ced3 to 889a7a3 Compare April 1, 2026 03:37

jameshadfield mentioned this pull request Apr 1, 2026

Allow protein-only datasets nextstrain/auspice#2040

Merged

3 tasks

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 2, 2026

Guide for AA-only workflows

f1f421f

Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>

joverlee521 requested changes Apr 3, 2026

Comment thread augur/data/schema-annotations.json

Comment thread augur/refine.py Outdated

Comment thread augur/ancestral.py Outdated

Comment thread augur/ancestral.py Outdated

jameshadfield mentioned this pull request Apr 7, 2026

[refine] ? allow substitution model selection #1982

Open

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 7, 2026

Guide for AA-only workflows

82e57ae

Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>

jameshadfield force-pushed the james/ancestral-bug-fixes branch from a0c8197 to d1d7522 Compare April 14, 2026 02:58

Base automatically changed from james/ancestral-bug-fixes to master April 14, 2026 03:19

jameshadfield added a commit that referenced this pull request Apr 14, 2026

[ancestral] --aa-root-sequence arg

5b7f524

Review feedback by @joverlee521 <#1958 (comment)> Using an explicit argument when the root sequence represents proteins improves both code clarity and (self) documentation.

jameshadfield force-pushed the amino-acid-augur branch from 889a7a3 to 72dad75 Compare April 14, 2026 03:21

[export] Allow annotations without nucleotides

e4460fd

Allows for protein-only analyses to be exportable. Note that Auspice PR <nextstrain/auspice#2040> is needed for the entropy panel to display for such datasets.

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 14, 2026

Guide for AA-only workflows

c39ed22

Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>

jameshadfield added a commit that referenced this pull request Apr 14, 2026

[ancestral] --aa-root-sequence arg

a0dc8e0

Review feedback by @joverlee521 <#1958 (comment)> Using an explicit argument when the root sequence represents proteins improves both code clarity and (self) documentation.

jameshadfield force-pushed the amino-acid-augur branch from 72dad75 to db37bb1 Compare April 14, 2026 09:21

joverlee521 requested changes Apr 14, 2026

victorlin reviewed Apr 15, 2026

Comment thread augur/filter/arguments.py Outdated

Comment thread augur/subsample.py Outdated

Comment thread augur/subsample.py Outdated

Comment thread CHANGES.md Outdated

[refine] Allow AA alignments (JTT92 model)

c8b3591

TreeTime already supported (one!) AA model, so all that was required on the Augur side was adding an "--seq-type" arg and changing the TreeTime parameterisation accordingly.

jameshadfield added 3 commits April 20, 2026 12:58

[ancestral] --aa-root-sequence arg

70dc188

Review feedback by @joverlee521 <#1958 (comment)> Using an explicit argument when the root sequence represents proteins improves both code clarity and (self) documentation.

jameshadfield force-pushed the amino-acid-augur branch from db37bb1 to e6a346c Compare April 20, 2026 01:33

jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Apr 20, 2026

Guide for AA-only workflows

644c70a

Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>

jameshadfield force-pushed the amino-acid-augur branch from e6a346c to e93b48e Compare April 20, 2026 01:39

jameshadfield and others added 4 commits April 20, 2026 14:11

[index] Allow AA sequences to be indexed

b0d3ab3

This was motivated by the desire to allow sequence-based filters for `augur filter`, which will require the index to be run for AA sequences. The choice to use an explicit `--seq-type` argument came from PR discussion <#1958 (comment)>

Add Cram test for --non-nucleotide

3a5e780

[filter] add sequence length tests

c27e2fb

explicitly to check behaviour of whether non-A/T/G/C characters count towards the length (they don't - good!)

[filter] adapt sequence-filters for AA seqs

6b9273c

The only change for nucleotide sequences (i.e. every existing usage) is the argument change of '--non-nucleotide' to '--exclude-invalid'. (I think the new argument is more self-explanatory!)

jameshadfield force-pushed the amino-acid-augur branch from e93b48e to 4280dc2 Compare April 20, 2026 02:12

jameshadfield added 3 commits April 21, 2026 08:47

[subsample] update sample options

2d61ee0

Deprecates the 'non_nucleotide' key (see parent commit)

[subsample] add --seq-type argument

0ec3580

changelog

e6a6972

jameshadfield force-pushed the amino-acid-augur branch from 4280dc2 to e6a6972 Compare April 20, 2026 20:49

jameshadfield merged commit d3e0489 into master Apr 22, 2026
35 checks passed

jameshadfield deleted the amino-acid-augur branch April 22, 2026 01:05
