Conversation
4ca0213 to
4786af7
Compare
Augur PR <nextstrain/augur#1958> Auspice PR TODO XXX
4786af7 to
242ced3
Compare
Augur PR <nextstrain/augur#1958> Auspice PR TODO XXX
242ced3 to
889a7a3
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1958 +/- ##
==========================================
+ Coverage 75.06% 75.11% +0.05%
==========================================
Files 83 83
Lines 9708 9810 +102
Branches 1969 1998 +29
==========================================
+ Hits 7287 7369 +82
- Misses 2084 2096 +12
- Partials 337 345 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Follows the work done in Augur <nextstrain/augur#1958> which relaxes the JSON schema to allow annotations without a `nuc` entry. We create a placeholder one in Auspice which spans the observed CDS (nuc) ranges so that the entropy rendering code can still operate.
Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>
joverlee521
left a comment
There was a problem hiding this comment.
Left general design questions, but the only blocking comment is to verify single gene vs multiple genes defined in a single file.
Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>
a0c8197 to
d1d7522
Compare
Review feedback by @joverlee521 <#1958 (comment)> Using an explicit argument when the root sequence represents proteins improves both code clarity and (self) documentation.
889a7a3 to
72dad75
Compare
Allows for protein-only analyses to be exportable. Note that Auspice PR <nextstrain/auspice#2040> is needed for the entropy panel to display for such datasets.
Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>
Review feedback by @joverlee521 <#1958 (comment)> Using an explicit argument when the root sequence represents proteins improves both code clarity and (self) documentation.
72dad75 to
db37bb1
Compare
joverlee521
left a comment
There was a problem hiding this comment.
Thanks for going through the other augur commands (sorry I opened the rabbit hole)! My main concern is the auto-detect of sequence types in augur index can potentially change behavior with the sequence filters.
| sequence_filter_group.add_argument('--min-length', type=int, help=descriptions['min_length']) | ||
| sequence_filter_group.add_argument('--max-length', type=int, help=descriptions['max_length']) | ||
| sequence_filter_group.add_argument('--non-nucleotide', action='store_true', help=descriptions['non_nucleotide']) | ||
| sequence_filter_group.add_argument('--exclude-invalid', action='store_true', help=descriptions['exclude_invalid']) |
There was a problem hiding this comment.
Related to my other comment for index, thoughts on adding a --seq-type option here to be able to override auto-detect.
If I somehow have a bad nucleotide sequence "AAAETCGG", I'd expect it to be filtered out by --exclude-invalid, but in this case it would auto-detected as "aa" and pass the filter.
There was a problem hiding this comment.
Pushed up 00d5223 to show example Cram test that passes on master, but not on this branch.
There was a problem hiding this comment.
If we go with --seq-type I'll change refine's proposed --aa flag to --seq-type aa, yeah?
There was a problem hiding this comment.
Oh huh, yeah probably best to match the options across commands. So I guess there's two choices here
- Keep auto-detect for index/filter. All commands support
--seq-type aa/nucto override. - No auto-detect at all. All commands default to
nucand support the--aaflag to switch toaa.
We could also add a new envvar AUGUR_SEQ_TYPE to set the seq type across multiple commands.
There was a problem hiding this comment.
As discussed in dev chat, moved to --seq-type throughout. We can add AUGUR_SEQ_TYPE later as desired.
TreeTime already supported (one!) AA model, so all that was required on the Augur side was adding an "--seq-type" arg and changing the TreeTime parameterisation accordingly.
Adds a fair bit of code complexity but one step closer to augur pipelines which are protein-only. The actual reconstruction is unchanged, as we were already doing independent nuc / aa reconstructions. There are some rough edges to be tidied up in subsequent commits: 1. The annotation parsing requires the file to define the nucleotide coordinates, which we don't need for a protein-only analysis. Allowing the nucleodide definition to be optional is somewhat complicated, and most annotations files will have them anyways. However if we're only reconstructing a single gene it'd be really nice to make the annotations file optional, we can make up dummy nuc coordinates for the gene (e.g. 1..3*aa_len) easily. 2. Allow a (AA) root-sequence to be provided. There are different ways to do this, but the simplest would be to allow them to be provided analogously to the ``--translations`` argument.
(if we are only reconstructing amino-acid sequences). For a typical single-gene amino-acid analysis it would often be frustrating to need to create a genemap when we can create dummy nucleotide coordinates to represent the gene and the interpretation of result will be just as valid.
Review feedback by @joverlee521 <#1958 (comment)> Using an explicit argument when the root sequence represents proteins improves both code clarity and (self) documentation.
This was motivated by the desire to allow sequence-based filters for `augur filter`, which will require the index to be run for AA sequences. The choice to use an explicit `--seq-type` argument came from PR discussion <#1958 (comment)>
db37bb1 to
e6a346c
Compare
Augur PR <nextstrain/augur#1958> Auspice PR <nextstrain/auspice#2040>
e6a346c to
e93b48e
Compare
This was motivated by the desire to allow sequence-based filters for `augur filter`, which will require the index to be run for AA sequences. The choice to use an explicit `--seq-type` argument came from PR discussion <#1958 (comment)>
explicitly to check behaviour of whether non-A/T/G/C characters count towards the length (they don't - good!)
The only change for nucleotide sequences (i.e. every existing usage) is the argument change of '--non-nucleotide' to '--exclude-invalid'. (I think the new argument is more self-explanatory!)
e93b48e to
4280dc2
Compare
Deprecates the 'non_nucleotide' key (see parent commit)
4280dc2 to
e6a6972
Compare
This allows protein sequences to be analysed in a typical augur pipeline (align, tree, refine, ancestral, traits, export)
I've created this development workflow in seasonal-flu to run real-life AA-only analyses for testing purposes. I have a PR on docs.nextstrain.org with a guide on how to run such analyses.
See commit messages for details
This PR is on top of #1975
This will close #820 but needs to be paired with new docs and a new auspice version
To Do
XnotNfor aa seqs (#522) neherlab/treetime#523 is merged and a new treetime release made, then enforce a minimum treetime version for protein analyses (not actually needed, our ancestral reconstruction is being done withJC69(alphabet=aa))