WIP [refine] unique internal node names#1451
Draft
jameshadfield wants to merge 3 commits intojames/export-multitreefrom
Draft
WIP [refine] unique internal node names#1451jameshadfield wants to merge 3 commits intojames/export-multitreefrom
jameshadfield wants to merge 3 commits intojames/export-multitreefrom
Conversation
Checking for duplicated node names and missing node names is in line with the schema. Previously some calls to `export v2` would be ok with missing node names (e.g. see the updated tests in `minify-output.t`) but any usage with metadata would result in an uncaught error.
Multiple trees ("subtrees") have been available in Auspice since late
2021¹ and part of the associated schema since early 2022². Despite this
there was no way to produce such datasets within Augur itself, and
despite the schema changes the associated `augur validate` command was
never updated to allow them.
This commit adds multi-tree inputs to `augur export v2` as well as
allowing them to validate with our associated validation commands.
¹ <nextstrain/auspice#1442>
² <#851>
Needed for pipelines which will produce multiple trees via `augur refine` and then supply these trees to `augur export v2`
tsibley
reviewed
Apr 24, 2024
Comment on lines
+353
to
+356
| id = hashlib.sha256("".join(terminals).encode('utf-8')).hexdigest()[0:7] | ||
| def rename(name): | ||
| if name not in internals: return name | ||
| return f"NODE_{id}_{name.split('_')[1]}" |
Contributor
There was a problem hiding this comment.
The hashing of the terminals of the tree feels arbitrary and not driven by the actual properties of the hash, esp. given the truncation of it. For example, two trees with the same terminals but different structures will collide. Why not simply produce unique ids for each node instead? (I still think we should do that.)
Member
Author
There was a problem hiding this comment.
For example, two trees with the same terminals but different structures will collide.
This functionality is motivated by multi-trees where terminal node names cannot be shared across trees.
ddba55a to
c40b821
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This builds on #1450 to rename the internal nodes of trees to be unique when comparing against other trees. For a pipeline which produces multiple trees and exports them together it's necessary to have unique internal nodes.
Ideally this code would be upstream in TreeTime, but it's ok within refine IMO.
Not intending to merge this PR as is - but i'm actively using it in pipelines which use multi-trees and so putting it here for 👀 and discussion.