Saving the "executable" not just the mapping #30

Open
afmagee42 opened this issue Nov 22, 2024 · 2 comments

Comments

@afmagee42
Collaborator

On the call today we talked a little bit about pickling. The trick with pickling anything other than basic Python objects is that you can run into versioning trouble, IIRC. A problem for another time.

Originally posted by @swo in #24 (comment)

@afmagee42
Collaborator Author

I think I'm starting to get tangled in what all we want out of different sorts of reproducibility, and what that implies for what we want to do here. This is me thinking out loud about it.

I think we should establish a divide between exact reproducibility and portability.

  • Exact reproducibility is what we achieve by storing the JSON version of an Aggregation. We can recode the exact same count data the exact same way.
  • Portability is about results which are in some sense "comparable." I don't think this makes sense unless we talk about fixed-target aggregation (if one aggregates based on lineage abundance, of course one will get different results for different times/places). That is, we have to fix in advance the question of "what do we aggregate to?"
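A minimal sketch of the exact-reproducibility case, assuming an Aggregation can be reduced to a plain input-taxon to aggregated-taxon dictionary. The `mapping` contents and the `recode` helper are hypothetical illustrations, not cladecombiner's API:

```python
import json
from collections import Counter

# Hypothetical stand-in for a stored Aggregation: input taxon -> aggregated taxon.
mapping = {
    "EXAMPLE.1": "EXAMPLE.1",
    "EXAMPLE.1.2": "EXAMPLE.1",
    "EXAMPLE.2.2": "other",
}


def recode(counts, mapping):
    """Sum the counts of input taxa into their aggregated taxa."""
    out = Counter()
    for taxon, n in counts.items():
        # Raises KeyError if the taxon was not part of the stored mapping.
        out[mapping[taxon]] += n
    return dict(out)


# Round-trip the mapping through JSON, as if it had been written to disk.
stored = json.dumps(mapping, sort_keys=True)
restored = json.loads(stored)

counts = {"EXAMPLE.1": 5, "EXAMPLE.1.2": 3, "EXAMPLE.2.2": 2}
result = recode(counts, restored)
```

The restored mapping recodes the same counts identically, but it says nothing about a taxon that was absent when the mapping was made; that gap is the portability question.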

I think there are actually two layers to portability.

  1. Would the same sequence (whether or not we ran it at the time) end up mapped to the same aggregated taxon?
  2. Would the same input taxon (whether or not we ran it at the time) end up mapped to the same aggregated taxon?

1: sequence-level portability

Note that the sequence -> tip taxon link is explicitly and purposefully out of cladecombiner's scope. Nevertheless, this is worth considering as an exercise.

The only way to ensure sequence-level portability starts by taking the cladetime approach and re-running the sequences through the appropriately versioned (i.e., identical) lineage assignment tool. This means that future evolution is not a "problem," and we only have to ensure that the same set of all possible input taxa on that day (tip taxa that were assignable) end up in the same aggregated taxa.

2: taxon-level portability

This second question is within cladecombiner's scope. The mapping will change if the tree changes, or if some part of the decision-making process changes.

The relationships between taxa within an alias are fixed. The tree for EXAMPLE.1, EXAMPLE.1.2, and EXAMPLE.2.2 is not going to change. But how a taxon de-aliases can change if the alias key changes. This makes me think I may have mis-classified #27 as an exact reproducibility issue when in fact it's about portability. Nevertheless, we definitely want to be able to track that.
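To illustrate why the alias key matters, here is a small sketch assuming an alias key is a dictionary from alias prefix to its full (de-aliased) name. The `dealias` function and both keys are hypothetical and only meant to show the dependence:

```python
def dealias(name, alias_key):
    """Expand the leading alias of a taxon name using the given alias key."""
    prefix, _, rest = name.partition(".")
    # An unknown prefix, or one mapped to an empty string, is kept as-is.
    full = alias_key.get(prefix) or prefix
    return f"{full}.{rest}" if rest else full


# Two hypothetical versions of the alias key.
key_old = {"EXAMPLE": "ROOT.1.1.1"}
key_new = {"EXAMPLE": "ROOT.1.1.2"}

old_name = dealias("EXAMPLE.1.2", key_old)
new_name = dealias("EXAMPLE.1.2", key_new)
```

The relationship between EXAMPLE.1 and EXAMPLE.1.2 is the same under both keys, but where the whole group attaches to the larger tree is not, so the key in use has to be recorded alongside the mapping.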

Changes to cladecombiner source code could change either the tree (via bugs, how recombinants are handled, or possibly other things I'm not currently seeing) or how it makes mapping decisions given a tree. So tracking versioning information of cladecombiner itself is also going to be important.
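One way this tracking could look, as a sketch only: store the package version and a hash of the alias key next to the mapping. The `provenance` and `fingerprint` helpers and the record layout are assumptions for illustration; `importlib.metadata.version` is the standard-library call for reading an installed package's version.

```python
import hashlib
import json
from importlib.metadata import PackageNotFoundError, version


def fingerprint(obj):
    """Stable SHA-256 of a JSON-serializable object (keys sorted)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def provenance(mapping, alias_key, package="cladecombiner"):
    """Bundle a mapping with what would be needed to audit it later."""
    try:
        pkg_version = version(package)
    except PackageNotFoundError:
        pkg_version = None  # package not installed in this environment
    return {
        "package": package,
        "package_version": pkg_version,
        "alias_key_sha256": fingerprint(alias_key),
        "mapping": mapping,
    }


record = provenance(
    mapping={"EXAMPLE.1.2": "EXAMPLE.1"},
    alias_key={"EXAMPLE": "ROOT.1.1.1"},  # hypothetical alias key
)
```

Comparing `alias_key_sha256` or `package_version` between two records would flag that their mappings may not be portable, without having to pickle any objects.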

3: approximate sequence-level portability

Via something like #9, we could approximate sequence-level portability. That is, if on future date F we call lineage FUTURE.1.2, which has no alias known on past date P, we can:

  1. Collapse all such taxa without aliases at time P into the taxa that did have aliases then, using the alias map at time F.
  2. Apply taxon-level portability to the results.
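The two steps above can be sketched as follows. This assumes de-aliased names encode ancestry by dot-separated prefixes, that we have the set of names known at time P, and that the time-P mapping is a plain dictionary; all names, data, and helpers here are hypothetical:

```python
def dealias(name, alias_key):
    """Expand the leading alias of a taxon name using the given alias key."""
    prefix, _, rest = name.partition(".")
    full = alias_key.get(prefix) or prefix
    return f"{full}.{rest}" if rest else full


def collapse_to_known(full_name, known_at_p):
    """Walk up the dotted name until reaching a taxon known at time P."""
    parts = full_name.split(".")
    while parts:
        candidate = ".".join(parts)
        if candidate in known_at_p:
            return candidate
        parts.pop()  # drop the last level and try the parent
    return None  # no ancestor was known at time P


def port_taxon(name, alias_key_f, known_at_p, mapping_p):
    """Step 1: collapse via the time-F alias key. Step 2: apply the time-P mapping."""
    ancestor = collapse_to_known(dealias(name, alias_key_f), known_at_p)
    return None if ancestor is None else mapping_p[ancestor]


# Hypothetical data: FUTURE was designated after P as an alias under PAST.3.1.
alias_key_f = {"FUTURE": "PAST.3.1.1"}
known_at_p = {"PAST", "PAST.3", "PAST.3.1"}
mapping_p = {"PAST": "other", "PAST.3": "PAST.3", "PAST.3.1": "PAST.3"}

ported = port_taxon("FUTURE.1.2", alias_key_f, known_at_p, mapping_p)
```

Here FUTURE.1.2 collapses to PAST.3.1 and then maps to PAST.3. Treating the nearest known ancestor as the right stand-in is itself one of the assumptions that would need spelling out.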

There are a number of assumptions being made here, which I'm not currently crystal clear on, but which we would need to spell out if trying this.

Taxon recognition

So far we have ignored the issue of whether a particular taxon is recognized, i.e., in https://github.com/cov-lineages/pango-designation/blob/master/lineages.csv. That is, a taxon that is considered valid today, and so could show up in data, might no longer be considered valid (or show up in data) next month. This is not an issue for true sequence-level portability (the taxon would be valid under the correct assigner) or taxon-level portability (it's out of scope) but would be for approximate sequence-level portability. I'm not sure there's anything to do about it, but if we pursue this we could perhaps warn users (check for demotions).
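A demotion check could be as simple as a set difference between the taxa recognized on two dates. This sketch assumes the recognized names have already been extracted into plain collections; `find_demotions` and `warn_on_demotions` are hypothetical helpers:

```python
import warnings


def find_demotions(recognized_then, recognized_now):
    """Taxa that were recognized at the earlier date but no longer are."""
    return sorted(set(recognized_then) - set(recognized_now))


def warn_on_demotions(recognized_then, recognized_now):
    """Warn the user if any taxon has been demoted between the two dates."""
    demoted = find_demotions(recognized_then, recognized_now)
    if demoted:
        warnings.warn(f"Taxa no longer recognized: {', '.join(demoted)}")
    return demoted


demoted = find_demotions(
    {"EXAMPLE.1", "EXAMPLE.1.2", "EXAMPLE.2.2"},
    {"EXAMPLE.1", "EXAMPLE.2.2"},
)
```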

@swo
Collaborator

swo commented Nov 22, 2024

To make sure I'm following, the idea here is to be able to say what modeling unit you would have mapped a particular taxon to, if that taxon was not present when you called cladecombiner in the past?
