Saving the "executable" not just the mapping #30

afmagee42 · 2024-11-22T14:32:48Z

On the call today we talked a little bit about pickling. The trick with pickling anything other than basic Python objects is that you can run into versioning trouble, IIRC. A problem for another time.

Originally posted by @swo in #24 (comment)


afmagee42 · 2024-11-22T15:35:59Z

I think I'm starting to get tangled in what all we want out of different sorts of reproducibility, and what that implies for what we want to do here. This is me thinking out loud about it.

I think we should establish a divide between exact reproducibility and portability.

- Exact reproducibility is what we achieve by storing the JSON version of an Aggregation. We can recode the exact same count data the exact same way.
- Portability is about results which are in some sense "comparable." I don't think this makes sense unless we talk about fixed-target aggregation (if one aggregates based on lineage abundance, of course one will get different results for different times/places). That is, we have to fix in advance the question of "what do we aggregate to?"
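To make the distinction concrete, here is a minimal sketch of what storing the mapping as JSON buys us. It does not use cladecombiner's actual Aggregation class or its serialization format; the plain dict, the taxon names, and the helpers `to_json`, `from_json`, and `recode` are all hypothetical stand-ins.

```python
import json

# Hypothetical mapping from input taxon to aggregated taxon (illustrative names only).
mapping = {"EXAMPLE.1": "EXAMPLE", "EXAMPLE.1.2": "EXAMPLE", "OTHER.3": "OTHER"}


def to_json(mapping):
    """Serialize the mapping deterministically (sorted keys)."""
    return json.dumps(mapping, indent=2, sort_keys=True)


def from_json(text):
    """Recover the mapping from its JSON text."""
    return json.loads(text)


def recode(counts, mapping):
    """Sum counts per aggregated taxon using a stored mapping."""
    out = {}
    for taxon, n in counts.items():
        # A taxon absent from the stored mapping raises KeyError here.
        target = mapping[taxon]
        out[target] = out.get(target, 0) + n
    return out


restored = from_json(to_json(mapping))
totals = recode({"EXAMPLE.1": 3, "EXAMPLE.1.2": 2, "OTHER.3": 5}, restored)
```

The round trip recodes the same counts the same way, which is exact reproducibility. The `KeyError` on an unseen taxon is where portability starts: the stored mapping alone says nothing about taxa it never saw.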

I think there are actually two layers to portability.

1. Would the same sequence (whether or not we ran it at the time) end up mapped to the same aggregated taxon?
2. Would the same input taxon (whether or not we ran it at the time) end up mapped to the same aggregated taxon?

1: sequence-level portability

Note that the sequence -> tip taxon link is explicitly and purposefully out of cladecombiner's scope. Nevertheless, this is worth considering as an exercise.

The only way to ensure sequence-level portability starts by taking the cladetime approach and re-running the sequences through the appropriately versioned (i.e., identical) lineage assignment tool. This means that future evolution is not a "problem," and we only have to ensure that the same set of all possible input taxa on that day (tip taxa that were assignable) end up in the same aggregated taxa.

2: taxon-level portability

This second question is within cladecombiner's scope. The mapping will change if the tree changes, or if some part of the decision-making process changes.

The relationships between taxa within an alias are fixed. The tree for EXAMPLE.1, EXAMPLE.1.2, and EXAMPLE.2.2 is not going to change. But how a taxon de-aliases can change if the alias key changes. This makes me think I may have mis-classified #27 as an exact reproducibility issue when in fact it's about portability. Nevertheless, we definitely want to be able to track that.
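A small sketch of why the alias key matters, assuming a simplified alias key in which each alias maps to a fully expanded name and top-level aliases map to an empty string. Recombinant aliases (which have more than one parent) are ignored here. The function `dealias`, the two keys, and every taxon name are hypothetical, not cladecombiner's API.

```python
def dealias(name, alias_key):
    """Expand the leading alias of a dotted taxon name using alias_key."""
    prefix, _, suffix = name.partition(".")
    if prefix not in alias_key:
        raise KeyError(f"alias {prefix!r} is not in this alias key")
    expansion = alias_key[prefix]
    if not expansion:
        # Top-level alias: the name is already fully expanded.
        return name
    return f"{expansion}.{suffix}" if suffix else expansion


# Hypothetical alias keys on past date P and future date F.
key_p = {"ROOT": "", "EXAMPLE": "ROOT.1.1.1"}
key_f = {"ROOT": "", "EXAMPLE": "ROOT.1.1.1", "FUTURE": "ROOT.1.1.1.1.2.3"}

# Within-alias relationships do not depend on the key version.
same_under_both = dealias("EXAMPLE.1.2", key_p) == dealias("EXAMPLE.1.2", key_f)

# FUTURE.1.2 can only be placed in the tree with the later key.
placed = dealias("FUTURE.1.2", key_f)
```

With `key_p`, the call `dealias("FUTURE.1.2", key_p)` raises `KeyError`, so the same taxon name is placeable or not depending on which alias key was in use.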

Changes to cladecombiner source code could result in changes either to the tree (via bugs or via how recombinants are handled, possibly other things I'm not currently seeing) or to how it makes mapping decisions given a tree. So tracking versioning information of cladecombiner itself is also going to be important.
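One way this tracking could look, as a sketch only: record the installed package version and a hash of the alias key next to the saved mapping. This assumes the distribution is installed under the name `cladecombiner` and that the alias key is a JSON-serializable dict. The helpers `package_version`, `fingerprint`, and `provenance` are hypothetical, not part of cladecombiner.

```python
import hashlib
import json
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def fingerprint(obj):
    """SHA-256 of a canonical JSON encoding (sorted keys, compact separators)."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def provenance(alias_key, package="cladecombiner"):
    """Bundle the information needed to tell whether a mapping is comparable."""
    return {
        "package": package,
        "package_version": package_version(package),
        "alias_key_sha256": fingerprint(alias_key),
    }


# Hypothetical alias key, for illustration only.
record = provenance({"ROOT": "", "EXAMPLE": "ROOT.1.1.1"})
```

Two saved mappings whose records differ in either field were produced under different conditions, which is a cue to check portability rather than assume it.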

3: approximate sequence-level portability

Via something like #9, we could approximate sequence-level portability. That is, if on future date F we call lineage FUTURE.1.2, which has no alias known on past date P, we can:

1. Collapse all such taxa without aliases at time P into the taxa that did have aliases then using the alias map at time F.
2. Apply taxon-level portability to the results.
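The two steps above could be sketched as follows. This is one possible reading, not a settled design: it assumes "collapse" means walking up the de-aliased name to the nearest ancestor that the mapping stored at time P covers, that ancestry can be read off by dropping trailing name components (which ignores recombinants), and that the stored mapping is keyed by fully expanded names. All function names and data are hypothetical.

```python
def dealias(name, alias_key):
    """Expand the leading alias of a dotted taxon name using alias_key."""
    prefix, _, suffix = name.partition(".")
    expansion = alias_key[prefix]  # KeyError if the alias is unknown
    if not expansion:
        return name
    return f"{expansion}.{suffix}" if suffix else expansion


def ancestors(full_name):
    """Yield full_name, then each ancestor, nearest first."""
    parts = full_name.split(".")
    for end in range(len(parts), 0, -1):
        yield ".".join(parts[:end])


def map_future_taxon(name, alias_key_f, mapping_p):
    """Map a taxon named at time F using the mapping stored at time P."""
    full = dealias(name, alias_key_f)      # step 1: place it with F's alias key
    for candidate in ancestors(full):      # step 1: collapse toward P's taxa
        if candidate in mapping_p:
            return mapping_p[candidate]    # step 2: reuse P's decision
    return None                            # no ancestor was covered at P


# Hypothetical alias key at F and stored mapping from P.
alias_key_f = {"ROOT": "", "EXAMPLE": "ROOT.1.1.1", "FUTURE": "ROOT.1.1.1.1.2.3"}
mapping_p = {"ROOT": "other", "ROOT.1.1.1": "EXAMPLE", "ROOT.1.1.1.1": "EXAMPLE.1"}

target = map_future_taxon("FUTURE.1.2", alias_key_f, mapping_p)
```

Here `FUTURE.1.2` expands to `ROOT.1.1.1.1.2.3.1.2`, and its nearest ancestor covered at P is `ROOT.1.1.1.1`, so it lands in `EXAMPLE.1`. Whether "nearest covered ancestor" is the right collapse rule is exactly the kind of assumption that would need to be spelled out.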

There are a number of assumptions being made here, which I'm not currently crystal clear on, but which we would need to spell out if trying this.

Taxon recognition

So far we have ignored the issue of whether a particular taxon is recognized, i.e., in https://github.com/cov-lineages/pango-designation/blob/master/lineages.csv. That is, a taxon that is considered valid and could show up in data today might no longer be considered valid, or show up in data, next month. This is not an issue for true sequence-level portability (the taxon would be valid under the correct assigner) or taxon-level portability (it's out of scope) but would be for approximate sequence-level portability. I'm not sure there's anything to do about it, but if we pursue this we could perhaps warn users (check for demotions).
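A demotion check could be as simple as a set difference between two snapshots of the recognized taxa. The sketch below assumes the two lists of names have already been read from somewhere (for example, two versions of the file linked above). The helper names and the example taxa are hypothetical.

```python
import warnings


def find_demotions(recognized_then, recognized_now):
    """Return taxa recognized in the earlier snapshot but not the later one."""
    return sorted(set(recognized_then) - set(recognized_now))


def warn_on_demotions(recognized_then, recognized_now):
    """Warn the user if any previously recognized taxa have disappeared."""
    demoted = find_demotions(recognized_then, recognized_now)
    if demoted:
        warnings.warn(
            f"{len(demoted)} taxa recognized earlier are no longer recognized: "
            + ", ".join(demoted)
        )
    return demoted


# Hypothetical snapshots of recognized taxon names.
then = ["EXAMPLE.1", "EXAMPLE.1.2", "EXAMPLE.2"]
now = ["EXAMPLE.1", "EXAMPLE.2", "EXAMPLE.2.1"]
demoted = find_demotions(then, now)
```

Newly added taxa (here `EXAMPLE.2.1`) are deliberately not reported, since only removals threaten approximate sequence-level portability.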

swo · 2024-11-22T20:09:32Z

To make sure I'm following, the idea here is to be able to say what modeling unit you would have mapped a particular taxon to, if that taxon was not present when you called cladecombiner in the past?

afmagee42 mentioned this issue Nov 22, 2024

Documentation #24

Merged
