Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clone-schema-updates #778

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Clone-schema-updates #778

wants to merge 4 commits into from

Conversation

scharch
Copy link
Contributor

@scharch scharch commented Mar 21, 2024

Draft of updates to Clone.
Closes:
#201 (incidentally)
#317
#333
#378
#769

Open questions, both related to the switch from Repertoire to RepertoireGroup:

  • Do we need a more specific way to address the _ids in members (etc)? Currently, one would have to search each Repertoire in the RepertoireGroup, which seems inefficient. OTOH, having members (etc) be arrays of arrays is uuuuuuugly (and not sure it's even allowed).
  • What do we do about data_processing_id? Can RepertoireGroups get their own DataProcessing objects?

Other remaining tasks before this can be merged:

  • Sync schemas
  • Unit tests for loading Clones
  • Docs updates
  • ...?

@scharch scharch added this to the AIRR 2.0 milestone Mar 21, 2024
@scharch
Copy link
Contributor Author

scharch commented Mar 25, 2024

From the call:

  • push members and inferred_members down to a single revived but skinny Node array
  • Node should have both sequence_id and cell_id with only one populated (but maybe add a level shortcut too?)
  • Change Cell.inferred and Rearrangement.inferred to an enum to match Repertoire.type and add this to Node as well
    • Does an inferred Node need to be part of a Repertoire at all? Maybe this allows us to drop that bit entirely...
    • preference was for 'observed' or 'physical' in the current enum
  • Add repertoire_id to Node to facilitate retrieval of underlying data
  • Did not discuss what to do about data_processing_id

@javh
Copy link
Contributor

javh commented Mar 25, 2024

Random thoughts:

  • For repertoire_type what about observed instead of physical?
  • Maybe the inferred boolean for Rearrangement could instead be sequence_type and use the same enum as repertoire_type? Same for Cell and a cell_type field.
  • As an alternative to level/members in Clone, we could have sequences / cells which are explicitly typed and set the expectation to only populate one (we could still retain level in this case).
  • Another alternative to level/members in Clone would be to have a different version of Node that contains the sequence_id/cell_id fields (without the junction and whatnot), and possibly the level/sequence_type/cell_type, and possibly the repertoire_id so that doesn't need to be fetched out of RepertoireGroup somehow. Ie, members contains Node references.

@schristley
Copy link
Member

  • What do we do about data_processing_id? Can RepertoireGroups get their own DataProcessing objects?

@scharch It's best to think of DataProcessing as an independent, orthogonal object to the other (top-level) AIRR objects. Anything that can be part of DataProcessing should have a data_processing_id so yes including one in RepertoireGroup makes sense.

@scharch
Copy link
Contributor Author

scharch commented May 9, 2024

OK, let me know how this looks.

  • I changed "physical" to "observed" and made rearrangement_type and cell_type enums instead of inferred booleans
  • I re-added Node as a way get repertoire_id in. It feels a little "overweight" but probably the best solution?

@javh
Copy link
Contributor

javh commented Aug 12, 2024

From the call:

  • We should take a closer look at it and see if it's ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a "nonphysical" keyword to Rearrangement and Cell Extend Clone to single-cell context
3 participants