Extend Clone to single-cell context #317

Open
2 tasks
scharch opened this issue Jan 16, 2020 · 56 comments · May be fixed by #778
Labels
Clones Clone, tree and node schema topics
Comments

@scharch
Contributor

scharch commented Jan 16, 2020

Starting to think about this in the context of generating a lot of 10x VDJ data... it seems we will want to (eventually) have a way for Clones to contain cells (see #273 (comment)), instead of (or maybe in addition to) Rearrangements.

Just a marker for now, need to think more about what kind of representation would make sense...

Issues to be resolved:

  • How to represent multiple chains? Are they embedded in a single Clone object, do we have multiple Clone rows (which introduces other problems), do we create a separate CloneChain object, or something else?
  • What are the key relationships with other AIRR objects and how/where are the identifiers stored?
@schristley
Member

Should the Clone definition also contain both chains? Right now it seems to support only one.

@schristley schristley modified the milestones: ADC V1.1, AIRR v1.4.0 Jul 14, 2020
@scharch
Contributor Author

scharch commented Jul 14, 2020

@schristley, I think it will have to support germline_alignment (and all the related fields) as an array, of some sort, yes.

@schristley
Member

In a separate call, @bussec and I discussed how to do this flexibly. It would be nice not to be limited to strictly two chains. It also is hard to come up with a terminology that covers both T and B cells. There was also the desire to be able to annotate non-productive chains. Using a dictionary or array object should allow multiple entries. Using a controlled vocabulary, we could use T and B cell specific terms to annotate/tag the chains. At the same time, we should make it easy to access the primary annotations directly.

@javh
Contributor

javh commented Jul 15, 2020

It also is hard to come up with a terminology that covers both T and B cells.

This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

@schristley
Member

It also is hard to come up with a terminology that covers both T and B cells.

This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

I heard a suggestion like "d-containing chain" and "not-d" but there's the concern it's not very robust. My question would be, do we have to have the same name? Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

Sure, tools would have to handle them specifically, but wouldn't they kinda have to do that anyways, like tools would want to know regardless if it was IGH versus TRB?

@scharch
Contributor Author

scharch commented Jul 15, 2020

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

@schristley I was just coming here to suggest essentially the same thing.

It'll still get complicated, though: if each chain is a dict with keys something like {id, type, is_productive}, then a Cell would be an array of those and the "members" of Clone ends up being an array of arrays of dicts. Does that seem workable?
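
A minimal Python sketch of that nesting, just to see how it reads in practice. The ids, the `members` key, and the helper name are assumptions for illustration; only the `{id, type, is_productive}` keys come from the suggestion above.

```python
# Hypothetical chain records using the suggested {id, type, is_productive} keys.
cell_a = [
    {"id": "sequence_id1", "type": "heavy_chain", "is_productive": True},
    {"id": "sequence_id2", "type": "light_chain", "is_productive": True},
    {"id": "sequence_id3", "type": "light_chain", "is_productive": False},
]
cell_b = [
    {"id": "sequence_id4", "type": "heavy_chain", "is_productive": True},
    {"id": "sequence_id5", "type": "light_chain", "is_productive": True},
]

# Each Cell is an array of chain dicts, so the Clone "members"
# become an array of arrays of dicts ("members" is an assumed key name).
clone = {"clone_id": "clone1", "members": [cell_a, cell_b]}


def productive_ids(clone, chain_type):
    """Collect ids of productive chains of one type across all cells."""
    return [
        chain["id"]
        for cell in clone["members"]
        for chain in cell
        if chain["type"] == chain_type and chain["is_productive"]
    ]


print(productive_ids(clone, "light_chain"))
```

The double loop is the cost of the array-of-arrays layout: every consumer has to walk cells and then chains to get at a single kind of chain.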

At the same time, we should make it easy to access the primary annotations directly.

Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

@javh
Contributor

javh commented Jul 15, 2020

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

This is hard to use (have to check every object for field presence before fetching data), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

https://doi.org/10.1016/j.cell.2019.05.007

@scharch
Contributor Author

scharch commented Jul 15, 2020

@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

Anyway, I think that having a type field would help with the parsing you are concerned about.

{
    cell: 'cell_id',
    type: 'b_cell',
    heavy_chain: ['sequence_id1'],
    light_chain: ['sequence_id2', 'sequence_id3']
}

But probably even better would be something like

{
    cell: 'cell_id',
    type: 'b_cell',
    chains: [
        { sequence: 'sequence_id1', type: 'heavy_chain', ... },
        { sequence: 'sequence_id2', type: 'light_chain', ... },
        { sequence: 'sequence_id3', type: 'light_chain', ... }
    ]
}
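
As a rough illustration of why the per-chain type field helps with parsing, here is a Python sketch that groups the sequences of one cell by chain type without checking for the presence of any chain-specific field. The helper name is an assumption, and the elided extra keys are simply left out.

```python
from collections import defaultdict

# The second layout from above, written as a Python dict.
cell = {
    "cell": "cell_id",
    "type": "b_cell",
    "chains": [
        {"sequence": "sequence_id1", "type": "heavy_chain"},
        {"sequence": "sequence_id2", "type": "light_chain"},
        {"sequence": "sequence_id3", "type": "light_chain"},
    ],
}


def sequences_by_chain_type(cell):
    """Group sequence ids by the 'type' value of each chain entry."""
    grouped = defaultdict(list)
    for chain in cell["chains"]:
        grouped[chain["type"]].append(chain["sequence"])
    return dict(grouped)


print(sequences_by_chain_type(cell))
```

The same loop would work unchanged for a T cell with alpha and beta chains, since the chain vocabulary lives in the values rather than in the key names.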

@schristley
Member

Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

This is hard to use (have to check every object for field presence), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

I still need to think through the Cell-Clone relationship, but focusing purely on Clone right now, we could still have explicit field names, but with generic names (chain_1, chain_2, primary_chain, secondary_chain, long_chain, short_chain). Actually, as a matter of fact, maybe keep the exact same Clone fields we have right now (v_call, j_call, etc.) but just add new fields for the second chain. And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA

This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.
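
A Python sketch of how a tool might populate such a record, assuming only the four loci listed in the enums above. The `locus` input key, the `additional_chains` name for the optional overflow list, and the example gene calls are all assumptions for illustration, not part of any schema.

```python
PRIMARY_LOCI = {"IGH", "TRB"}    # values sketched for chain_type
SECONDARY_LOCI = {"IGL", "TRA"}  # values sketched for chain_type_1


def flatten_chains(chains):
    """Map a list of chains onto flat primary/secondary Clone fields.

    The first primary-locus chain fills v_call/chain_type, the first
    secondary-locus chain fills v_call_1/chain_type_1, and anything
    left over goes into a hypothetical 'additional_chains' list.
    """
    record = {"additional_chains": []}
    for chain in chains:
        locus = chain["locus"]
        if locus in PRIMARY_LOCI and "v_call" not in record:
            record["v_call"] = chain["v_call"]
            record["chain_type"] = locus
        elif locus in SECONDARY_LOCI and "v_call_1" not in record:
            record["v_call_1"] = chain["v_call"]
            record["chain_type_1"] = locus
        else:
            record["additional_chains"].append(chain)
    return record


example_chains = [
    {"locus": "IGH", "v_call": "IGHV1-2*02"},
    {"locus": "IGL", "v_call": "IGLV2-14*01"},
    {"locus": "IGL", "v_call": "IGLV3-21*01"},
]
print(flatten_chains(example_chains))
```

A tool that only reads `v_call` and `chain_type` would keep working on this record, which is the backward-compatibility point made above.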

@javh
Contributor

javh commented Jul 15, 2020

@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

I don't know. Probably only if a need arises. Though, naively, it looks trivial to my eye. You use clone_id as the row key and exclude the sequences field. If you need the individual sequence level data, you'd then search the Rearrangement data by clone_id. Then it's just a clone summary table. But, that's without considering Cell.
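
For what it's worth, the naive version might look like the Python sketch below: one row per clone keyed by clone_id, with the sequences field dropped. The function name and the example record are assumptions, and this deliberately ignores Cell and multiple chains.

```python
import csv
import io


def clones_to_tsv(clones):
    """Write one row per clone, keyed by clone_id, dropping 'sequences'."""
    rows = [{k: v for k, v in c.items() if k != "sequences"} for c in clones]
    # clone_id first, remaining columns in sorted order for a stable header.
    other = sorted({k for row in rows for k in row} - {"clone_id"})
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["clone_id"] + other, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a clone are left empty
    return out.getvalue()


example = [
    {
        "clone_id": "c1",
        "v_call": "IGHV1-2*02",
        "j_call": "IGHJ4*02",
        "sequences": ["s1", "s2"],
    }
]
print(clones_to_tsv(example))
```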

Some sort of type field seems like it might be a solution. Though, you'd still have to do a check of some kind, but it would be a simpler check.

The way Clone is set up right now seems really geared towards IGH/TRB/TRD data only. Hrm.

@schristley
Member

Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

I'm still thinking through this. A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells. Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?
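
To make the "Cell points to its Clone" direction concrete, here is a small Python sketch. The ids and the helper name are assumptions; the point is only that the Clone-to-Cells list can be derived on demand from a single clone_id field per Cell.

```python
from collections import defaultdict

# Each Cell carries a single clone_id (hypothetical example records).
cells = [
    {"cell_id": "cell1", "clone_id": "cloneA"},
    {"cell_id": "cell2", "clone_id": "cloneA"},
    {"cell_id": "cell3", "clone_id": "cloneB"},
]


def cells_by_clone(cells):
    """Invert the Cell -> Clone pointers into a Clone -> [cell_id] index."""
    index = defaultdict(list)
    for cell in cells:
        index[cell["clone_id"]].append(cell["cell_id"])
    return dict(index)


print(cells_by_clone(cells))
```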

@scharch
Contributor Author

scharch commented Jul 16, 2020

And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.

Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?

Sort of? Not the way it's currently set up with only one chain, but this should be correct under the extension models we are discussing.

@schristley
Member

And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

An optional extended data structure like you suggested above for providing additional chains.

@schristley
Member

A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.

Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

"better" only in a data structure sense. As a Cell belongs to one Clone, it could be represented with a single field clone_id, while a Clone containing many Cells would require an array of cell_ids.

@bcorrie
Contributor

bcorrie commented Feb 9, 2022

OK, we are currently implementing 10X data loading for rearrangements/clones/cells/expression.

We can currently load everything in principle and in practice, based on the current AIRR Spec.

The problem arises when you try to map a specific tool chain (e.g. 10X cellranger) to the spec, in particular one that generates all of the data types as part of one processing run - when everything blows up.

I think this issue is the crux of the matter - and we appear to have been avoiding it since July 2020 8-)

In the 10X case you get:

  • A single clone_id has multiple chains. I have seen two and three chains thus far for a single clone_id
  • Our current Clone object is focused on a single chain only
  • Pretty well all fields in the Clone object that describe the clone (VDJ calls, junction, alignment, sequences) need to be different for each of the chains in the clone (not just the VDJ calls as discussed above). I count 18 fields based on a quick count.

So we can't really load 10X data in a particularly logical or coherent fashion when you try to do all of Rearrangements/Clones/Cells in a single repository. I am pretty sure this would also mean that you couldn't represent said data in a set of files on disk using a Manifest to tie them together...

This seems like something that should be pretty high on the priority list if we really want to claim that we have a working Rearrangement/Clone/Cell spec 8-)

@scharch
Contributor Author

scharch commented Feb 23, 2022

The solution, which isn't perfect, is to introduce a second identifier, in this case data_processing_id, which splits the N-N relationships into M 1-N relationships, M being the number of different data processings. How does that work concretely? Every object that is the "output" of a data processing (like Clone) has a data_processing_id. Thus data_processing_id can be used to partition the whole Clone table into subsets. We've talked about this with rearrangements: imagine processing with IgBlast and MiXCR as two separate data processings; they can be stored together yet separated by their different data_processing_ids.

So the Cell could have a list of clones, which is a compound identifier (clone_id, data_processing_id)

clones: [{clone_id:123, data_processing_id:456}, {clone_id:abc1, data_processing_id:567}]

I see. This makes sense to me, you'd just update the Cell record(s) in the repository to add a new compound identifier to the list. Seems reasonable enough...
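
A quick Python sketch of how that lookup might work, reusing the example values from the list above (as strings). The cell_id and the helper name are assumptions.

```python
# A Cell holding a list of compound (clone_id, data_processing_id) identifiers.
cell = {
    "cell_id": "cell1",
    "clones": [
        {"clone_id": "123", "data_processing_id": "456"},
        {"clone_id": "abc1", "data_processing_id": "567"},
    ],
}


def clone_for_processing(cell, data_processing_id):
    """Return the clone_id assigned to this cell by one data processing.

    Within a single data processing the Cell-Clone relationship is 1-N,
    so at most one entry should match; more than one is treated as an error.
    """
    matches = [
        entry["clone_id"]
        for entry in cell["clones"]
        if entry["data_processing_id"] == data_processing_id
    ]
    if len(matches) > 1:
        raise ValueError("cell assigned to several clones in one data processing")
    return matches[0] if matches else None


print(clone_for_processing(cell, "456"))
```

Adding the result of a new clonal inference then amounts to appending one more entry to `clones`, without touching the existing ones.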

@schristley
Member

@scharch @javh It's been a while since the last discussion burst. Do you think we've enough concrete ideas to adjust the draft objects?

There's going to be a large set of single-cell studies coming down the pipe and going into the ADC, it would be good to implement some of these ideas and see how they work.

@bcorrie
Contributor

bcorrie commented Feb 6, 2023

FYI we have loaded one 10X single cell study into the ADC already (with rearrangements, clones, cells, and GEX), and our clone compromise was to choose one of the chains for clone_id, create a single clone, and store consensus clone data (VDJ+Junction) from one chain. You can find the rearrangement for the other chain using the clone_id in the Rearrangement collection, but due to limits in our clone object we store only a single VDJ/Junction.

When you load the data into a repository you choose which chain you want to focus on.

Far from ideal but seemed like a decent compromise.
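
As a sketch of what that lookup looks like on the client side (Python, with made-up records; the helper name is an assumption and this is not the actual repository code): given the clone_id and the locus that was chosen for the Clone object, the other chain is recovered by filtering the Rearrangement records.

```python
# Hypothetical Rearrangement records; clone_id and locus are the fields used.
rearrangements = [
    {"sequence_id": "r1", "clone_id": "c1", "locus": "IGH"},
    {"sequence_id": "r2", "clone_id": "c1", "locus": "IGK"},
    {"sequence_id": "r3", "clone_id": "c2", "locus": "IGH"},
]


def other_chain_rearrangements(rearrangements, clone_id, clone_locus):
    """Rearrangements in the clone whose locus differs from the stored chain."""
    return [
        r
        for r in rearrangements
        if r["clone_id"] == clone_id and r["locus"] != clone_locus
    ]


print(other_chain_rearrangements(rearrangements, "c1", "IGH"))
```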

@bcorrie
Contributor

bcorrie commented Feb 6, 2024

This one seems like a big one - I think we need to decide as to whether this gets fixed as part of v2.0 or is noted as a weakness/gap in the standard that is not currently addressed.

@scharch
Contributor Author

scharch commented Feb 6, 2024

Yeah this is one of the ones on my personal to-do list...

@bcorrie
Contributor

bcorrie commented Feb 6, 2024

We now have 4 single-cell 10X studies in ADC, and each study's Clone data is loaded with a single chain only (as described above), even though the clone in this case is a paired chain clone.

It would be nice if we could fix this (although it means I would need to update a bunch of data) 8-)

@schristley
Member

15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data. The backlog of studies is still steadily growing. I think this is one of the high priority items that is needed if the ADC is to grow beyond just rearrangement data.

@scharch
Contributor Author

scharch commented Feb 6, 2024

OK I'll try to put a PR together for discussion on the March call...

@bcorrie
Contributor

bcorrie commented Feb 7, 2024

15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data.

Yes, I meant there are 4 10X studies with loaded Clone data - with that data loaded in an "unsatisfactory" way because Clone is not oriented towards paired chains.

@scharch
Contributor Author

scharch commented Feb 29, 2024

If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...

@schristley
Member

If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...

@scharch If I understand what you mean, this is already there with cell_id in the rearrangement object. That lets you pull out the rearrangements for a specific Cell.

@scharch
Contributor Author

scharch commented Mar 3, 2024

No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around).

@schristley
Member

No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around).

Sorry, I'm still not understanding. Is the "meaning" of the link to say that those are the "equivalent" chains in two different Cells? If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection? I mean, if two Cells are in the same Clone, the TRB gene should be the same in both Cells. Likewise for the alpha chain. I can see there might be some ambiguity with B cells and SHM.

I guess another way to ask the question is how would you use that link? What problem would it solve for you?
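
To make the "implied by VDJ calls" idea concrete, here is a rough Python sketch that pairs rearrangements from two Cells when they share locus, V call, and J call. The sequence ids are the ones from the example above; the gene calls attached to them and the helper names are invented for illustration.

```python
def chain_key(rearrangement):
    """Key used to decide that two chains are 'equivalent' across cells."""
    return (rearrangement["locus"], rearrangement["v_call"], rearrangement["j_call"])


def link_chains(cell_a, cell_b):
    """Map each sequence_id in cell_a to matching sequence_ids in cell_b."""
    by_key = {}
    for r in cell_b:
        by_key.setdefault(chain_key(r), []).append(r["sequence_id"])
    return {r["sequence_id"]: by_key.get(chain_key(r), []) for r in cell_a}


# Gene calls below are hypothetical, chosen only to make the example work.
cell1 = [
    {"sequence_id": "TRB123", "locus": "TRB", "v_call": "TRBV7-9", "j_call": "TRBJ2-1"},
    {"sequence_id": "TRA456", "locus": "TRA", "v_call": "TRAV12-2", "j_call": "TRAJ33"},
    {"sequence_id": "TRA789", "locus": "TRA", "v_call": "TRAV21", "j_call": "TRAJ20"},
]
cell2 = [
    {"sequence_id": "TRB098", "locus": "TRB", "v_call": "TRBV7-9", "j_call": "TRBJ2-1"},
    {"sequence_id": "TRA765", "locus": "TRA", "v_call": "TRAV21", "j_call": "TRAJ20"},
    {"sequence_id": "TRA432", "locus": "TRA", "v_call": "TRAV12-2", "j_call": "TRAJ33"},
]
print(link_chains(cell1, cell2))
```

If two alpha chains in one Cell shared the same V and J calls, each would match both candidates in the other Cell, so the key would need to be extended (for example with the junction) to disambiguate.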

@scharch
Contributor Author

scharch commented Mar 3, 2024

If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection?

For the researcher looking at the data? Almost certainly. The question is if we need to make it easy to do by code.

how would you use that link? What problem would it solve for you?

Dunno. I was asking if it was something worth designing around when I'm trying to figure out an updated Clone schema. If no one has a use case, then that's my answer :)

@schristley
Member

@scharch Even though it might not be in our list of requirements, I'll note that Clone is perfectly amenable to a TSV format if only that pesky sequences array is dealt with. There could be considerable benefit and uptake to the Clone spec if toolchains like Immcantation and Repcalc, which are already processing clone TSV files (I may not be completely correct about that), don't need significant retooling to support AIRR. Bonus points in that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change.

@scharch
Contributor Author

scharch commented Mar 4, 2024

if only that pesky sequences array is dealt with

that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change

It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.

BUT! I think #769 can solve this, too. There, I am proposing representing inferred naive ancestors as "nonphysical" Rearrangements (or Cells). So if I am understanding correctly what you want, you could just filter for nonphysical==True et voila!

Edit: It's probably not that simple. You'd probably have to iterate through Clone objects and extract the naive_ancestor from each. But the point is that it would still be present as a nonphysical rearrangement, and I'd bet it would be relatively straightforward to tweak that a little to help your use case.

@schristley
Member

It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.

Hmm, probably, for T cells at least this is essentially a collapse of (potentially many) rearrangement records into a single record (with a count). Yeah, for B cells, is it a naive ancestral sequence? Or is it a consensus sequence? In either case you are right, it is a computationally inferred sequence (nonphysical might not be the right descriptor) versus being an observed sequence.

That's even assuming I care about the sequence. I'm likely thinking about it wrong, as a smaller, more compact representation of data in the rearrangements (though still potentially large), while you are thinking about it as actual biology.
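
For the T cell case, the "compact representation" view can be sketched in a few lines of Python: identical rearrangements are collapsed into one record with a count. The choice of (v_call, j_call, junction_aa) as the identity key, the `count` field name, and the example values are assumptions for illustration.

```python
from collections import Counter


def collapse(rearrangements):
    """Collapse rearrangements sharing V call, J call and junction into one row."""
    counts = Counter(
        (r["v_call"], r["j_call"], r["junction_aa"]) for r in rearrangements
    )
    return [
        {"v_call": v, "j_call": j, "junction_aa": junction, "count": n}
        for (v, j, junction), n in counts.items()
    ]


# Made-up example records.
example_rearrangements = [
    {"v_call": "TRBV7-9", "j_call": "TRBJ2-1", "junction_aa": "CASSLGQGYEQYF"},
    {"v_call": "TRBV7-9", "j_call": "TRBJ2-1", "junction_aa": "CASSLGQGYEQYF"},
    {"v_call": "TRBV20-1", "j_call": "TRBJ1-2", "junction_aa": "CSARDRGYGYTF"},
]
print(collapse(example_rearrangements))
```

A table like this can feed gene usage or CDR3 length calculations directly, which is a different use from the lineage-centric Clone object discussed above.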

@javh
Contributor

javh commented Mar 4, 2024

I think it might matter where we land in #768. If Cell is supposed to be just cell metadata, and not a container for sequence/expression/etc data, then maybe the same should be true of Clone? Meaning, instead of refs to Rearrangement/Cell in clone, we rely on clone_id in Rearrangement/Cell to link members of the same clone.

@scharch
Contributor Author

scharch commented Mar 4, 2024

I think it might matter where we land in #768. If Cell is supposed to be just cell metadata, and not a container for sequence/expression/etc data, then maybe the same should be true of Clone? Meaning, instead of refs to Rearrangement/Cell in clone, we rely on clone_id in Rearrangement/Cell to link members of the same clone.

@javh we have explicitly rejected this approach for Clone on the thought that (again, for B cells) clonality might be calculated multiple times in different ways which would get really messy this way. Plus, clonality might be recalculated after initial deposit/curation and we want Rearrangement and Cell records to be static.

@scharch
Contributor Author

scharch commented Mar 4, 2024

(nonphysical might not be the right descriptor)

"virtual" was already taken XD
happy for all suggestions

@javh
Contributor

javh commented Mar 4, 2024

@javh we have explicitly rejected this approach for Clone on the thought that (again, for B cells) clonality might be calculated multiple times in different ways which would get really messy this way. Plus, clonality might be recalculated after initial deposit/curation and we want Rearrangement and Cell records to be static.

I think we'll have that issue regardless, because we only have one clone_id field in Rearrangement.

@scharch
Contributor Author

scharch commented Mar 4, 2024

@javh #446

@scharch scharch linked a pull request Mar 21, 2024 that will close this issue
6 tasks
@bcorrie
Contributor

bcorrie commented Aug 12, 2024

Closes #778

@bcorrie bcorrie linked a pull request Aug 12, 2024 that will close this issue
6 tasks
@javh javh added Clones Clone, tree and node schema topics and removed reactivity Reactivity labels Sep 9, 2024
4 participants