Extend Clone to single-cell context #317

scharch · 2020-01-16T21:32:04Z

Starting to think about this in the context of generating a lot of 10x VDJ data... it seems we will want to (eventually) have a way for Clones to contain cells (see #273 (comment)), instead of (or maybe in addition to) Rearrangements.

Just a marker for now, need to think more about what kind of representation would make sense...

Issues to be resolved:

How to represent multiple chains? Are they embedded in a single Clone object, do we have multiple Clone rows (which introduces other problems), do we create a separate CloneChain object, or something else?
What are the key relationships with other AIRR objects and how/where are the identifiers stored?


schristley · 2020-07-14T15:20:10Z

Should the Clone definition also contain both chains? Right now it seems to support only one.

scharch · 2020-07-14T16:52:24Z

@schristley, I think it will have to support germline_alignment (and all the related fields) as an array, of some sort, yes.

schristley · 2020-07-15T18:33:06Z

In a separate call, @bussec and I discussed how to do this flexibly. It would be nice not to be limited to strictly two chains. It also is hard to come up with a terminology that covers both T and B cells. There was also the desire to be able to annotate non-productive chains. Using a dictionary or array object should allow multiple entries. Using a controlled vocabulary, we could use T and B cell specific terms to annotate/tag the chains. At the same time, we should make it easy to access the primary annotations directly.

javh · 2020-07-15T18:48:50Z

> It also is hard to come up with a terminology that covers both T and B cells.

This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

schristley · 2020-07-15T19:04:18Z

> > It also is hard to come up with a terminology that covers both T and B cells.
>
> This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe long_chain and short_chain?

I heard a suggestion like "d-containing chain" and "not-d" but there's the concern it's not very robust. My question would be, do we have to have the same name? Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

Sure, tools would have to handle them specifically, but wouldn't they kinda have to do that anyways, like tools would want to know regardless if it was IGH versus TRB?

scharch · 2020-07-15T19:19:56Z

> Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

@schristley I was just coming here to suggest essentially the same thing.

It'll still get complicated, though: if each chain is a dict with keys something like {id, type, is_productive}, then a Cell would be an array of those and the "members" of Clone ends up being an array of arrays of dicts. Does that seem workable?

> At the same time, we should make it easy to access the primary annotations directly.

Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

javh · 2020-07-15T19:27:39Z

> Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?

This is hard to use (have to check every object for field presence before fetching data), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

https://doi.org/10.1016/j.cell.2019.05.007

scharch · 2020-07-15T19:38:45Z

@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

Anyway, I think that having a type field would help with the parsing you are concerned about.

{
    cell:'cell_id',
    type:'b_cell',
    heavy_chain: [ 'sequence_id1' ],
    light_chain: [ 'sequence_id2', 'sequence_id3' ]
}

But probably even better would be something like

{
    cell:'cell_id',
    type:'b_cell',
    chains:[
        { sequence:'sequence_id1', type:'heavy_chain',... },
        { sequence:'sequence_id2', type:'light_chain',... },
        { sequence:'sequence_id3', type:'light_chain',... }
    ]
}
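
To illustrate why the second layout is simpler to parse, here is a minimal Python sketch, assuming a cell record shaped like the one above. The helper `chains_by_type` is hypothetical and not part of any AIRR library; the IDs are the placeholders from the example. A single loop over `chains` is enough, with no per-field presence checks.

```python
from collections import defaultdict

# Hypothetical cell record following the second layout above. Field names
# (cell, type, chains, sequence) come from the sketch; values are placeholders.
cell = {
    "cell": "cell_id",
    "type": "b_cell",
    "chains": [
        {"sequence": "sequence_id1", "type": "heavy_chain"},
        {"sequence": "sequence_id2", "type": "light_chain"},
        {"sequence": "sequence_id3", "type": "light_chain"},
    ],
}


def chains_by_type(cell_record):
    """Map each chain type to the list of sequence IDs that carry it."""
    grouped = defaultdict(list)
    for chain in cell_record.get("chains", []):
        grouped[chain["type"]].append(chain["sequence"])
    return dict(grouped)


grouped = chains_by_type(cell)
print(grouped)
```

The same function would handle a T cell record, or a cell with more than two chains, without modification.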

schristley · 2020-07-15T19:40:51Z

> > Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type?
>
> This is hard to use (have to check every object for field presence), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things:

I still need to think through the Cell-Clone relationship, but focussing purely on Clone right now, we could still have explicit field names, but with generic names (chain_1, chain_2, primary_chain, secondary_chain, long_chain, short_chain). Actually, as a matter of fact, maybe keep the exact same Clone fields we have right now (v_call, j_call, etc.) but just add new fields for the second chain. And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

v_call:
    type: string
chain_type:
    type: string
    enum:
        - IGH
        - TRB
v_call_1:
    type: string
chain_type_1:
    type: string
    enum:
        - IGL
        - TRA

This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated.
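
A rough Python sketch of how a tool might check a record against this flat two-chain layout. The enum values are copied as-is from the schema fragment above; the function `check_clone` and the example gene calls are illustrative assumptions, not part of the AIRR schema.

```python
# Enum values copied from the schema sketch above; a real vocabulary would
# presumably also cover loci such as IGK, TRG and TRD.
PRIMARY_LOCI = {"IGH", "TRB"}
SECONDARY_LOCI = {"IGL", "TRA"}


def check_clone(clone):
    """Return a list of problems found in a flat two-chain clone record."""
    problems = []
    if clone.get("chain_type") not in PRIMARY_LOCI:
        problems.append("chain_type must be one of " + ", ".join(sorted(PRIMARY_LOCI)))
    # The second chain is optional, so only validate it when present.
    second = clone.get("chain_type_1")
    if second is not None and second not in SECONDARY_LOCI:
        problems.append("chain_type_1 must be one of " + ", ".join(sorted(SECONDARY_LOCI)))
    return problems


# Placeholder records; the gene calls are illustrative values only.
paired = {"v_call": "IGHV1-2*02", "chain_type": "IGH",
          "v_call_1": "IGLV2-14*01", "chain_type_1": "IGL"}
single = {"v_call": "TRBV7-9*01", "chain_type": "TRB"}
swapped = {"v_call": "IGLV2-14*01", "chain_type": "IGL",
           "v_call_1": "IGHV1-2*02", "chain_type_1": "IGH"}

for record in (paired, single, swapped):
    print(record["chain_type"], check_clone(record))
```

A single-chain record passes unchanged, which matches the point that tools unaware of the second chain can keep using the current fields.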

javh · 2020-07-15T19:59:16Z

> @javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this.

I don't know. Probably only if a need arises. Though, naively, it looks trivial to my eye. You use clone_id as the row key and exclude the sequences field. If you need the individual sequence level data, you'd then search the Rearrangement data by clone_id. Then it's just a clone summary table. But, that's without considering Cell.
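
A minimal sketch of that naive conversion in Python, assuming each Clone is a flat dictionary apart from its `sequences` array. The function `clones_to_tsv` and the example records are hypothetical; only `clone_id` and `sequences` are taken from the description above.

```python
import csv
import io

# Placeholder clone records; gene calls and IDs are illustrative only.
clones = [
    {"clone_id": "clone1", "v_call": "IGHV1-2*02", "j_call": "IGHJ4*02",
     "sequences": ["sequence_id1", "sequence_id2"]},
    {"clone_id": "clone2", "v_call": "IGHV3-23*01", "j_call": "IGHJ6*02",
     "sequences": ["sequence_id3"]},
]


def clones_to_tsv(clone_records, exclude=("sequences",)):
    """Write one row per clone, keyed by clone_id, dropping excluded fields."""
    other = sorted({key for clone in clone_records for key in clone
                    if key != "clone_id" and key not in exclude})
    columns = ["clone_id"] + other
    buffer = io.StringIO()
    # extrasaction="ignore" silently skips the excluded fields in each record.
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(clone_records)
    return buffer.getvalue()


tsv = clones_to_tsv(clones)
print(tsv)
```

Sequence-level data would then be recovered by filtering the Rearrangement table on `clone_id`, as described above.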

Some sort of type field seems like it might be a solution. Though, you'd still have to do a check of some kind, but it would be a simpler check.

The way Clone is set up right now seems really geared towards IGH/TRB/TRD data only. Hrm.

schristley · 2020-07-15T20:48:37Z

> Each Cell in the Clone has a cell_id and a list of sequence_ids that link back to the rearrangements TSV - do you think that is sufficient?

I'm still thinking through this. A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells. Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?
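
As a sketch of the "Cell points to its Clone" direction, assuming each Cell record carries a single `clone_id` field: the membership list of a Clone can be rebuilt on demand rather than stored. The helper `cells_per_clone` and the records are hypothetical.

```python
from collections import defaultdict

# Placeholder cell records, each pointing at exactly one clone.
cells = [
    {"cell_id": "cell1", "clone_id": "clone1"},
    {"cell_id": "cell2", "clone_id": "clone1"},
    {"cell_id": "cell3", "clone_id": "clone2"},
]


def cells_per_clone(cell_records):
    """Invert the Cell -> Clone pointers into Clone -> list of cell IDs."""
    members = defaultdict(list)
    for cell in cell_records:
        members[cell["clone_id"]].append(cell["cell_id"])
    return dict(members)


membership = cells_per_clone(cells)
print(membership)
```

This only works cleanly while a Cell belongs to exactly one Clone; multiple clonal assignments per Cell would need a richer reference.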

scharch · 2020-07-16T16:03:37Z

> And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this

I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

> A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.

Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

> Furthermore, if you gather up all the rearrangements for all those Cells, is that the same list of rearrangements in Clone's sequences array?

Sort of? Not the way it's currently set up with only one chain, but this should be correct under the extension models we are discussing.

schristley · 2020-07-16T16:57:54Z

> > And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this
>
> I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.

An optional extended data structure like you suggested above for providing additional chains.

schristley · 2020-07-16T17:03:33Z

> > A single Clone object is supposed to represent the whole clonal lineage, all cells and corresponding rearrangements? If that's the case, it's likely better for each Cell to point to its Clone versus having Clone contain a list of cells.
>
> Yes but why treat Cells differently than Rearrangements here? Biologically, the Clone is comprised of Cells, not Rearrangements...

"better" only in a data structure sense. As a Cell belongs to one Clone, it could be represented with a single field clone_id, while a Clone containing many Cells would require an array of cell_ids.

bcorrie · 2022-02-09T18:43:37Z

OK, we are currently implementing 10X data loading for rearrangements/clones/cells/expression.

We can currently load everything in principle and practice, based on the current AIRR Spec.

The problem arises when you try to map a specific tool chain (e.g. 10X cellranger) to the spec, in particular one that generates all of the data types as part of one processing run - when everything blows up.

I think this issue is the crux of the matter - and we appear to have been avoiding it since July 2020 8-)

In the 10X case you get:

A single clone_id has multiple chains. I have seen two and three chains thus far for a single clone_id
Our current Clone object is focused on a single chain only
Pretty well all fields in the Clone object that describe the clone (VDJ calls, junction, alignment, sequences) need to be different for each of the chains in the clone (not just the VDJ calls as discussed above). I count 18 fields based on a quick count.

So we can't really load 10X data in a particularly logical or coherent fashion when you try to do all of Rearrangements/Clones/Cells in a single repository. I am pretty sure this would also mean that you couldn't represent said data in a set of files on disk using a Manifest to tie them together...

This seems like something that should be pretty high on the priority list if we really want to claim that we have a working Rearrangement/Clone/Cell spec 8-)

scharch · 2022-02-23T05:31:25Z

> The solution, which isn't perfect, is to introduce a second identifier, in this case data_processing_id, which splits the N-N relationships into M number of 1-N relationships, M here being the number of different data processings. So how does that work concretely? Well, every object that is the "output" of a data processing (like Clone) has a data_processing_id. Thus data_processing_id can be used to partition the whole Clone table into subsets. We've talked about this with rearrangements: imagine processing with IgBlast and Mixcr as two separate data processings; they can be stored together yet separated by their different data_processing_ids.
>
> So the Cell could have a list of clones, which is a compound identifier (clone_id, data_processing_id)
> clones: [{clone_id:123, data_processing_id:456}, {clone_id:abc1, data_processing_id:567}]

I see. This makes sense to me, you'd just update the Cell record(s) in the repository to add a new compound identifier to the list. Seems reasonable enough...
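
A minimal Python sketch of the compound identifier idea, assuming a Cell record with a `clones` list as written above. The two helper functions are hypothetical, and the identifiers are the placeholder values from the example, stored here as strings.

```python
# Placeholder cell record using the compound (clone_id, data_processing_id)
# references from the example above.
cell = {
    "cell_id": "cell1",
    "clones": [
        {"clone_id": "123", "data_processing_id": "456"},
        {"clone_id": "abc1", "data_processing_id": "567"},
    ],
}


def clone_for_processing(cell_record, data_processing_id):
    """Return the clone_id assigned by one data processing, or None."""
    for ref in cell_record.get("clones", []):
        if ref["data_processing_id"] == data_processing_id:
            return ref["clone_id"]
    return None


def add_clone_assignment(cell_record, clone_id, data_processing_id):
    """Record a new clonal assignment without touching existing ones."""
    cell_record.setdefault("clones", []).append(
        {"clone_id": clone_id, "data_processing_id": data_processing_id}
    )


print(clone_for_processing(cell, "456"))
add_clone_assignment(cell, "xyz9", "678")
print(clone_for_processing(cell, "678"))
```

Within one `data_processing_id` the Cell-to-Clone relationship stays 1-N, which is the partitioning described in the quoted text.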

schristley · 2023-02-04T21:42:36Z

@scharch @javh It's been a while since the last discussion burst. Do you think we've enough concrete ideas to adjust the draft objects?

There's going to be a large set of single-cell studies coming down the pipe and going into the ADC, it would be good to implement some of these ideas and see how they work.

bcorrie · 2023-02-06T18:31:24Z

FYI we have loaded one 10X single cell study into the ADC already (with rearrangements, clones, cells, and GEX), and our clone compromise was to choose one of the chains for clone_id, create a single clone, and store consensus clone data (VDJ+Junction) from one chain. You can find the rearrangement for the other chain using the clone_id in the Rearrangement collection, but due to limits in our clone object we store only a single VDJ/Junction.

When you load the data into a repository you choose which chain you want to focus on.

Far from ideal but seemed like a decent compromise.

bcorrie · 2024-02-06T20:16:12Z

This one seems like a big one - I think we need to decide as to whether this gets fixed as part of v2.0 or is noted as a weakness/gap in the standard that is not currently addressed.

scharch · 2024-02-06T20:24:31Z

Yeah this is one of the ones on my personal to-do list...

bcorrie · 2024-02-06T20:34:44Z

We now have 4 single-cell 10X studies in ADC, and each study's Clone data is loaded with a single chain only (as described above), even though the clone in this case is a paired chain clone.

It would be nice if we could fix this (although it means I would need to update a bunch of data) 8-)

schristley · 2024-02-06T20:52:26Z

15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data. The backlog of studies is still steadily growing. I think this is one of the high priority items that is needed if the ADC is to grow beyond just rearrangement data.

scharch · 2024-02-06T21:11:48Z

OK I'll try to put a PR together for discussion on the March call...

bcorrie · 2024-02-07T19:27:57Z

> 15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded Clone data.

Yes, I meant there are 4 10X studies with loaded Clone data - with that data loaded in an "unsatisfactory" way because Clone is not oriented towards paired chains.

scharch · 2024-02-29T15:45:08Z

If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...

schristley · 2024-03-02T16:54:01Z

> If we adapt Clone so that it can contain Cells, do we need a way to connect/partition the Rearrangements within each Cell? This goes beyond heavy/light/alpha/beta. Example: T cell clone with two TCRa chains, maybe even using the same V gene...

@scharch If I understand what you mean, this is already there with cell_id in the rearrangement object. That lets you pull out the rearrangements for a specific Cell.

scharch · 2024-03-03T01:10:34Z

No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around)?

schristley · 2024-03-03T19:36:15Z

> No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around)?

Sorry, I'm still not understanding. Is the "meaning" of the link to say that those are the "equivalent" chains in two different Cells? If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection? I mean, if two Cells are in the same Clone, the TRB gene should be the same in both Cells. Likewise for the alpha chain. I can see there might be some ambiguity with B cells and SHM.

I guess another way to ask the question is how would you use that link? What problem would it solve for you?
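
For concreteness, a sketch of inferring the link from the calls alone, using the rearrangement IDs from the example. The `v_call` and `j_call` values are arbitrary placeholders chosen so that both alpha chains share a V gene, and `match_chains` is a hypothetical helper.

```python
# Rearrangement IDs from the example above; gene calls are placeholders.
cell1 = [
    {"sequence_id": "TRB123", "v_call": "TRBV7-9", "j_call": "TRBJ2-1"},
    {"sequence_id": "TRA456", "v_call": "TRAV12-1", "j_call": "TRAJ33"},
    {"sequence_id": "TRA789", "v_call": "TRAV12-1", "j_call": "TRAJ42"},
]
cell2 = [
    {"sequence_id": "TRB098", "v_call": "TRBV7-9", "j_call": "TRBJ2-1"},
    {"sequence_id": "TRA765", "v_call": "TRAV12-1", "j_call": "TRAJ42"},
    {"sequence_id": "TRA432", "v_call": "TRAV12-1", "j_call": "TRAJ33"},
]


def match_chains(cell_a, cell_b):
    """Pair each rearrangement in cell_a with the one in cell_b that has
    the same (v_call, j_call); unmatched chains map to None."""
    index = {(r["v_call"], r["j_call"]): r["sequence_id"] for r in cell_b}
    return {r["sequence_id"]: index.get((r["v_call"], r["j_call"]))
            for r in cell_a}


links = match_chains(cell1, cell2)
print(links)
```

This breaks down if two chains in one cell share both calls, in which case the junction would have to be added to the key, and SHM in B cells could make even that ambiguous.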

scharch · 2024-03-03T21:12:24Z

> If that's the case, won't the VDJ calls (plus maybe CDR3) be sufficient to imply this connection?

For the researcher looking at the data? Almost certainly. The question is if we need to make it easy to do by code.

> how would you use that link? What problem would it solve for you?

Dunno. I was asking if it was something worth designing around when I'm trying to figure out an updated Clone schema. If no one has a use case, then that's my answer :)

schristley · 2024-03-04T01:21:06Z

@scharch Even though it might not be in our list of requirements, I'll note that Clone is perfectly amenable to a TSV format if only that pesky sequences array is dealt with. There could be considerable benefit and uptake to the Clone spec if toolchains like Immcantation and Repcalc, which are already processing clone TSV files (I may not be completely correct about that), don't need significant retooling to support AIRR. Bonus points in that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change.

scharch · 2024-03-04T01:44:48Z

> if only that pesky sequences array is dealt with

> that programs that calculate things like gene usage and CDR3 length distributions that run on rearrangement TSVs, could run on Clone TSVs without change

It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.

BUT! I think #769 can solve this, too. There, I am proposing representing inferred naive ancestors as "nonphysical" Rearrangements (or Cells). So if I am understanding correctly what you want, ~~you could just filter for nonphysical==True et voila!~~

Edit: It's probably not that simple. You'd probably have to iterate through Clone objects and extract the naive_ancestor from each. But the point is that it would still be present as a nonphysical rearrangement, and I'd bet it would be relatively straightforward to tweak that a little to help your use case.
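
A sketch of that iteration in Python, under the assumption that each Clone stores its inferred ancestor as an embedded record under a `naive_ancestor` key with a `nonphysical` flag. Both names follow the #769 proposal as described here; the record layout and the helper `naive_ancestors` are hypothetical.

```python
# Placeholder Clone objects; the embedded ancestor layout is an assumption.
clones = [
    {"clone_id": "clone1",
     "naive_ancestor": {"sequence_id": "clone1_naive", "nonphysical": True}},
    {"clone_id": "clone2",
     "naive_ancestor": {"sequence_id": "clone2_naive", "nonphysical": True}},
    {"clone_id": "clone3"},  # no ancestor inferred for this clone
]


def naive_ancestors(clone_records):
    """Yield (clone_id, ancestor record) for clones that have an ancestor."""
    for clone in clone_records:
        ancestor = clone.get("naive_ancestor")
        if ancestor is not None:
            yield clone["clone_id"], ancestor


summary = [(cid, anc["sequence_id"]) for cid, anc in naive_ancestors(clones)]
print(summary)
```

The resulting list is one inferred record per clone, which is close to the compact per-repertoire table discussed above.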

schristley · 2024-03-04T02:12:53Z

> It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire Repertoire. I can see the value in that, but Clone is more geared toward in-depth analysis of a small number of lineages. And, to be frank, it's very B cell biased. Hard to think of a T cell use case that would be worth an entire Clone, but maybe that's your point.

Hmm, probably, for T cells at least this is essentially a collapse of (potentially many) rearrangement records into a single record (with a count). Yeah, for B cells, is it a naive ancestral sequence? Or is it a consensus sequence? In either case you are right, it is a computationally inferred sequence (nonphysical might not be the right descriptor) versus being an observed sequence.

That's even assuming I care about the sequence. I'm likely thinking about it wrong, as a smaller, more compact representation of data in the rearrangements (though still potentially large), while you are thinking about it as actual biology.

javh · 2024-03-04T03:09:41Z

I think it might matter where we in #768. If Cell is supposed to be just cell metadata, and not a container for sequence/expression/etc data, then maybe the same should be true of Clone? Meaning, instead of refs to Rearrangement/Cell in clone, we rely on clone_id in Rearrangement/Cell to link members of the same clone.

scharch · 2024-03-04T03:14:35Z

> I think it might matter where we in #768. If Cell is supposed to be just cell metadata, and not a container for sequence/expression/etc data, then maybe the same should be true of Clone? Meaning, instead of refs to Rearrangement/Cell in clone, we rely on clone_id in Rearrangement/Cell to link members of the same clone.

@javh we have explicitly rejected this approach for Clone on the thought that (again, for B cells) clonality might be calculated multiple times in different ways which would get really messy this way. Plus, clonality might be recalculated after initial deposit/curation and we want Rearrangement and Cell records to be static.

scharch · 2024-03-04T03:17:48Z

> (nonphysical might not be the right descriptor)

"virtual" was already taken XD
happy for all suggestions

javh · 2024-03-04T18:39:26Z

> @javh we have explicitly rejected this approach for Clone on the thought that (again, for B cells) clonality might be calculated multiple times in different ways which would get really messy this way. Plus, clonality might be recalculated after initial deposit/curation and we want Rearrangement and Cell records to be static.

I think we'll have that issue regardless, because we only have one clone_id field in Rearrangement.

scharch · 2024-03-04T19:01:23Z

@javh #446

bcorrie · 2024-08-12T18:54:38Z

Closes #778

scharch added the reactivity Reactivity label Jan 16, 2020

bcorrie mentioned this issue Feb 11, 2020

Things to do for the MiAIRR v2 release #305

Open

14 tasks

bcorrie mentioned this issue Apr 8, 2020

Uniqueness of _id fields in airr_schema.yaml #246

Closed

scharch mentioned this issue Jul 13, 2020

File format for AIRR Clone object? #421

Closed

schristley modified the milestones: ADC V1.1, AIRR v1.4.0 Jul 14, 2020

bussec mentioned this issue Jul 22, 2020

If I could redesign Repertoire and its buddies... #441

Closed

javh modified the milestones: AIRR v1.4.0, AIRR v2.0.0 Jan 11, 2021

kira-neller mentioned this issue Sep 3, 2021

Clone spec mapping for common tools #543

Closed

bcorrie assigned scharch Feb 6, 2024

bcorrie mentioned this issue Feb 27, 2024

Add Receptor fixes #705

Merged

scharch mentioned this issue Feb 29, 2024

Add a "nonphysical" keyword to Rearrangement and Cell #769

Open

scharch linked a pull request Mar 21, 2024 that will close this issue

Clone-schema-updates #778

Draft

6 tasks

bcorrie linked a pull request Aug 12, 2024 that will close this issue

Clone-schema-updates #778

Draft

6 tasks

javh added Clones Clone, tree and node schema topics and removed reactivity Reactivity labels Sep 9, 2024
