Extend Clone to single-cell context #317
Comments
Should the Clone definition also contain both chains? Right now it seems to support only one. |
@schristley, I think it will have to support |
In a separate call, @bussec and I discussed how to do this flexibly. It would be nice not to be limited to strictly two chains. It also is hard to come up with a terminology that covers both T and B cells. There was also the desire to be able to annotate non-productive chains. Using a dictionary or array object should allow multiple entries. Using a controlled vocabulary, we could use T and B cell specific terms to annotate/tag the chains. At the same time, we should make it easy to access the primary annotations directly. |
This is a rather vexing problem. We've been using "heavy" for IGH, TRB and TRD and "light" for IGK/L, TRA and TRG, which is wrong. Maybe |
I heard a suggestion like "d-containing chain" and "not-d" but there's the concern it's not very robust. My question would be, do we have to have the same name? Can we call them heavy_chain, light_chain, alpha_chain, beta_chain, etc., with a controlled vocabulary specific to cell and chain type? Sure, tools would have to handle them specifically, but wouldn't they kinda have to do that anyways, like tools would want to know regardless if it was IGH versus TRB? |
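To make the "controlled vocabulary specific to cell and chain type" idea concrete, here is a minimal sketch of a locus-to-term lookup. The term names follow the ones floated in the comment above; `gamma_chain` and `delta_chain` are assumed to fill in the "etc.", and none of these terms are part of the AIRR schema.

```python
# Hypothetical controlled vocabulary: locus -> chain-type term.
# These term names are illustrative only and are not defined by the AIRR schema.
CHAIN_TERMS = {
    "IGH": "heavy_chain",
    "IGK": "light_chain",
    "IGL": "light_chain",
    "TRA": "alpha_chain",
    "TRB": "beta_chain",
    "TRG": "gamma_chain",  # assumed name, covers the "etc." above
    "TRD": "delta_chain",  # assumed name, covers the "etc." above
}


def chain_term(locus: str) -> str:
    """Return the chain-type term for a locus; raise ValueError if unknown."""
    try:
        return CHAIN_TERMS[locus.upper()]
    except KeyError:
        raise ValueError(f"unknown locus: {locus!r}") from None
```

A tool would still branch on the term, which is the point made above: it needs to know IGH from TRB either way.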
@schristley I was just coming here to suggest essentially the same thing. It'll still get complicated, though: if each chain is a dict with keys something like
Each |
This is hard to use (have to check every object for field presence before fetching data), set required fields for (none or all are required?), and convert to a TSV (lots of missing data). But, it would be more explicit and support dual BCR+TCR expressing cells if you believe in such things: |
@javh are we really trying to support conversion from a clones.json file to TSV? I have so many questions about how that would work even aside from this. Anyway, I think that having a
But probably even better would be something like
|
I still need to think through the Cell-Clone relationship, but focussing purely on Clone right now, we could still have explicit field names, but with generic names (chain_1, chain_2, primary_chain, secondary_chain, long_chain, short_chain). Actually, as a matter of fact, maybe keep the exact same Clone fields we have right now (v_call, j_call, etc.) but just add new fields for the second chain. And we require that the main fields be the heavy/long chain, while the second chain is the other. So something like this
This supports the main idea of two (productive) chains directly, with little ambiguity about what's what. Tools which don't "think" about this would just use the current Clone object as is. We could then have an optional dictionary/array where additional chains can be enumerated. |
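One possible reading of this flat layout, sketched in Python. Only `clone_id`, `v_call` and `j_call` come from the discussion; the `_2` suffix and the `additional_chains` key are hypothetical names picked for illustration, and the gene calls are example values.

```python
def make_clone(clone_id, primary, secondary=None, extra=None):
    """Build a flat Clone-like dict.

    primary and secondary are dicts with 'v_call' and 'j_call'.
    The '_2' suffix and 'additional_chains' are hypothetical names.
    """
    return {
        "clone_id": clone_id,
        # Main fields hold the heavy/long chain, as in the current Clone object.
        "v_call": primary["v_call"],
        "j_call": primary["j_call"],
        # New fields for the second chain; None when it is absent.
        "v_call_2": secondary["v_call"] if secondary else None,
        "j_call_2": secondary["j_call"] if secondary else None,
        # Optional list for any further chains (e.g. non-productive ones).
        "additional_chains": list(extra or []),
    }


clone = make_clone(
    "c1",
    {"v_call": "IGHV1-2*02", "j_call": "IGHJ4*02"},
    {"v_call": "IGKV1-5*03", "j_call": "IGKJ1*01"},
)
```

A chain-unaware tool would keep reading `clone["v_call"]` and `clone["j_call"]` and never look at the other keys.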
I don't know. Probably only if a need arises. Though, naively, it looks trivial to my eye. You use Some sort of The way |
I'm still thinking through this. A single |
I think this could work, but the way you've sketched it out, it's hard to see how we'd account for non-productive rearrangements. Maybe that's rare enough or unimportant enough that it doesn't matter, but I typically bring them along and use them as additional evidence when doing clonality calculations.
Yes but why treat
Sort of? Not the way it's currently set up with only one chain, but this should be correct under the extension models we are discussing. |
An optional extended data structure like you suggested above for providing additional chains. |
"better" only in a data structure sense. As a |
OK, we are currently implementing 10X data loading for rearrangements/clones/cells/expression. We can currently load everything in principle and in practice, based on the current AIRR Spec. The problem arises when you try to map a specific tool chain (e.g. 10X cellranger) to the spec, in particular one that generates all of the data types as part of one processing run - when everything blows up. I think this issue is the crux of the matter - and we appear to have been avoiding it since July 2020 8-) In the 10X case you get:
So we can't really load 10X data in a particularly logical or coherent fashion when you try to do all of Rearrangements/Clones/Cells in a single repository. I am pretty sure this would also mean that you couldn't represent said data in a set of files on disk using a This seems like something that should be pretty high on the priority list if we really want to claim that we have a working Rearrangement/Clone/Cell spec 8-) |
I see. This makes sense to me, you'd just update the |
@scharch @javh It's been a while since the last discussion burst. Do you think we have enough concrete ideas to adjust the draft objects? There's going to be a large set of single-cell studies coming down the pipe and going into the ADC, so it would be good to implement some of these ideas and see how they work. |
FYI we have loaded one 10X single cell study into the ADC already (with rearrangements, clones, cells, and GEX), and our clone compromise was to choose one of the chains for clone_id, create a single clone, and store consensus clone data (VDJ+Junction) from one chain. You can find the rearrangement for the other chain using the clone_id in the Rearrangement collection, but due to limits in our clone object we store only a single VDJ/Junction. When you load the data into a repository you choose which chain you want to focus on. Far from ideal but seemed like a decent compromise. |
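A rough sketch of what that lookup could look like on in-memory records. `sequence_id`, `clone_id` and `locus` are used as Rearrangement fields; the `locus` key on the clone record is an assumption standing in for "the chain you chose to focus on at load time", and the records are toy data.

```python
# Toy Rearrangement records; all values are made up for illustration.
rearrangements = [
    {"sequence_id": "r1", "clone_id": "c1", "locus": "TRB"},
    {"sequence_id": "r2", "clone_id": "c1", "locus": "TRA"},
    {"sequence_id": "r3", "clone_id": "c2", "locus": "TRB"},
]

# Hypothetical clone record: 'locus' marks the chain chosen at load time.
clone = {"clone_id": "c1", "locus": "TRB"}


def other_chain_rearrangements(clone, rearrangements):
    """Return rearrangements that share the clone_id but sit on another locus."""
    return [
        r
        for r in rearrangements
        if r["clone_id"] == clone["clone_id"] and r["locus"] != clone["locus"]
    ]


partners = other_chain_rearrangements(clone, rearrangements)
```

With the toy data, `partners` holds only the TRA record `r2`. Its VDJ/junction is reachable this way but is not stored on the clone itself, which is the limitation described above.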
This one seems like a big one - I think we need to decide as to whether this gets fixed as part of v2.0 or is noted as a weakness/gap in the standard that is not currently addressed. |
Yeah this is one of the ones on my personal to-do list... |
We now have 4 single-cell 10X studies in ADC, and each of the study's It would be nice if we could fix this (although it means I would need to update a bunch of data) 8-) |
15 single-cell 10X studies in ADC actually, though the studies in the VDJServer repository have not loaded |
OK I'll try to put a PR together for discussion on the March call... |
Yes, I meant there are 4 10X studies with loaded |
If we adapt |
@scharch If I understand what you mean, this is already there with |
No, I mean Cell1 has rearrangements TRB123, TRA456, and TRA789. Cell2 has rearrangements TRB098, TRA765, and TRA432. Do we need to be able to link TRA456 as corresponding to TRA432 vs TRA765 (or even TRB098, though that's easier to code around). |
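To show where the ambiguity sits, here is a small sketch that enumerates the possible cross-cell links for the example above. Reading the locus from the first three characters of the ID only works for these toy IDs and is not a general rule.

```python
from itertools import product

# The two cells from the example above.
cells = {
    "Cell1": ["TRB123", "TRA456", "TRA789"],
    "Cell2": ["TRB098", "TRA765", "TRA432"],
}


def by_locus(ids):
    """Group toy rearrangement IDs by their three-letter locus prefix."""
    groups = {}
    for rid in ids:
        groups.setdefault(rid[:3], []).append(rid)
    return groups


def candidate_links(cell_a, cell_b):
    """For each shared locus, list every possible pairing across two cells."""
    a, b = by_locus(cell_a), by_locus(cell_b)
    return {locus: list(product(a[locus], b[locus])) for locus in a if locus in b}


links = candidate_links(cells["Cell1"], cells["Cell2"])
```

Here `links["TRB"]` has a single pairing, while `links["TRA"]` has four candidates. Picking among those four is the information an explicit link would have to carry.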
Sorry, I'm still not understanding. Is the "meaning" of the link to say that those are the "equivalent" chains in two different I guess another way to ask the question is how would you use that link? What problem would it solve for you? |
For the researcher looking at the data? Almost certainly. The question is if we need to make it easy to do by code.
Dunno. I was asking if it was something worth designing around when I'm trying to figure out an updated |
@scharch Even though it might not be in our list of requirements, I'll note that |
It seems to me like you are imagining something entirely different, more a list of inferred naive ancestors across an entire BUT! I think #769 can solve this, too. There, I am proposing representing inferred naive ancestors as "nonphysical" Edit: It's probably not that simple. You'd probably have to iterate through |
Hmm, probably, for T cells at least this is essentially a collapse of (potentially many) rearrangements records into a single record (with a count). Yeah, for B cells, is it a naive ancestral sequence? Or is it a consensus sequence? In either case you are right, it is a computationally inferred sequence ( That's even assuming I care about the sequence. I'm likely thinking about it wrong, as a smaller, more compact representation of data in the rearrangements (though still potentially large), while you are thinking about it as actual biology. |
I think it might matter where we land in #768. If |
@javh we have explicitly rejected this approach for |
"virtual" was already taken XD |
I think we'll have that issue regardless, because we only have one |
Closes #778 |
Starting to think about this in the context of generating a lot of 10x VDJ data... it seems we will want to (eventually) have a way for Clones to contain cells (see #273 (comment)), instead of (or maybe in addition to) Rearrangements.
Just a marker for now, need to think more about what kind of representation would make sense...
Issues to be resolved: