Add support for spot-based spatial data and bulk data #583

Open
grst opened this issue Dec 16, 2024 · 0 comments

Collaborator

grst commented Dec 16, 2024

In #354, general support for spatial AIRR data is discussed. The purpose of this issue is to plan what would be required to support spot-based spatial AIRR data. Since we don't have single-cell resolution, and therefore no receptor pairing, this is in many ways similar to bulk AIRR data.

IO and Data Structure

  • Reading AIRR-compliant bulk data already works. The cell_id column would need to be repurposed as a spot identifier.
  • Maybe support for widely used formats (is there any off-the-shelf spatial AIRR assay yet?)

The AwkwardArray in adata.obsm["airr"] then looks like

[
    # spot0: 
    [
        {"locus": "TRA", "junction_aa": "CADASGT..."},
        {"locus": "TRB", "junction_aa": "CTFDD..."},
        {"locus": "TRB", "junction_aa": "CTFDD..."},
        {"locus": "TRA", "junction_aa": "..."},
        {"locus": "TRB", "junction_aa": "..."},
        ...
    ],
    # spot1
    [
        {"locus": "IGH", "junction_aa": "CDGFFA..."},
        {"locus": "IGH", "junction_aa": "CDGFFA..."},
        {"locus": "TRB", "junction_aa": "CTFDD..."},
        {"locus": "TRA", "junction_aa": "..."},
        {"locus": "TRB", "junction_aa": "..."},
        ...
    ],
    # spot2: no chains
    [],
]
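
To illustrate how flat AIRR rows could end up in this nested layout, here is a minimal pure-Python sketch. It assumes the rows are available as dicts with a `cell_id` key that holds the spot identifier; the function name `group_chains_by_spot` and the toy sequences are hypothetical, and the actual reader would build an AwkwardArray rather than plain lists.

```python
from collections import defaultdict


def group_chains_by_spot(rows, spot_ids):
    """Nest flat AIRR rows into one list of chains per spot.

    `rows` are dicts with a `cell_id` key (used here as the spot identifier).
    `spot_ids` lists all spots, so spots without chains get an empty list.
    """
    by_spot = defaultdict(list)
    for row in rows:
        # Keep every field except the spot identifier itself.
        chain = {k: v for k, v in row.items() if k != "cell_id"}
        by_spot[row["cell_id"]].append(chain)
    return [by_spot.get(spot, []) for spot in spot_ids]


# Toy rows with made-up placeholder sequences.
rows = [
    {"cell_id": "spot0", "locus": "TRA", "junction_aa": "CAAA"},
    {"cell_id": "spot0", "locus": "TRB", "junction_aa": "CBBB"},
    {"cell_id": "spot1", "locus": "IGH", "junction_aa": "CHHH"},
]
nested = group_chains_by_spot(rows, ["spot0", "spot1", "spot2"])
```

Passing the full list of spot identifiers keeps the result aligned with the rows of `adata`, including spots that have no chains.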

Chain indices

pp.index_chains needs to be adapted, but this should be straightforward. Instead of selecting only two pairs of chains per cell and flagging cells as multi-chain, simply create lists of VJ and VDJ chains.

adata.obsm["chain_indices"] would then look like

[
    # spot0:
    {"VJ": [0, 3, ...], "VDJ": [1, 2, 4, ...]}, 
    # spot1:
    {"VJ": [3, ...], "VDJ": [0, 1, 2, 4, ...]},
    # spot2:
    {"VJ": [], "VDJ": []}
]
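
A minimal sketch of the adapted indexing, assuming chains are plain dicts with a `locus` field and that loci are split into VJ (TRA, TRG, IGK, IGL) and VDJ (TRB, TRD, IGH) arms. The function name `index_chains_unpaired` is hypothetical and not part of the current API.

```python
# Loci grouped by receptor arm.
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def index_chains_unpaired(spot_chains):
    """Return, per spot, the indices of all VJ and all VDJ chains.

    No chains are dropped and nothing is flagged as multi-chain.
    Chains with an unknown or missing locus are skipped.
    """
    result = []
    for chains in spot_chains:
        entry = {"VJ": [], "VDJ": []}
        for i, chain in enumerate(chains):
            locus = chain.get("locus")
            if locus in VJ_LOCI:
                entry["VJ"].append(i)
            elif locus in VDJ_LOCI:
                entry["VDJ"].append(i)
        result.append(entry)
    return result


# Same loci order as the first five chains of spot0 and spot1 above.
spots = [
    [{"locus": l} for l in ["TRA", "TRB", "TRB", "TRA", "TRB"]],
    [{"locus": l} for l in ["IGH", "IGH", "TRB", "TRA", "TRB"]],
    [],
]
chain_indices = index_chains_unpaired(spots)
```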

Quality control

tl.chain_qc doesn't make any sense for unpaired data. Chains with missing information still need to be removed. Maybe some additional QC metrics could be useful, such as the correlation of receptor chains with T cell fractions from deconvolution.
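
As a rough sketch of both ideas, the snippet below drops chains with missing fields and correlates the number of remaining chains per spot with a per-spot T cell fraction. Which fields count as required, the helper names, and the toy values are all assumptions for illustration; the fractions would come from a deconvolution tool in practice.

```python
import math


def drop_incomplete_chains(spot_chains, required=("locus", "junction_aa")):
    """Remove chains where any required field is missing or empty."""
    return [
        [c for c in chains if all(c.get(field) for field in required)]
        for chains in spot_chains
    ]


def pearson(x, y):
    """Pearson correlation of two equally long sequences (nan if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return math.nan
    return cov / (sx * sy)


# Toy data: made-up sequences and made-up T cell fractions per spot.
spots = [
    [{"locus": "TRA", "junction_aa": "CAAA"}, {"locus": "TRB", "junction_aa": ""}],
    [{"locus": "TRB", "junction_aa": "CBBB"}],
    [],
]
cleaned = drop_incomplete_chains(spots)
chains_per_spot = [len(chains) for chains in cleaned]
t_cell_fraction = [0.2, 0.2, 0.0]
r = pearson(chains_per_spot, t_cell_fraction)
```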

Clonotype definition

pp.ir_dist() should be straightforward to adapt. It simply needs to take into account all sequences instead of only primary/secondary chain.
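
If distances are computed between unique sequences per receptor arm, taking all chains into account mainly changes how those sequences are collected. A minimal sketch, assuming chains are plain dicts and using a hypothetical helper name:

```python
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def unique_sequences_by_arm(spot_chains, key="junction_aa"):
    """Collect the unique sequences of ALL chains, split by receptor arm."""
    seqs = {"VJ": set(), "VDJ": set()}
    for chains in spot_chains:
        for chain in chains:
            seq = chain.get(key)
            if not seq:
                continue  # skip chains without a sequence
            locus = chain.get("locus")
            if locus in VJ_LOCI:
                seqs["VJ"].add(seq)
            elif locus in VDJ_LOCI:
                seqs["VDJ"].add(seq)
    return {arm: sorted(values) for arm, values in seqs.items()}


# Toy spots with made-up placeholder sequences.
spots = [
    [
        {"locus": "TRA", "junction_aa": "CAAA"},
        {"locus": "TRB", "junction_aa": "CBBB"},
        {"locus": "TRB", "junction_aa": "CBBB"},
    ],
    [{"locus": "IGH", "junction_aa": "CHHH"}],
    [],
]
unique = unique_sequences_by_arm(spots)
```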

tl.define_clonotypes/tl.define_clonotype_clusters would require substantial work. There are two ways we could imagine "clonotype clusters" in spatial data.

(1) We still assign clonotype labels to individual receptor chains. Then each spot would have multiple clonotype labels, where each one has a count. This could be represented as a sparse count matrix, potentially as a MuData layer. Identifying the clonotypes would work very similarly to the current single-cell implementation, but simpler, because we don't have to deal with chain pairing and dual TCRs.
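
A minimal sketch of option (1), assuming for simplicity that a clonotype is defined by an identical (locus, junction_aa) pair. A real implementation would use the distance-based definition instead, and the dict-of-counts per spot below only stands in for a proper sparse spot x clonotype matrix. The function name is hypothetical.

```python
from collections import Counter


def clonotype_counts(spot_chains):
    """Label each chain with a clonotype and count labels per spot.

    Returns the mapping from (locus, junction_aa) to an integer label and,
    per spot, a sparse {label: count} dict.
    """
    labels = {}
    counts = []
    for chains in spot_chains:
        spot_counter = Counter()
        for chain in chains:
            key = (chain["locus"], chain["junction_aa"])
            # Assign the next free integer to a clonotype seen for the first time.
            label = labels.setdefault(key, len(labels))
            spot_counter[label] += 1
        counts.append(dict(spot_counter))
    return labels, counts


# Toy spots with made-up placeholder sequences.
spots = [
    [
        {"locus": "TRB", "junction_aa": "CBBB"},
        {"locus": "TRB", "junction_aa": "CBBB"},
        {"locus": "TRA", "junction_aa": "CAAA"},
    ],
    [{"locus": "TRB", "junction_aa": "CBBB"}],
    [],
]
labels, counts = clonotype_counts(spots)
```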

(2) We consider sequence-based distances between spots, defining the AIRR analogue of "niches". Which metrics make sense here warrants additional discussion, but simple options could be
  • at least one chain matches between spots (where match means distance(chain1, chain2) < threshold)
  • sum of distances between all chains of spots < threshold
  • at least one chain of each type (VJ/VDJ) matches between spots
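
The first and third metric can be sketched as follows. This assumes Hamming distance on junction_aa (with sequences of different length never matching) and that only chains of the same locus are compared; both are illustrative choices, not a decision on the metric, and the function names are hypothetical.

```python
import math

VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def hamming(a, b):
    """Hamming distance; infinite for sequences of different length."""
    if len(a) != len(b):
        return math.inf
    return sum(x != y for x, y in zip(a, b))


def any_chain_matches(spot_a, spot_b, threshold, loci=None):
    """True if at least one pair of same-locus chains is closer than threshold.

    If `loci` is given, only chains from those loci are considered.
    """
    for ca in spot_a:
        if loci is not None and ca["locus"] not in loci:
            continue
        for cb in spot_b:
            if ca["locus"] != cb["locus"]:
                continue
            if hamming(ca["junction_aa"], cb["junction_aa"]) < threshold:
                return True
    return False


def each_arm_matches(spot_a, spot_b, threshold):
    """True if at least one VJ chain and at least one VDJ chain match."""
    return any_chain_matches(
        spot_a, spot_b, threshold, VJ_LOCI
    ) and any_chain_matches(spot_a, spot_b, threshold, VDJ_LOCI)


# Toy spots with made-up placeholder sequences.
spot_a = [
    {"locus": "TRA", "junction_aa": "CAAA"},
    {"locus": "TRB", "junction_aa": "CBBB"},
]
spot_b = [{"locus": "TRB", "junction_aa": "CBBC"}]
match_any = any_chain_matches(spot_a, spot_b, threshold=2)
match_arms = each_arm_matches(spot_a, spot_b, threshold=2)
```

Evaluating such a predicate for all pairs of spots would give the spot x spot relation that the "niche" definition builds on.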

Probably we'd want both. The resulting clonotype labels can be easily visualized on the spatial image and can be used as a variable for spatial algorithms.

Clonotype networks

For (1) described above, the current visualization of clonotype networks could still work well.

For (2), we have a spot x spot distance matrix that we can use to make network plots. I am not sure if the single-cell network plots with separate clonotype components still make sense. Other visualizations like UMAP etc could be explored.

Clonal expansion

tl.clonal_expansion should work. However, it would measure whether a clonotype occurs in multiple spots, not whether the same chain occurs multiple times within the same spot. Maybe different definitions of "expanded" could be explored.
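
The two notions of "expanded" can be contrasted with a small sketch. It assumes per-spot clonotype counts are available as {label: count} dicts; the labels, the `min_count` cutoff, and the function names are hypothetical.

```python
from collections import Counter


def spots_per_clonotype(counts):
    """Across-spot view: in how many spots does each clonotype occur?"""
    n_spots = Counter()
    for spot in counts:
        for label in spot:
            n_spots[label] += 1
    return dict(n_spots)


def expanded_within_spot(counts, min_count=2):
    """Within-spot view: clonotypes seen at least `min_count` times in a spot."""
    return [
        sorted(label for label, c in spot.items() if c >= min_count)
        for spot in counts
    ]


# Toy counts: ct0 occurs twice in spot0 and once in spot1.
counts = [{"ct0": 2, "ct1": 1}, {"ct0": 1}, {}]
across = spots_per_clonotype(counts)
within = expanded_within_spot(counts)
```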

Query reference databases

In the single-cell case, this was just a wrapper around tl.define_clonotypes. Here, it is slightly different, because we want to query a single-cell dataset using bulk/spot data. Each spot would usually get multiple labels because there can be multiple T/B cells in each spot.

Still, the logic is similar to define_clonotypes: based on some distance metric, we find all entries in the reference database that have a match below a certain threshold.
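
A minimal sketch of this matching logic, assuming Hamming distance on junction_aa and same-locus comparison only. The function name, the reference layout, and the toy sequences are made up for illustration and do not reflect the existing query API.

```python
import math


def hamming(a, b):
    """Hamming distance; infinite for sequences of different length."""
    if len(a) != len(b):
        return math.inf
    return sum(x != y for x, y in zip(a, b))


def query_reference(spot_chains, reference, threshold):
    """For each spot, return the indices of all matching reference entries.

    A reference entry matches a spot if any chain of the spot has the same
    locus and a distance below `threshold`. A spot can get several hits.
    """
    hits = []
    for chains in spot_chains:
        spot_hits = set()
        for chain in chains:
            for ref_idx, ref in enumerate(reference):
                if chain["locus"] != ref["locus"]:
                    continue
                if hamming(chain["junction_aa"], ref["junction_aa"]) < threshold:
                    spot_hits.add(ref_idx)
        hits.append(sorted(spot_hits))
    return hits


# Toy spots and reference entries with made-up placeholder sequences.
spots = [
    [
        {"locus": "TRB", "junction_aa": "CBBB"},
        {"locus": "TRA", "junction_aa": "CAAA"},
    ],
    [{"locus": "TRB", "junction_aa": "CXXX"}],
    [],
]
reference = [
    {"locus": "TRB", "junction_aa": "CBBC"},
    {"locus": "TRA", "junction_aa": "CAAA"},
    {"locus": "IGH", "junction_aa": "CHHH"},
]
hits = query_reference(spots, reference, threshold=2)
```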

Diversity metrics

tbd

Gene usage

tbd

Comparing repertoires

tbd

Comparing with transcriptomics data

Clonotype modularity etc. -> tbd

CC @FFinotello @felixpetschko
