Add support for spot-based spatial data and bulk data #583

grst · 2024-12-16T19:46:54Z

In #354, general support for spatial AIRR data is discussed. The purpose of this issue is to plan what would be required to support spot-based spatial AIRR data. Since we don't have single-cell resolution, and therefore no receptor pairing, this is in many ways similar to bulk AIRR data.

IO and Data Structure

Reading AIRR-compliant bulk data already works. The cell_id column would need to be abused as a spot identifier.
Maybe support for widely used formats (is there any off-the-shelf spatial AIRR assay yet?)

The AwkwardArray in adata.obsm["airr"] then looks like

[
    # spot0: 
    [
        {"locus": "TRA", "junction_aa": "CADASGT..."},
        {"locus": "TRB", "junction_aa": "CTFDD..."},
        {"locus": "TRB", "junction_aa": "CTFDD..."},
        {"locus": "TRA", "junction_aa": "..."},
        {"locus": "TRB", "junction_aa": "..."},
        ...
    ],
    # spot1
    [
        {"locus": "IGH", "junction_aa": "CDGFFA..."},
        {"locus": "IGH", "junction_aa": "CDGFFA..."},
        {"locus": "TRB", "junction_aa": "CTFDD..."},
        {"locus": "TRA", "junction_aa": "..."},
        {"locus": "TRB", "junction_aa": "..."},
        ...
    ],
    # spot2: no chains
    [],
]
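A minimal sketch of how flat AIRR rearrangement rows could be grouped into the per-spot layout shown above, treating `cell_id` as the spot identifier. Plain Python lists stand in for the AwkwardArray; the function name `group_chains_by_spot` and the toy sequences are made up for illustration and are not scirpy API.

```python
from collections import defaultdict


def group_chains_by_spot(rows, spot_ids):
    """Group flat AIRR rows into one list of chains per spot.

    `rows` are dicts keyed by AIRR field names, with `cell_id` holding the
    spot identifier. `spot_ids` fixes the output order and ensures that spots
    without any chain still appear as empty lists.
    """
    by_spot = defaultdict(list)
    for row in rows:
        chain = {key: value for key, value in row.items() if key != "cell_id"}
        by_spot[row["cell_id"]].append(chain)
    return [by_spot.get(spot, []) for spot in spot_ids]


# Toy input: sequences are invented placeholders, not real receptors.
rows = [
    {"cell_id": "spot0", "locus": "TRA", "junction_aa": "CAAAF"},
    {"cell_id": "spot0", "locus": "TRB", "junction_aa": "CASSF"},
    {"cell_id": "spot1", "locus": "IGH", "junction_aa": "CARDF"},
    {"cell_id": "spot0", "locus": "TRB", "junction_aa": "CASSF"},
]
airr = group_chains_by_spot(rows, ["spot0", "spot1", "spot2"])
```

Passing the full list of spot identifiers explicitly keeps spots without any receptor chain (here `spot2`) aligned with the rest of the spatial object.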

Chain indices

pp.index_chains needs to be adapted, but this should be straightforward. Instead of selecting only two pairs of chains per cell and flagging cells as multi-chain, simply create lists of VJ and VDJ chains.

adata.obsm["chain_indices"] would then look like

[
    # spot0:
    {"VJ": [0, 3, ...], "VDJ": [1, 2, 4, ...]}, 
    # spot1:
    {"VJ": [3, ...], "VDJ": [0, 1, 2, 4, ...]},
    # spot2:
    {"VJ": [], "VDJ": []}
]
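As a rough illustration of the adapted indexing, the sketch below splits the chains of each spot into VJ and VDJ index lists by locus. It assumes the usual grouping (TRA, TRG, IGK, IGL as VJ; TRB, TRD, IGH as VDJ) and deliberately skips any filtering or sorting that the real `pp.index_chains` performs; `index_chains_per_spot` is a hypothetical name.

```python
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def index_chains_per_spot(airr):
    """Return, for each spot, the positions of its VJ and VDJ chains."""
    indices = []
    for chains in airr:
        entry = {"VJ": [], "VDJ": []}
        for i, chain in enumerate(chains):
            locus = chain.get("locus")
            if locus in VJ_LOCI:
                entry["VJ"].append(i)
            elif locus in VDJ_LOCI:
                entry["VDJ"].append(i)
            # Chains with a missing or unknown locus are left unindexed.
        indices.append(entry)
    return indices


# Same locus layout as the example spots above (sequences omitted).
airr = [
    [{"locus": "TRA"}, {"locus": "TRB"}, {"locus": "TRB"}, {"locus": "TRA"}, {"locus": "TRB"}],
    [{"locus": "IGH"}, {"locus": "IGH"}, {"locus": "TRB"}, {"locus": "TRA"}, {"locus": "TRB"}],
    [],
]
chain_indices = index_chains_per_spot(airr)
```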

Quality control

tl.chain_qc doesn't make any sense for unpaired data. Chains with missing information still need to be removed. Maybe some additional QC metrics could be useful, such as the correlation of receptor chains with T cell fractions from deconvolution.

Clonotype definition

pp.ir_dist() should be straightforward to adapt. It simply needs to take into account all sequences instead of only primary/secondary chain.

tl.define_clonotypes/tl.define_clonotype_clusters would require substantial work. There are two ways we could imagine "clonotype clusters" in spatial data.

(1) We still assign clonotype labels to individual receptor chains. Then each spot would have multiple clonotype labels, where each one has a count. This could be represented as a sparse count matrix, potentially as a MuData layer. Identifying the clonotypes would work very similarly to the current single-cell implementation, but simpler, because we don't have to deal with chain pairing and dual TCRs.

(2) We consider sequence-based distances between spots, defining the AIRR analogue of "niches". Which metrics make sense here warrants further discussion, but simple candidates could be
* at least one chain matches between spots (where match means distance(chain1, chain2) < threshold)
* sum of distances between all chains of spots < threshold
* at least one chain of each type (VJ/VDJ) matches between spots

Probably we'd want both. The resulting clonotype labels can be easily visualized on the spatial image and can be used as a variable for spatial algorithms.
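For option (1), the spot x clonotype count matrix could look roughly like the sketch below. It assumes every chain has already been assigned a clonotype label and stores the counts in a dictionary-of-keys layout as a stand-in for a real sparse matrix; `clonotype_count_matrix` and the `ct*` labels are hypothetical.

```python
from collections import Counter


def clonotype_count_matrix(labels_per_spot):
    """Build a sparse spot x clonotype count matrix.

    Returns the column order (sorted clonotype labels) and a dict mapping
    (spot_index, clonotype_index) to the number of chains with that label.
    Zero entries are not stored.
    """
    clonotypes = sorted({label for labels in labels_per_spot for label in labels})
    column = {label: j for j, label in enumerate(clonotypes)}
    counts = {}
    for i, labels in enumerate(labels_per_spot):
        for label, n in Counter(labels).items():
            counts[(i, column[label])] = n
    return clonotypes, counts


# One list of per-chain clonotype labels per spot; the last spot has no chains.
labels_per_spot = [["ct0", "ct1", "ct1"], ["ct1", "ct2"], []]
clonotypes, counts = clonotype_count_matrix(labels_per_spot)
```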

Clonotype networks

For (1) described above, the current visualization of clonotype networks could still work well.

For (2), we have a spot x spot distance matrix that we can use to make network plots. I am not sure if the single-cell network plots with separate clonotype components still make sense. Other visualizations like UMAP etc could be explored.

Clonal expansion

tl.clonal_expansion should work, but it would count whether a clonotype occurs in multiple spots, not whether the same chain occurs multiple times within a spot. Maybe different definitions of "expanded" could be explored.
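To make the two notions of "expanded" concrete, here is a small sketch that contrasts the number of spots a clonotype occurs in with the largest number of copies found within a single spot. It assumes per-chain clonotype labels are available for each spot; both function names are hypothetical.

```python
from collections import Counter


def spots_per_clonotype(labels_per_spot):
    """Number of spots in which each clonotype occurs at least once."""
    n_spots = Counter()
    for labels in labels_per_spot:
        n_spots.update(set(labels))
    return n_spots


def max_copies_within_spot(labels_per_spot):
    """Largest number of chains per clonotype found in any single spot."""
    best = Counter()
    for labels in labels_per_spot:
        for label, k in Counter(labels).items():
            best[label] = max(best[label], k)
    return best


# ct0 is abundant in one spot, ct1 is spread thinly over three spots.
labels_per_spot = [["ct0", "ct0", "ct0", "ct1"], ["ct1"], ["ct1"]]
across_spots = spots_per_clonotype(labels_per_spot)
within_spot = max_copies_within_spot(labels_per_spot)
```

In this toy example `ct1` is the more "expanded" clonotype across spots, while `ct0` is the more "expanded" one within a spot, so the two definitions can rank clonotypes differently.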

Query reference databases

In the single-cell case, this was just a wrapper around tl.define_clonotypes. Here, it is slightly different, because we want to query a single-cell dataset using bulk/spot data. Each spot would usually get multiple labels because there can be multiple T/B cells in each spot.

Still, the logic is similar to define_clonotypes: based on some distance metric, we find all entries in the reference database that have a match below a certain threshold.
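The query logic could be sketched like this, with each spot collecting the labels of all reference entries that are close to any of its chains. Hamming distance (infinite for unequal lengths) is used purely as a simple stand-in metric, and `query_reference`, the reference entries, and the `antigen_*` labels are invented for illustration.

```python
def hamming(a, b):
    """Hamming distance; sequences of different length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def query_reference(spot_chains, reference, threshold):
    """Return, per spot, the sorted labels of all matching reference entries.

    `spot_chains` holds one list of junction sequences per spot and
    `reference` is a list of (sequence, label) pairs. A spot can receive
    several labels because it may contain several T/B cells.
    """
    hits = []
    for chains in spot_chains:
        labels = set()
        for seq in chains:
            for ref_seq, label in reference:
                if hamming(seq, ref_seq) < threshold:
                    labels.add(label)
        hits.append(sorted(labels))
    return hits


# Invented reference entries and spot sequences.
reference = [("CASSF", "antigen_A"), ("CASTF", "antigen_B"), ("CAAAAF", "antigen_C")]
spot_chains = [["CASSF", "CAAAAF"], ["CASQF"], []]
hits = query_reference(spot_chains, reference, threshold=2)
```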

Diversity metrics

tbd

Gene usage

tbd

Comparing repertoires

tbd

Comparing with transcriptomics data

Clonotype modularity etc. -> tbd

CC @FFinotello @felixpetschko

grst mentioned this issue Dec 16, 2024

spatial airr data #354

Open