TCRemP is a package developed to perform T-cell receptor (TCR) sequence embedding. TCR sequences encode the antigen
specificity of T-cells, and their repertoire, obtained using the AIRR-Seq family of technologies, serves as a blueprint of the individual's adaptive immune system.
In general, it is very challenging to define and measure similarity between TCR sequences that will properly reflect
closeness in antigen recognition profiles. Defining a proper language model for TCRs is also a hard task due to their
immense diversity both in terms of primary sequence organization and in terms of their protein structure.
Our pipeline follows an agnostic approach and vectorizes each TCR based on its similarity to a set of ad hoc chosen
TCR "probes". Thus we follow a prototype-based approach and utilize commonly encountered TCRs either sampled from a
probabilistic V(D)J rearrangement model (see Murugan et al. 2012) or a pool of real-world TCR repertoires to construct a
coordinate system for TCR embedding.
The workflow is the following:
- The TCRemP pipeline starts with a selection of `k` prototype TCR alpha and beta sequences, then it computes the distances from each of the `n` input TCR alpha-beta pairs to the `2 * k` prototypes for the V, J and CDR3 regions, resulting in `6 * k` parameters (or `3 * k` for cases when only one of the chains is present).
Distances are computed using local alignment with BLOSUM matrix, as implemented in our mirpy package; we plan to move all computationally-intensive code there.
- The resulting distances are treated as embedding coordinates and are subject to principal component analysis (PCA). One can monitor the information conveyed by each PC and check whether it is related to features such as Variable or Joining genes, CDR3 region length or a certain epitope.
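The first workflow step can be illustrated with a minimal sketch. It is not the TCRemP implementation: a plain edit distance stands in for the BLOSUM-based local alignment, every region is compared as a raw string, and the prototypes, like the function names `edit_distance`, `embed_chain` and `embed_clone`, are hypothetical and chosen only for illustration. The point is the shape of the result: one alpha-beta clone maps to `6 * k` values.

```python
REGIONS = ("v", "j", "cdr3")


def edit_distance(a, b):
    """Levenshtein distance; a stand-in for a BLOSUM-based alignment score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def embed_chain(clonotype, prototypes):
    """Distances from one chain to every prototype for V, J and CDR3: 3 * k values."""
    return [edit_distance(clonotype[r], p[r]) for p in prototypes for r in REGIONS]


def embed_clone(alpha, beta, alpha_prototypes, beta_prototypes):
    """Concatenate alpha and beta distances: 6 * k values per clone."""
    return embed_chain(alpha, alpha_prototypes) + embed_chain(beta, beta_prototypes)


# Toy prototypes (k = 2 per chain), made up for illustration only.
alpha_prototypes = [
    {"v": "TRAV35", "j": "TRAJ49", "cdr3": "CAGHTGNQFYF"},
    {"v": "TRAV12-2", "j": "TRAJ30", "cdr3": "CAVNRDDKIIF"},
]
beta_prototypes = [
    {"v": "TRBV19", "j": "TRBJ2-7", "cdr3": "CASSIRSSYEQYF"},
    {"v": "TRBV11-2", "j": "TRBJ1-2", "cdr3": "CASSWGGGSHYGYTF"},
]

alpha = {"v": "TRAV35", "j": "TRAJ49", "cdr3": "CAGHTGNQFYF"}
beta = {"v": "TRBV11-2", "j": "TRBJ1-2", "cdr3": "CASSWGGGSHYGYTF"}

vector = embed_clone(alpha, beta, alpha_prototypes, beta_prototypes)
print(len(vector))  # 6 * k coordinates for k = 2
```

Stacking such vectors for all `n` clones gives the `n x 6k` matrix that is then passed to PCA.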
N.B. TCRemP is currently in active development; please see below for the list of features, current documentation, and a proof-of-concept example. All encountered bugs can be submitted to the issues section of the @antigenomics repository.
Using TCRemP one can:
- perform an embedding for a set of T-cell clonotypes, defined by TCR’s Variable (V) and Joining (J) gene IDs and complementarity determining region 3 (CDR3, amino acid sequence placed at the V-J junction). The embedding is performed by mapping those features to real vectors using similarities to a set of prototype TCR sequences
- embed a set of clones, pairs of TCR alpha and beta chain clonotypes
- analyze the mapping by performing dimensionality reduction and evaluating principal components (PCs)
- cluster the embeddings using DBSCAN method with parameter selection using knee/elbow method
- visualize T-cell clone and clonotype embeddings using tSNE, coloring the visualization by user-specified clonotype labels, such as antigen specificities
- infer clusters that are significantly enriched in certain labels, e.g. TCR motifs belonging to the CD8+ T-cell subset or specific to an antigen of interest
Planned features:
- [in progress] co-embed samples with VDJdb database to predict TCRs associated with certain antigens, i.e. “annotate” TCR repertoires
- [in progress] perform imputation to correctly handle mixed single-/paired-chain data
- [in progress] implement B-cell receptor (BCR/antibody) prototypes to apply the method to antibody sequencing data
Please cite the tool using the paper:
Yulia Kremlyakova, Elizaveta Vlasova, Daniil Luppov, Mikhail Shugay, TCREMP: a bioinformatic pipeline for efficient embedding of T-cell receptor sequences from immune repertoire and single-cell sequencing data, Journal of Molecular Biology, 2025
(https://doi.org/10.1016/j.jmb.2025.169205)
One can install the software out-of-the-box using pip with Python 3.11:
conda create -n tcremp ipython python=3.11
conda activate tcremp
pip install git+https://github.com/antigenomics/[email protected]
The `0.0.1-publication` tag corresponds to the version used in the publication (TCREMP, JMB, 2025). For the latest version, install via the following command:
pip install git+https://github.com/antigenomics/tcremp
Or, in case of package version problems or other issues, clone the repository manually via git, create corresponding conda environment and install directly from sources:
git clone https://github.com/antigenomics/tcremp.git
cd tcremp
conda create -n tcremp ipython python=3.11
conda activate tcremp
pip install .
If the installation doesn't work on Apple M1-M3 processors, install the required libraries manually.
Check the installation by running:
tcremp-run -h # note that first run may be slow
cd $tcremp_repo # where $tcremp_repo is the path to cloned repository
tcremp-run -i data/example/v_tcrpmhc.txt -c TRA_TRB -o data/example/ -n 10 -x clone_id
Check that there were no errors and observe the results stored in the `data/example` folder. You can then go through the `example.ipynb` notebook to run the analysis and visualize the results. You can proceed with your own datasets by substituting the example data with your own properly formatted clonotype tables.
The input data typically consists of a table containing clonotypes as defined above, either TCR alpha, or beta, or both.
One can additionally tag clonotypes/clones with user-defined ids, e.g. cell barcodes, and labels, e.g. antigen
specificity or phenotype. One can also use a custom clonotype table instead of a pre-built set of prototypes (see `data/example/VDJdb_data_paired_example.csv`).
- V and J gene names should be provided based on IMGT naming, e.g. `TRAV35*03` or `TRBV11-2`. TCRemP will always use the major allele, so, for example, `TRBV11-2` will be transformed into `TRBV11-2*01`
- The table should not contain missing values in any of the columns: V, J and CDR3.
- CDR3 sequences should contain no symbols other than the 20 amino acids
Column name | Description | Required |
---|---|---|
clone_id | clonotype id which will be transferred to the output file and which will be used for paired chain data mapping | optional (required for TRA_TRB mode) |
v_call | TCR V gene ID | required |
j_call | TCR J gene ID | required |
junction_aa | TCR CDR3 amino acid sequence | required |
locus | either alpha or beta | required |
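The input requirements above can be checked before running the pipeline. The sketch below is a hypothetical pre-flight check, not part of TCRemP: the helper names `to_major_allele` and `check_row` are invented here, and it assumes that the major allele is always written as `*01`.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
REQUIRED = ("v_call", "j_call", "junction_aa", "locus")


def to_major_allele(gene):
    """Drop any allele suffix and append *01 (assumed here to be the major allele)."""
    return gene.split("*")[0] + "*01"


def check_row(row):
    """Return a list of problems found in one clonotype row (empty if none)."""
    problems = [f"missing {col}" for col in REQUIRED if not row.get(col)]
    cdr3 = row.get("junction_aa") or ""
    if set(cdr3) - AMINO_ACIDS:
        problems.append("non-amino-acid symbol in junction_aa")
    if row.get("locus") and row["locus"] not in ("alpha", "beta"):
        problems.append("locus must be alpha or beta")
    return problems


good = {"clone_id": "1", "junction_aa": "CASSIRSSYEQYF",
        "v_call": "TRBV19", "j_call": "TRBJ2-7", "locus": "beta"}
bad = {"clone_id": "2", "junction_aa": "CASS_X", "v_call": "",
       "j_call": "TRBJ1-2", "locus": "beta"}

print(check_row(good))
print(check_row(bad))
print(to_major_allele("TRAV35*03"), to_major_allele("TRBV11-2"))
```

Rows that return a non-empty list would need to be fixed or dropped before being passed to `tcremp-run`.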
Either wide with missing values
clone_id | junction_aa | v_call | j_call | locus |
---|---|---|---|---|
1 | CASSIRSSYEQYF | TRBV19 | TRBJ2-7 | beta |
2 | CASSWGGGSHYGYTF | TRBV11-2 | TRBJ1-2 | beta |
A simple flat format
clone_id | junction_aa | v_call | j_call | locus |
---|---|---|---|---|
GACTGCGCATCGTCGG-28 | CAGHTGNQFYF | TRAV35 | TRAJ49 | alpha |
GACTGCGCATCGTCGG-28 | CASSWGGGSHYGYTF | TRBV11-2 | TRBJ1-2 | beta |
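In the flat format, the alpha and beta rows of one clone are linked only by a shared `clone_id`. A minimal sketch of that pairing is shown below; the function `pair_chains` is hypothetical, and dropping clones with a single chain, as well as letting a later row overwrite an earlier one for the same locus, are assumptions of this example rather than documented TCRemP behavior.

```python
from collections import defaultdict


def pair_chains(rows):
    """Group flat-format rows by clone_id into {'alpha': row, 'beta': row} records."""
    clones = defaultdict(dict)
    for row in rows:
        clones[row["clone_id"]][row["locus"]] = row
    # Keep only clones for which both chains are present.
    return {cid: chains for cid, chains in clones.items()
            if {"alpha", "beta"} <= chains.keys()}


rows = [
    {"clone_id": "GACTGCGCATCGTCGG-28", "junction_aa": "CAGHTGNQFYF",
     "v_call": "TRAV35", "j_call": "TRAJ49", "locus": "alpha"},
    {"clone_id": "GACTGCGCATCGTCGG-28", "junction_aa": "CASSWGGGSHYGYTF",
     "v_call": "TRBV11-2", "j_call": "TRBJ1-2", "locus": "beta"},
    {"clone_id": "1", "junction_aa": "CASSIRSSYEQYF",
     "v_call": "TRBV19", "j_call": "TRBJ2-7", "locus": "beta"},
]

paired = pair_chains(rows)
for clone_id, chains in paired.items():
    print(clone_id, chains["alpha"]["junction_aa"], chains["beta"]["junction_aa"])
```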
Run the tool as
tcremp-run --input my_input_data.txt --output my_folder --chain TRA_TRB
The command above will:
- check the input data format and proofread the dataset
- extract TCR alpha and beta clonotypes from `my_input_data.txt`
- calculate distance scores from the clonotypes to the built-in set of `3000` prototypes for each chain
The parameters for running the `tcremp-run` main script are the following:
parameter | short usage | description | available values | required | default value |
---|---|---|---|---|---|
--input | -i | input clonotype table | path to file | yes | - |
--output | -o | pipeline output folder | path to directory | no | tcremp_{inputfilename}/ |
--prefix | -e | prefix name for distance file | str | no | tcremp_{inputfilename}/ |
--index-col | -x | index column where the clonotype IDs are stored | str | no | tcremp_{inputfilename}/ |
--chain | -c | single or paired clonotype chains | TRA, TRB, TRA_TRB | yes | - |
--prototypes_path | -p | path to the custom input prototype table | path to file | no | data/example/v_tcrpmhc.txt |
--n-prototypes | -n | number of prototypes to select for embedding from the supplied prototype table | integer | no | None |
--sample-random-prototypes | -sample-random-p | whether to sample the prototypes randomly or not | bool | no | False |
--n-clonotypes | -nc | number of clonotypes to be selected from input file | integer | no | None |
--sample-random-clonotypes | -sample-random-c | whether to sample the clonotypes randomly or not | bool | no | False |
--species | -s | species of built-in prototypes to be used | HomoSapiens, MusMusculus, MacacaMulatta | no | HomoSapiens |
--random-seed | -r | random seed for random prototype selection | integer | no | None |
--nproc | -np | number of processes to perform the calculation with | integer | no | 1 |
--lower-len-cdr3 | -llen | filter out cdr3 with len <llen | integer | no | 30 |
--higher-len-cdr3 | -hlen | filter out cdr3 with len >hlen | integer | no | 30 |
--metrics | -m | which type of metric to use: similarity or dissimilarity | similarity, dissimilarity | no | dissimilarity |
--save-dists | -d | whether to save the file with evaluated TCRemP distances or not | bool | no | True |
--cluster | -cl | whether to perform the clustering or not | bool | no | True |
--cluster-pc-components | -npc | number of PCA components for distances dimension reduction | integer | no | 50 |
--cluster-min-samples | -ms | min_samples parameter for DBSCAN used in clonotype clustering | integer | no | 3 |
--k-neightbors | -kn | k-th neighbor parameter for Knee estimation | integer | no | 4 |
If you already have a file with calculated TCRemP distances, you can run the clustering step separately to tune it to your data. Run the tool as
tcremp-cluster --input tcremp_distances.tsv --output tcremp_clusters.tsv --components 50 --min_samples 3 --kth_neighbor 4
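The k-th neighbor parameter feeds the knee/elbow estimate of the DBSCAN neighborhood radius. The sketch below shows the general idea only: it sorts the distance from every point to its k-th nearest neighbor and picks the point farthest from the chord joining the ends of that curve. This chord heuristic and the function names are assumptions of this example; the knee detection used inside TCRemP may differ.

```python
import math


def kth_neighbor_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_value(values):
    """Value at the point of the sorted curve farthest from the first-to-last chord."""
    n = len(values)
    y0, y1 = values[0], values[-1]

    def gap(i):
        # Unnormalized perpendicular distance from (i, values[i]) to the chord.
        return abs((y1 - y0) * i - (n - 1) * (values[i] - y0))

    return values[max(range(n), key=gap)]


# Two tight toy clusters and one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]

distances = kth_neighbor_distances(points, k=2)
eps = knee_value(distances)
print(eps)  # candidate DBSCAN radius for this toy data
```

On real embeddings the same curve would be computed on the PCA-reduced coordinates, and the resulting radius passed to DBSCAN together with the `min_samples` value.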
The output TCRemP file will contain the following columns:
- clone_id - assigned identifier to each row of the input table (either transferred from initial data or generated)
- cdr3aa_{alpha/beta} - cdr3aa sequences for alpha/beta chain
- v_{alpha/beta} - v gene for alpha/beta chain
- j_{alpha/beta} - j gene for alpha/beta chain
- {i}_a_v, {i}_a_j, {i}_a_cdr3 - columns with distances to each alpha prototype
- {i}_b_v, {i}_b_j, {i}_b_cdr3 - columns with distances to each beta prototype
Each line of the output file corresponds to one input clonotype.
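When post-processing the output, it helps to separate the annotation columns from the distance columns. The sketch below does this with a regular expression on the header. It assumes that `{i}` is an integer prototype index; the names `DISTANCE_COLUMN` and `split_columns` are hypothetical.

```python
import re

# Matches names such as 0_a_v, 12_b_cdr3 ({i} is assumed to be an integer index).
DISTANCE_COLUMN = re.compile(r"^(?P<index>\d+)_(?P<chain>[ab])_(?P<region>v|j|cdr3)$")


def split_columns(header):
    """Split a header into annotation columns and prototype-distance columns."""
    meta, dist = [], []
    for name in header:
        (dist if DISTANCE_COLUMN.match(name) else meta).append(name)
    return meta, dist


header = ["clone_id", "cdr3aa_alpha", "v_alpha", "j_alpha",
          "cdr3aa_beta", "v_beta", "j_beta",
          "0_a_v", "0_a_j", "0_a_cdr3", "1_a_v", "1_a_j", "1_a_cdr3",
          "0_b_v", "0_b_j", "0_b_cdr3", "1_b_v", "1_b_j", "1_b_cdr3"]

meta, dist = split_columns(header)
n_alpha_prototypes = len({DISTANCE_COLUMN.match(c)["index"]
                          for c in dist if DISTANCE_COLUMN.match(c)["chain"] == "a"})
print(meta)
print(len(dist), n_alpha_prototypes)
```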
Clustering output file will contain the following columns:
- clone_id - assigned identifier to each row of the input table (either transferred from initial data or generated)
- cdr3aa_{alpha/beta} - cdr3aa sequences for alpha/beta chain
- cluster - id of cluster, -1 if a clonotype is an outlier
A basic example of TCRemP usage is running it on VDJdb subsets. The input data for this example can be found in `data/example`. The derived embeddings were reduced to 50 components using PCA and then visualized using t-SNE. The clonotypes are colored by epitope.
Another example we introduce is the yellow fever vaccination clusters analysis. We merged the day 0 and day 15 datasets and ran TCRemP for the merged set of clonotypes. The clonotypes were further clustered and the enrichment score of each cluster on day 15 was calculated. For more details refer to the initial manuscript.
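The enrichment score itself is defined in the manuscript and is not reproduced here. As a generic illustration of how over-representation of one label in a cluster can be tested, the sketch below computes a one-sided hypergeometric tail probability. The function name `hypergeom_tail` and all counts are hypothetical, and this test is a common stand-in rather than the authors' method.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from N, K of which carry the label."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / comb(N, n)


# Hypothetical counts: 1000 clonotypes in total, 300 of them from day 15;
# a cluster of 10 clonotypes contains 9 from day 15.
p_value = hypergeom_tail(k=9, n=10, K=300, N=1000)
print(p_value)
```

A small probability suggests that the cluster holds more day 15 clonotypes than expected by chance; with many clusters, a multiple-testing correction would also be needed.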
We tested various values of k, the rank of the nearest neighbor used for DBSCAN epsilon estimation. The results show that k=4 is the optimal value.
We also performed an analysis of the embeddings derived from patient 10X data. For more information on this example refer to the manuscript Figure 2.