Topiary predicts, filters, and ranks MHC-presented peptides from any antigen source. It wraps mhctools binding predictors (NetMHCpan, MHCflurry, etc.) with a composable filtering and ranking DSL, expression-aware prioritization, and wide/long serialization.
Applications include personalized cancer vaccine design, viral epitope mapping, and characterizing T-cell responses.
Requires Python ≥ 3.9.
pip install topiaryFor variant annotation, gene lookups, and SelfProteome, download
Ensembl reference data (any release that matches your reference genome
works — GRCh38 releases are 76+; GRCh37 is release 75):
pyensembl install --release 112 --species humanFor cancer-testis antigen and tissue expression features:
pip install pirlygenesMHC predictors (NetMHCpan, mhcflurry, etc.) are installed
separately — topiary calls them through
mhctools. mhcflurry installs
via pip (pip install mhcflurry && mhcflurry-downloads fetch);
NetMHCpan and other DTU tools require a license from
DTU Health Tech.
The simplest use case: score a handful of peptides against one or more HLA alleles.
from topiary import TopiaryPredictor
from mhctools import NetMHCpan
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
)
df = predictor.predict_from_named_peptides({
"peptide_1": "YLQLVFGIEV",
"peptide_2": "LLFNILGGWV",
"peptide_3": "GILGFVFTL",
})The result is a DataFrame with one row per peptide-allele-kind combination. Each row has score (normalized, higher = better), value (raw prediction, e.g. IC50 nM), and percentile_rank (lower = better).
To scan full-length proteins with a sliding window instead:
df = predictor.predict_from_named_sequences({
"BRAF_V600E": "MAALSGGGGG...LATEKSRWSG",
})Multiple models can run together. The DataFrame will have rows for each model's prediction kinds:
from mhctools import NetMHCpan, MHCflurry
predictor = TopiaryPredictor(
models=[NetMHCpan, MHCflurry],
alleles=["HLA-A*02:01", "HLA-B*07:02"],
)Specify one or more predictors with --mhc-predictor and alleles with --mhc-alleles:
--mhc-predictor netmhcpan mhcflurry --mhc-alleles HLA-A*02:01,HLA-B*07:02All predictors come from mhctools.
With mhctools 3.7.0+, upstream predictor parsing supports multiple
predictors in one CLI invocation, so commands like
--mhc-predictor netmhcpan42 bigmhc-el are supported directly. Topiary keeps
its higher-level --filter-by / --sort-by DSL on top of that lower-level
predictor interface. Topiary's ranking/filtering DSL is also compatible with
the simplified mhctools 3.7.0+ kind constants API. NetChop and Pepsickle
behavior follows the upstream changes as well: improved NetChop error handling
and Pepsickle's epitope-focused model selection.
| CLI name | Predicts |
|---|---|
netmhcpan |
affinity + presentation (auto-detects installed version) |
netmhcpan4 / netmhcpan41 / netmhcpan42 |
specific NetMHCpan 4.x versions |
netmhcpan4-ba / netmhcpan4-el |
single-mode (binding affinity or eluted ligand) |
mhcflurry |
affinity + presentation + processing |
mixmhcpred |
presentation |
netmhciipan / netmhciipan4 / netmhciipan43 |
MHC class II |
bigmhc / bigmhc-el / bigmhc-im |
presentation / immunogenicity |
netmhcstabpan |
pMHC stability |
pepsickle / netchop |
proteasomal cleavage |
netmhcpan-iedb / netmhccons-iedb / smm-iedb |
IEDB web API (no local install) |
random |
random predictions (for testing) |
Peptide lengths: --mhc-epitope-lengths 8,9,10,11 (defaults come from the predictor).
Topiary has an expression language for filtering and ranking predictions. It works identically as Python objects and as CLI strings.
Keep only peptides that pass a threshold:
from topiary import TopiaryPredictor, Affinity, Presentation
from mhctools import NetMHCpan
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
filter_by=Affinity <= 500, # IC50 nM cutoff
)
# Or with presentation score:
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
filter_by=(Affinity <= 500) | (Presentation.rank <= 2.0), # keep if either passes
)| Accessor | Aliases | What it measures |
|---|---|---|
Affinity |
ba, aff, ic50 |
Binding affinity (IC50 nM) |
Presentation |
el |
Eluted ligand / presentation score |
Stability |
pMHC complex stability | |
Processing |
Antigen processing / cleavage |
Each kind has three fields:
.value— raw prediction (e.g. IC50 in nM). Default when no field is specified, soAffinity <= 500meansAffinity.value <= 500..rank— percentile rank (lower = better)..score— normalized score (higher = better).
Sort surviving peptides with sort_by. Transforms normalize heterogeneous scores to a common scale:
from topiary import TopiaryPredictor, Affinity, Presentation, mean
from mhctools import NetMHCpan, MHCflurry
predictor = TopiaryPredictor(
models=[NetMHCpan, MHCflurry],
alleles=["HLA-A*02:01"],
filter_by=Affinity <= 500,
sort_by=mean(
Affinity["netmhcpan"].logistic(350, 150),
Affinity["mhcflurry"].logistic(350, 150),
),
)When multiple models produce the same kind, qualify with brackets: Affinity["netmhcpan"].
Available transforms:
| Transform | What it does |
|---|---|
.descending_cdf(mean, std) |
Lower input -> higher output (for IC50, rank) |
.ascending_cdf(mean, std) |
Higher input -> higher output (for scores) |
.logistic(midpoint, width) |
Sigmoid normalization |
.clip(lo, hi) |
Clamp to range |
.hinge() |
max(0, x) |
.log() / .log2() / .log10() / .log1p() |
Logarithms |
.sqrt() / .exp() |
Square root, exponential |
abs(...) |
Absolute value |
Aggregations: mean(), geomean(), minimum(), maximum(), median()
Every filter and ranking expression has a string form for use at the CLI or in configuration:
| Python | String |
|---|---|
Affinity <= 500 |
affinity <= 500 / ba <= 500 |
Affinity.rank <= 2 |
affinity.rank <= 2 |
Presentation.score >= 0.5 |
el.score >= 0.5 |
(Affinity <= 500) | (Presentation.rank <= 2) |
affinity <= 500 | el.rank <= 2 |
Affinity["netmhcpan"] <= 500 |
netmhcpan_ba <= 500 |
0.5 * Affinity.score + 0.5 * Presentation.score |
0.5 * affinity.score + 0.5 * presentation.score |
Affinity.logistic(350, 150) |
affinity.logistic(350, 150) |
mean(Affinity.score, Presentation.score) |
mean(affinity.score, presentation.score) |
Column("gene_tpm") >= 5 |
gene_tpm >= 5 |
Column("gene_tpm").log() |
gene_tpm.log() |
Any DataFrame column can be used directly in expressions — expression data, peptide properties, custom annotations. Unknown identifiers are automatically treated as column references:
from topiary import Affinity, Column
# As a filter — bare name or Column() both work
filter_by = Column("gene_tpm") >= 5.0
# As a sorting signal
sort_by = 0.5 * Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")
# Transforms work on columns too
sort_by = Column("gene_tpm").log1p()In CLI strings, bare names work directly:
--filter-by "ba <= 500 & gene_tpm >= 5"
--sort-by "affinity.logistic(350, 150) + 0.1 * gene_tpm.log1p()"The explicit column(name) syntax also works: gene_tpm >= 5.
Misspelled column names get a helpful error at evaluation time: Column 'hydrophobicty' not found. Did you mean: ['hydrophobicity']?
The wt. prefix reads predictions for the wildtype peptide at the same position (variant workflows only). Use it for differential binding:
from topiary import Affinity, wt
sort_by = Affinity.score - wt.Affinity.score # mutant advantageshuffled. and self. prefixes work the same way for shuffled-decoy and self-proteome contexts.
Len() reads the peptide length. Count("C") counts amino acid occurrences:
from topiary import Len, Count, wt
sort_by = Count("C") - wt.Count("C") # gained/lost cysteines vs wildtypeLoad gene-level, transcript-level, or variant-level quantification to annotate predictions. Expression values become DataFrame columns that you can reference in ranking and filtering.
Each --*-expression flag takes a spec string: [name:]file[:id_col[:val_col]].
- The file is auto-detected (Salmon
.sf, Kallistoabundance.tsv, RSEM.genes.results/.isoforms.results, StringTie.gtf, Cufflinks.fpkm_tracking) or treated as generic TSV/CSV. - The loaded columns are joined onto the prediction DataFrame by
gene_id,transcript_id, orvariantrespectively. - Value columns are prefixed with the name (default:
gene,transcript, orvariant) and lowercased. For example, Salmon'sTPMcolumn loaded with--gene-expressionbecomesgene_tpm.
import pandas as pd
from topiary import TopiaryPredictor, Affinity, Column
from topiary.rna import load_expression_from_spec
from mhctools import NetMHCpan
from varcode import load_vcf
# Load expression data
gene_name, gene_id_col, gene_df = load_expression_from_spec(
"salmon_quant.sf", default_name="gene"
)
# gene_df has columns: ["Name", "TPM"]
# gene_name = "gene", gene_id_col = "Name"
expression_data = {
"gene": [(gene_name, gene_id_col, gene_df)],
"transcript": [],
"variant": [],
}
predictor = TopiaryPredictor(
models=[NetMHCpan],
alleles=["HLA-A*02:01"],
filter_by=(Affinity <= 500) & (Column("gene_tpm") >= 5.0),
)
variants = load_vcf("somatic.vcf")
df = predictor.predict_from_variants(
variants,
expression_data=expression_data,
)
# df now has a "gene_tpm" column from the Salmon fileThe name prefix and the original column name are combined as {prefix}_{column}, both lowercased:
| Flag | File | Original column | Result column |
|---|---|---|---|
--gene-expression quant.sf |
Salmon | TPM |
gene_tpm |
--gene-expression quant.sf |
Salmon | NumReads |
gene_numreads |
--transcript-expression abundance.tsv |
Kallisto | tpm |
transcript_tpm |
--variant-expression isovar.tsv |
generic | num_alt_reads |
variant_num_alt_reads |
--gene-expression mygene:quant.sf |
Salmon | TPM |
mygene_tpm |
The join keys are implicit: gene-level data joins on gene_id, transcript-level on transcript_id, variant-level on variant. These columns are present in variant-pipeline output.
When transcript-level expression data is provided, Topiary uses it to select the highest-expressed transcript per variant (instead of the default priority-based selection). This matches the behavior of the legacy --rna-transcript-fpkm-tracking-file flag.
| Extension / pattern | Format | Default ID column | Default value column |
|---|---|---|---|
.sf |
Salmon | Name |
TPM |
abundance*.tsv (with target_id + tpm header) |
Kallisto | target_id |
tpm |
.genes.results |
RSEM (gene) | gene_id |
TPM |
.isoforms.results |
RSEM (transcript) | transcript_id |
TPM |
.gtf |
StringTie | reference_id |
TPM (or FPKM) |
.fpkm_tracking |
Cufflinks | tracking_id |
FPKM |
| anything else | generic | first column | all numeric columns |
Override defaults with the full spec: name:file:id_col:val_col.
topiary \
--peptide-csv peptides.csv \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--output-csv results.csvThe CSV needs a peptide column (and optionally name). Each peptide is scored as-is.
topiary \
--fasta proteins.fasta \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02 \
--ic50-cutoff 500 \
--output-csv results.csv# Filter: keep if affinity OR presentation passes
topiary \
--fasta proteins.fasta \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--filter-by "ba <= 500 | el.rank <= 2" \
--output-csv results.csv
# Rank: composite score from two models
topiary \
--fasta antigens.fasta \
--mhc-predictor netmhcpan mhcflurry \
--mhc-alleles HLA-A*02:01 \
--filter-by "ba <= 500" \
--sort-by "mean(affinity['netmhcpan'].logistic(350, 150), affinity['mhcflurry'].logistic(350, 150))" \
--output-csv results.csvtopiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01,HLA-B*07:02 \
--filter-by "ba <= 500 | el.rank <= 2" \
--sort-by "0.6 * affinity.descending_cdf(500, 200) + 0.4 * presentation.score" \
--only-novel-epitopes \
--output-csv epitopes.csv# Gene-level expression from Salmon (auto-detected)
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--gene-expression salmon_quant.sf \
--filter-by "ba <= 500 & gene_tpm >= 5" \
--only-novel-epitopes \
--output-csv results.csv
# Transcript-level from Kallisto
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--transcript-expression kallisto/abundance.tsv \
--filter-by "ba <= 500" \
--sort-by "affinity.descending_cdf(500, 200) + 0.1 * transcript_tpm.log1p()" \
--output-csv results.csv
# Variant-level read support from isovar
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--variant-expression isovar_output.tsv \
--filter-by "ba <= 500 & variant_num_alt_reads >= 3" \
--output-csv results.csv
# Combine all three levels
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--gene-expression salmon_quant.sf \
--transcript-expression kallisto/abundance.tsv \
--variant-expression isovar_output.tsv \
--filter-by "ba <= 500 & gene_tpm >= 5 & variant_num_alt_reads >= 3" \
--output-csv results.csv
# Override auto-detection with explicit spec
topiary \
--vcf somatic.vcf \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--gene-expression mygene:custom_quant.tsv:ensembl_id:tpm_value \
--output-csv results.csvExpression flags are only valid with the variant pipeline (--vcf, --maf, --variant). They are rejected with direct sequence inputs (--fasta, --peptide-csv, etc.) because those outputs lack the gene_id / transcript_id / variant columns needed for joining.
Pull protein sequences from Ensembl:
topiary --gene-names BRAF TP53 EGFR --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --gene-ids ENSG00000157764 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --transcript-ids ENST00000288602 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --ensembl-proteome --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --cta --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01 # cancer-testis antigenstopiary \
--fasta proteins.fasta \
--mhc-predictor netmhcpan \
--mhc-alleles HLA-A*02:01 \
--regions spike:319-541 nucleocapsid:0-50 \
--output-csv results.csvFormat: NAME:START-END (0-indexed, half-open).
Remove peptides found in reference proteomes — for tumor-specific or pathogen-specific peptide selection:
--exclude-ensembl # human Ensembl proteome
--exclude-non-cta # non-CTA proteins (requires pirlygenes)
--exclude-tissues heart_muscle lung # genes expressed in these tissues
--exclude-fasta reference.fasta # custom reference sequences
--exclude-mode substring # "substring" (default) or "exact"--output-csv results.csv
--output-html results.html
--output-csv-sep "\t"
--subset-output-columns peptide allele affinity
--rename-output-column value ic50All predictions: source_sequence_name, peptide, peptide_offset, peptide_length, allele, kind, score, value, affinity, percentile_rank, prediction_method_name
ProteinFragment predictions add (both predict_from_fragments and the variant-based methods that build fragments internally): fragment_id, source_type, overlaps_target, wt_peptide, wt_peptide_length, plus any fragment-level annotations flattened to columns.
Variant predictions additionally add: variant, gene, gene_id, transcript_id, transcript_name, effect, effect_type, contains_mutant_residues, mutation_start_in_peptide, mutation_end_in_peptide
Expression data adds: columns named {prefix}_{column} as described in Column naming, e.g. gene_tpm, transcript_tpm, variant_num_alt_reads.
Compute amino acid properties and use them in ranking:
from topiary import TopiaryPredictor, Affinity, Column
from topiary.properties import add_peptide_properties
df = predictor.predict_from_named_sequences(seqs)
df = add_peptide_properties(df, groups=["manufacturability"])
# Properties become ranking signals
score = Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")
# CLI: --sort-by "affinity.logistic(350, 150) - 0.1 * cysteine_count"Named groups:
"core"— charge, hydrophobicity, aromaticity, molecular_weight"manufacturability"— core + cysteine_count, instability_index, max_7mer_hydrophobicity, difficult_nterm/cterm, asp_pro_bonds"immunogenicity"— core + tcr_charge, tcr_aromaticity, tcr_hydrophobicity (TCR-facing positions for MHC-I)
from topiary.sources import (
ensembl_proteome,
sequences_from_gene_names,
cta_sequences,
non_cta_sequences,
tissue_expressed_sequences,
available_tissues,
)
# All return dict[name -> amino_acid_sequence]
seqs = sequences_from_gene_names(["BRAF", "TP53", "EGFR"])
cta = cta_sequences() # cancer-testis antigens
heart = tissue_expressed_sequences(["heart_muscle"])
print(available_tissues()) # list tissue namesSkip the live predictor by loading pre-computed scores into a
CachedPredictor, then passing it to TopiaryPredictor(models=…).
Useful for reproducibility, iterating on filters/ranking without
paying the predictor cost, or ingesting predictions from a tool
topiary doesn't natively run.
Load from a prior topiary run (Parquet or TSV):
from topiary import CachedPredictor, TopiaryPredictor
cache = CachedPredictor.from_topiary_output("prior_run.parquet")
predictor = TopiaryPredictor(models=cache)
df = predictor.predict_from_variants(variants)Load from mhcflurry output (all three kinds — affinity, presentation, processing — are preserved):
cache = CachedPredictor.from_mhcflurry("mhcflurry_predictions.csv")Load from NetMHCpan stdout capture (both binding-affinity and
eluted-ligand kinds from a -BA run):
cache = CachedPredictor.from_netmhcpan_stdout("netmhcpan_run.out")Load from a generic TSV/CSV with column mapping (requires a kind
column in the file):
cache = CachedPredictor.from_tsv(
"third_party.tsv",
columns={"affinity": "ic50_nM", "percentile_rank": "rank"},
prediction_method_name="netchop",
predictor_version="3.1",
)Merge shards from parallel prediction jobs:
cache = CachedPredictor.from_directory("caches/", pattern="*.parquet")Every cache holds exactly one (predictor_name, predictor_version)
pair — mixing versions is rejected. See
docs/cached.md for full detail, including
mhcflurry composite versions, cache-plus-fallback mode, and
opt-in version equivalence.
Format is sniffed from file content — no --mhc-cache-format needed
for NetMHC-family, mhcflurry, or topiary-output files:
topiary --peptide-csv peptides.csv \
--mhc-cache-file netmhcpan_run.out \
--output-csv results.csvSharded caches from parallel jobs:
topiary --peptide-csv peptides.csv \
--mhc-cache-directory ./caches \
--output-csv results.csv--mhc-predictor and --mhc-alleles become optional when a cache
supplies predictions.