Skip to content

openvax/topiary

Repository files navigation

Tests Coverage Status PyPI

Topiary

Topiary predicts, filters, and ranks MHC-presented peptides from any antigen source. It wraps mhctools binding predictors (NetMHCpan, MHCflurry, etc.) with a composable filtering and ranking DSL, expression-aware prioritization, and wide/long serialization.

Applications include personalized cancer vaccine design, viral epitope mapping, and characterizing T-cell responses.

Installation

Requires Python ≥ 3.9.

pip install topiary

For variant annotation, gene lookups, and SelfProteome, download Ensembl reference data (any release that matches your reference genome works — GRCh38 releases are 76+; GRCh37 is release 75):

pyensembl install --release 112 --species human

For cancer-testis antigen and tissue expression features:

pip install pirlygenes

MHC predictors (NetMHCpan, mhcflurry, etc.) are installed separately — topiary calls them through mhctools. mhcflurry installs via pip (pip install mhcflurry && mhcflurry-downloads fetch); NetMHCpan and other DTU tools require a license from DTU Health Tech.

Predicting MHC binding

The simplest use case: score a handful of peptides against one or more HLA alleles.

from topiary import TopiaryPredictor
from mhctools import NetMHCpan

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
)

df = predictor.predict_from_named_peptides({
    "peptide_1": "YLQLVFGIEV",
    "peptide_2": "LLFNILGGWV",
    "peptide_3": "GILGFVFTL",
})

The result is a DataFrame with one row per peptide-allele-kind combination. Each row has score (normalized, higher = better), value (raw prediction, e.g. IC50 nM), and percentile_rank (lower = better).

To scan full-length proteins with a sliding window instead:

df = predictor.predict_from_named_sequences({
    "BRAF_V600E": "MAALSGGGGG...LATEKSRWSG",
})

Multiple models can run together. The DataFrame will have rows for each model's prediction kinds:

from mhctools import NetMHCpan, MHCflurry

predictor = TopiaryPredictor(
    models=[NetMHCpan, MHCflurry],
    alleles=["HLA-A*02:01", "HLA-B*07:02"],
)

MHC prediction models

Specify one or more predictors with --mhc-predictor and alleles with --mhc-alleles:

--mhc-predictor netmhcpan mhcflurry --mhc-alleles HLA-A*02:01,HLA-B*07:02

All predictors come from mhctools.

With mhctools 3.7.0+, upstream predictor parsing supports multiple predictors in one CLI invocation, so commands like --mhc-predictor netmhcpan42 bigmhc-el are supported directly. Topiary keeps its higher-level --filter-by / --sort-by DSL on top of that lower-level predictor interface. Topiary's ranking/filtering DSL is also compatible with the simplified mhctools 3.7.0+ kind constants API. NetChop and Pepsickle behavior follows the upstream changes as well: improved NetChop error handling and Pepsickle's epitope-focused model selection.

CLI name Predicts
netmhcpan affinity + presentation (auto-detects installed version)
netmhcpan4 / netmhcpan41 / netmhcpan42 specific NetMHCpan 4.x versions
netmhcpan4-ba / netmhcpan4-el single-mode (binding affinity or eluted ligand)
mhcflurry affinity + presentation + processing
mixmhcpred presentation
netmhciipan / netmhciipan4 / netmhciipan43 MHC class II
bigmhc / bigmhc-el / bigmhc-im presentation / immunogenicity
netmhcstabpan pMHC stability
pepsickle / netchop proteasomal cleavage
netmhcpan-iedb / netmhccons-iedb / smm-iedb IEDB web API (no local install)
random random predictions (for testing)

Peptide lengths: --mhc-epitope-lengths 8,9,10,11 (defaults come from the predictor).

Filtering and ranking

Topiary has an expression language for filtering and ranking predictions. It works identically as Python objects and as CLI strings.

Filters

Keep only peptides that pass a threshold:

from topiary import TopiaryPredictor, Affinity, Presentation
from mhctools import NetMHCpan

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=Affinity <= 500,                       # IC50 nM cutoff
)

# Or with presentation score:
predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=(Affinity <= 500) | (Presentation.rank <= 2.0),  # keep if either passes
)

Prediction kinds and fields

Accessor Aliases What it measures
Affinity ba, aff, ic50 Binding affinity (IC50 nM)
Presentation el Eluted ligand / presentation score
Stability pMHC complex stability
Processing Antigen processing / cleavage

Each kind has three fields:

  • .value — raw prediction (e.g. IC50 in nM). Default when no field is specified, so Affinity <= 500 means Affinity.value <= 500.
  • .rank — percentile rank (lower = better).
  • .score — normalized score (higher = better).

Ranking with transforms

Sort surviving peptides with sort_by. Transforms normalize heterogeneous scores to a common scale:

from topiary import TopiaryPredictor, Affinity, Presentation, mean
from mhctools import NetMHCpan, MHCflurry

predictor = TopiaryPredictor(
    models=[NetMHCpan, MHCflurry],
    alleles=["HLA-A*02:01"],
    filter_by=Affinity <= 500,
    sort_by=mean(
        Affinity["netmhcpan"].logistic(350, 150),
        Affinity["mhcflurry"].logistic(350, 150),
    ),
)

When multiple models produce the same kind, qualify with brackets: Affinity["netmhcpan"].

Available transforms:

Transform What it does
.descending_cdf(mean, std) Lower input -> higher output (for IC50, rank)
.ascending_cdf(mean, std) Higher input -> higher output (for scores)
.logistic(midpoint, width) Sigmoid normalization
.clip(lo, hi) Clamp to range
.hinge() max(0, x)
.log() / .log2() / .log10() / .log1p() Logarithms
.sqrt() / .exp() Square root, exponential
abs(...) Absolute value

Aggregations: mean(), geomean(), minimum(), maximum(), median()

String expressions

Every filter and ranking expression has a string form for use at the CLI or in configuration:

Python String
Affinity <= 500 affinity <= 500 / ba <= 500
Affinity.rank <= 2 affinity.rank <= 2
Presentation.score >= 0.5 el.score >= 0.5
(Affinity <= 500) | (Presentation.rank <= 2) affinity <= 500 | el.rank <= 2
Affinity["netmhcpan"] <= 500 netmhcpan_ba <= 500
0.5 * Affinity.score + 0.5 * Presentation.score 0.5 * affinity.score + 0.5 * presentation.score
Affinity.logistic(350, 150) affinity.logistic(350, 150)
mean(Affinity.score, Presentation.score) mean(affinity.score, presentation.score)
Column("gene_tpm") >= 5 gene_tpm >= 5
Column("gene_tpm").log() gene_tpm.log()

Column references

Any DataFrame column can be used directly in expressions — expression data, peptide properties, custom annotations. Unknown identifiers are automatically treated as column references:

from topiary import Affinity, Column

# As a filter — bare name or Column() both work
filter_by = Column("gene_tpm") >= 5.0

# As a sorting signal
sort_by = 0.5 * Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")

# Transforms work on columns too
sort_by = Column("gene_tpm").log1p()

In CLI strings, bare names work directly:

--filter-by "ba <= 500 & gene_tpm >= 5"
--sort-by "affinity.logistic(350, 150) + 0.1 * gene_tpm.log1p()"

The explicit column(name) syntax also works: gene_tpm >= 5.

Misspelled column names get a helpful error at evaluation time: Column 'hydrophobicty' not found. Did you mean: ['hydrophobicity']?

Scope prefixes

The wt. prefix reads predictions for the wildtype peptide at the same position (variant workflows only). Use it for differential binding:

from topiary import Affinity, wt

sort_by = Affinity.score - wt.Affinity.score  # mutant advantage

shuffled. and self. prefixes work the same way for shuffled-decoy and self-proteome contexts.

Peptide-level expressions

Len() reads the peptide length. Count("C") counts amino acid occurrences:

from topiary import Len, Count, wt

sort_by = Count("C") - wt.Count("C")   # gained/lost cysteines vs wildtype

Expression data

Load gene-level, transcript-level, or variant-level quantification to annotate predictions. Expression values become DataFrame columns that you can reference in ranking and filtering.

How it works

Each --*-expression flag takes a spec string: [name:]file[:id_col[:val_col]].

  1. The file is auto-detected (Salmon .sf, Kallisto abundance.tsv, RSEM .genes.results / .isoforms.results, StringTie .gtf, Cufflinks .fpkm_tracking) or treated as generic TSV/CSV.
  2. The loaded columns are joined onto the prediction DataFrame by gene_id, transcript_id, or variant respectively.
  3. Value columns are prefixed with the name (default: gene, transcript, or variant) and lowercased. For example, Salmon's TPM column loaded with --gene-expression becomes gene_tpm.

Python API

import pandas as pd
from topiary import TopiaryPredictor, Affinity, Column
from topiary.rna import load_expression_from_spec
from mhctools import NetMHCpan
from varcode import load_vcf

# Load expression data
gene_name, gene_id_col, gene_df = load_expression_from_spec(
    "salmon_quant.sf", default_name="gene"
)
# gene_df has columns: ["Name", "TPM"]
# gene_name = "gene", gene_id_col = "Name"

expression_data = {
    "gene": [(gene_name, gene_id_col, gene_df)],
    "transcript": [],
    "variant": [],
}

predictor = TopiaryPredictor(
    models=[NetMHCpan],
    alleles=["HLA-A*02:01"],
    filter_by=(Affinity <= 500) & (Column("gene_tpm") >= 5.0),
)

variants = load_vcf("somatic.vcf")
df = predictor.predict_from_variants(
    variants,
    expression_data=expression_data,
)
# df now has a "gene_tpm" column from the Salmon file

Column naming

The name prefix and the original column name are combined as {prefix}_{column}, both lowercased:

Flag File Original column Result column
--gene-expression quant.sf Salmon TPM gene_tpm
--gene-expression quant.sf Salmon NumReads gene_numreads
--transcript-expression abundance.tsv Kallisto tpm transcript_tpm
--variant-expression isovar.tsv generic num_alt_reads variant_num_alt_reads
--gene-expression mygene:quant.sf Salmon TPM mygene_tpm

The join keys are implicit: gene-level data joins on gene_id, transcript-level on transcript_id, variant-level on variant. These columns are present in variant-pipeline output.

Transcript selection

When transcript-level expression data is provided, Topiary uses it to select the highest-expressed transcript per variant (instead of the default priority-based selection). This matches the behavior of the legacy --rna-transcript-fpkm-tracking-file flag.

Auto-detected formats

Extension / pattern Format Default ID column Default value column
.sf Salmon Name TPM
abundance*.tsv (with target_id + tpm header) Kallisto target_id tpm
.genes.results RSEM (gene) gene_id TPM
.isoforms.results RSEM (transcript) transcript_id TPM
.gtf StringTie reference_id TPM (or FPKM)
.fpkm_tracking Cufflinks tracking_id FPKM
anything else generic first column all numeric columns

Override defaults with the full spec: name:file:id_col:val_col.

CLI usage

Score peptides

topiary \
  --peptide-csv peptides.csv \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --output-csv results.csv

The CSV needs a peptide column (and optionally name). Each peptide is scored as-is.

Scan proteins

topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --ic50-cutoff 500 \
  --output-csv results.csv

Filter and rank

# Filter: keep if affinity OR presentation passes
topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --filter-by "ba <= 500 | el.rank <= 2" \
  --output-csv results.csv

# Rank: composite score from two models
topiary \
  --fasta antigens.fasta \
  --mhc-predictor netmhcpan mhcflurry \
  --mhc-alleles HLA-A*02:01 \
  --filter-by "ba <= 500" \
  --sort-by "mean(affinity['netmhcpan'].logistic(350, 150), affinity['mhcflurry'].logistic(350, 150))" \
  --output-csv results.csv

Neoantigen discovery from somatic variants

topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01,HLA-B*07:02 \
  --filter-by "ba <= 500 | el.rank <= 2" \
  --sort-by "0.6 * affinity.descending_cdf(500, 200) + 0.4 * presentation.score" \
  --only-novel-epitopes \
  --output-csv epitopes.csv

Add expression data

# Gene-level expression from Salmon (auto-detected)
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression salmon_quant.sf \
  --filter-by "ba <= 500 & gene_tpm >= 5" \
  --only-novel-epitopes \
  --output-csv results.csv

# Transcript-level from Kallisto
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --transcript-expression kallisto/abundance.tsv \
  --filter-by "ba <= 500" \
  --sort-by "affinity.descending_cdf(500, 200) + 0.1 * transcript_tpm.log1p()" \
  --output-csv results.csv

# Variant-level read support from isovar
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --variant-expression isovar_output.tsv \
  --filter-by "ba <= 500 & variant_num_alt_reads >= 3" \
  --output-csv results.csv

# Combine all three levels
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression salmon_quant.sf \
  --transcript-expression kallisto/abundance.tsv \
  --variant-expression isovar_output.tsv \
  --filter-by "ba <= 500 & gene_tpm >= 5 & variant_num_alt_reads >= 3" \
  --output-csv results.csv

# Override auto-detection with explicit spec
topiary \
  --vcf somatic.vcf \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --gene-expression mygene:custom_quant.tsv:ensembl_id:tpm_value \
  --output-csv results.csv

Expression flags are only valid with the variant pipeline (--vcf, --maf, --variant). They are rejected with direct sequence inputs (--fasta, --peptide-csv, etc.) because those outputs lack the gene_id / transcript_id / variant columns needed for joining.

Gene and transcript lookups

Pull protein sequences from Ensembl:

topiary --gene-names BRAF TP53 EGFR --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --gene-ids ENSG00000157764 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --transcript-ids ENST00000288602 --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --ensembl-proteome --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01
topiary --cta --mhc-predictor netmhcpan --mhc-alleles HLA-A*02:01  # cancer-testis antigens

Restrict to protein regions

topiary \
  --fasta proteins.fasta \
  --mhc-predictor netmhcpan \
  --mhc-alleles HLA-A*02:01 \
  --regions spike:319-541 nucleocapsid:0-50 \
  --output-csv results.csv

Format: NAME:START-END (0-indexed, half-open).

Exclusion filtering

Remove peptides found in reference proteomes — for tumor-specific or pathogen-specific peptide selection:

--exclude-ensembl                    # human Ensembl proteome
--exclude-non-cta                    # non-CTA proteins (requires pirlygenes)
--exclude-tissues heart_muscle lung  # genes expressed in these tissues
--exclude-fasta reference.fasta      # custom reference sequences
--exclude-mode substring             # "substring" (default) or "exact"

Output

--output-csv results.csv
--output-html results.html
--output-csv-sep "\t"
--subset-output-columns peptide allele affinity
--rename-output-column value ic50

All predictions: source_sequence_name, peptide, peptide_offset, peptide_length, allele, kind, score, value, affinity, percentile_rank, prediction_method_name

ProteinFragment predictions add (both predict_from_fragments and the variant-based methods that build fragments internally): fragment_id, source_type, overlaps_target, wt_peptide, wt_peptide_length, plus any fragment-level annotations flattened to columns.

Variant predictions additionally add: variant, gene, gene_id, transcript_id, transcript_name, effect, effect_type, contains_mutant_residues, mutation_start_in_peptide, mutation_end_in_peptide

Expression data adds: columns named {prefix}_{column} as described in Column naming, e.g. gene_tpm, transcript_tpm, variant_num_alt_reads.

Peptide properties

Compute amino acid properties and use them in ranking:

from topiary import TopiaryPredictor, Affinity, Column
from topiary.properties import add_peptide_properties

df = predictor.predict_from_named_sequences(seqs)
df = add_peptide_properties(df, groups=["manufacturability"])

# Properties become ranking signals
score = Affinity.logistic(350, 150) - 0.1 * Column("cysteine_count")
# CLI: --sort-by "affinity.logistic(350, 150) - 0.1 * cysteine_count"

Named groups:

  • "core" — charge, hydrophobicity, aromaticity, molecular_weight
  • "manufacturability" — core + cysteine_count, instability_index, max_7mer_hydrophobicity, difficult_nterm/cterm, asp_pro_bonds
  • "immunogenicity" — core + tcr_charge, tcr_aromaticity, tcr_hydrophobicity (TCR-facing positions for MHC-I)

Protein sources (Python API)

from topiary.sources import (
    ensembl_proteome,
    sequences_from_gene_names,
    cta_sequences,
    non_cta_sequences,
    tissue_expressed_sequences,
    available_tissues,
)

# All return dict[name -> amino_acid_sequence]
seqs = sequences_from_gene_names(["BRAF", "TP53", "EGFR"])
cta = cta_sequences()                          # cancer-testis antigens
heart = tissue_expressed_sequences(["heart_muscle"])
print(available_tissues())                      # list tissue names

Cached predictions

Skip the live predictor by loading pre-computed scores into a CachedPredictor, then passing it to TopiaryPredictor(models=…). Useful for reproducibility, iterating on filters/ranking without paying the predictor cost, or ingesting predictions from a tool topiary doesn't natively run.

Load from a prior topiary run (Parquet or TSV):

from topiary import CachedPredictor, TopiaryPredictor

cache = CachedPredictor.from_topiary_output("prior_run.parquet")
predictor = TopiaryPredictor(models=cache)
df = predictor.predict_from_variants(variants)

Load from mhcflurry output (all three kinds — affinity, presentation, processing — are preserved):

cache = CachedPredictor.from_mhcflurry("mhcflurry_predictions.csv")

Load from NetMHCpan stdout capture (both binding-affinity and eluted-ligand kinds from a -BA run):

cache = CachedPredictor.from_netmhcpan_stdout("netmhcpan_run.out")

Load from a generic TSV/CSV with column mapping (requires a kind column in the file):

cache = CachedPredictor.from_tsv(
    "third_party.tsv",
    columns={"affinity": "ic50_nM", "percentile_rank": "rank"},
    prediction_method_name="netchop",
    predictor_version="3.1",
)

Merge shards from parallel prediction jobs:

cache = CachedPredictor.from_directory("caches/", pattern="*.parquet")

Every cache holds exactly one (predictor_name, predictor_version) pair — mixing versions is rejected. See docs/cached.md for full detail, including mhcflurry composite versions, cache-plus-fallback mode, and opt-in version equivalence.

From the CLI

Format is sniffed from file content — no --mhc-cache-format needed for NetMHC-family, mhcflurry, or topiary-output files:

topiary --peptide-csv peptides.csv \
    --mhc-cache-file netmhcpan_run.out \
    --output-csv results.csv

Sharded caches from parallel jobs:

topiary --peptide-csv peptides.csv \
    --mhc-cache-directory ./caches \
    --output-csv results.csv

--mhc-predictor and --mhc-alleles become optional when a cache supplies predictions.

About

Predict mutated T-cell epitopes from sequencing data

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors