A unified interface for genomic sequence oracles - deep learning models that predict genomic regulatory activity from DNA sequences.
Chorus provides a consistent, easy-to-use API for working with state-of-the-art genomic deep learning models including:
- Enformer: Predicts gene expression and chromatin states from DNA sequences
- Borzoi: Enhanced model for regulatory genomics predictions
- ChromBPNet: Predicts TF binding and chromatin accessibility at base-pair resolution
- Sei: Sequence regulatory effect predictions across 21,907 chromatin profiles
- LegNet: Predicts regulatory region activity using models trained on MPRA data
Key features:
- 🧬 Unified API across different models
- 📊 Built-in visualization tools for genomic tracks
- 🔬 Variant effect prediction
- 🎯 In silico mutagenesis and sequence optimization
- 📈 Track normalization and comparison utilities
- 🚀 Enhanced sequence editing logic
- 🔧 NEW: Isolated conda environments for each oracle to avoid dependency conflicts
Currently, the Enformer, Sei, Borzoi, ChromBPNet, and LegNet oracles are fully implemented, with:
- Environment isolation support
- Reference genome integration for biologically accurate predictions
- ENCODE track identifier support
- BedGraph output generation
# Clone the repository
git clone https://github.com/pinellolab/chorus.git
cd chorus
# Create the main Chorus environment
mamba env create -f environment.yml
mamba activate chorus
# Install chorus package
pip install -e .
Chorus uses isolated conda environments for each oracle to avoid dependency conflicts between TensorFlow and PyTorch models:
# Set up Enformer environment (TensorFlow-based)
chorus setup --oracle enformer
# List available environments
chorus list
You can check that the installation is correct with the following command:
# Check environment health
chorus health --timeout 300
Note: If you have not used an oracle before, it will need some time to download its weights.
Chorus includes built-in support for downloading and managing reference genomes:
# List available genomes
chorus genome list
# Download a reference genome (e.g., hg38, hg19, mm10)
chorus genome download hg38
# Get information about a downloaded genome
chorus genome info hg38
# Remove a downloaded genome
chorus genome remove hg38
Supported genomes:
- hg38: Human genome assembly GRCh38
- hg19: Human genome assembly GRCh37
- mm10: Mouse genome assembly GRCm38
- mm9: Mouse genome assembly NCBI37
- dm6: Drosophila melanogaster genome assembly BDGP6
- ce11: C. elegans genome assembly WBcel235
Genomes are stored in the genomes/ directory within your Chorus installation.
import chorus
from chorus.utils import get_genome
# Create oracle with reference genome (auto-downloads if needed)
genome_path = get_genome('hg38')
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path))
oracle.load_pretrained_model()
# Define tracks to predict (ENCODE IDs or descriptions)
tracks = ['ENCFF413AHU', 'CNhs11250']  # DNase:K562, CAGE:K562
By default, Chorus auto-detects and uses a GPU if available. You can explicitly control device selection:
# Force CPU usage (useful for testing or GPU memory issues)
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path),
device='cpu')
# Use specific GPU (for multi-GPU systems)
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path),
device='cuda:1') # Use second GPU
# Set default device via environment variable
# export CHORUS_DEVICE=cpu
For slower systems or CPU-only environments, you may need to adjust timeouts:
# Custom timeouts for slower systems
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path),
model_load_timeout=1200, # 20 minutes
predict_timeout=600) # 10 minutes
# Combine device and timeout settings
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path),
device='cpu', # Force CPU
model_load_timeout=1800, # 30 minutes for CPU
predict_timeout=900) # 15 minutes for CPU
# Disable all timeouts (use with caution)
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path),
model_load_timeout=None,
predict_timeout=None)
# Or set environment variable to disable all timeouts globally
# export CHORUS_NO_TIMEOUT=1
# Predict from genomic coordinates
predictions = oracle.predict(
('chr11', 5247000, 5248000), # Beta-globin locus
tracks
)
# Or from DNA sequence
sequence = 'ACGT' * 98304 # 393,216 bp for Enformer
predictions = oracle.predict(sequence, tracks)
# Replace a 200bp region with enhancer sequence
enhancer = 'GATA' * 50 # 200bp GATA motif repeats
replaced = oracle.predict_region_replacement(
'chr11:5247400-5247600', # Region to replace
enhancer, # New sequence
tracks
)
# Insert enhancer at specific position
inserted = oracle.predict_region_insertion_at(
'chr11:5247500', # Insertion point
enhancer, # Sequence to insert
tracks
)
# Test SNP effects (e.g., A→G mutation)
variant_effects = oracle.predict_variant_effect(
'chr11:5247000-5248000', # Region containing variant
'chr11:5247500', # Variant position
['A', 'G', 'C', 'T'], # Reference first, then alternates
tracks
)
# Save as BedGraph for genome browser
wt_files = predictions.save_predictions_as_bedgraph(output_dir="bedgraph_outputs",
                                                    prefix='a_wt')
For a detailed walkthrough with visualizations and gene annotations, see the comprehensive notebook:
# Download reference genome and gene annotations
chorus genome download hg38
# Run the comprehensive notebook
jupyter notebook examples/gata1_comprehensive_analysis.ipynb
This notebook demonstrates:
- All prediction methods with real genomic data
- Gene annotation and visualization
- Saving outputs for genome browsers
- Performance tips and best practices
Each oracle runs in its own conda environment to avoid dependency conflicts:
# TensorFlow-based Enformer runs in isolated environment
enformer = chorus.create_oracle('enformer', use_environment=True)
# PyTorch-based models run in their own environments
borzoi = chorus.create_oracle('borzoi', use_environment=True)
For accurate predictions, provide a reference genome to extract proper flanking sequences:
# Enformer requires 393,216 bp of context
# Chorus automatically extracts and pads sequences from the reference
# Option 1: Using get_genome() - simplest approach
from chorus.utils import get_genome
genome_path = get_genome('hg38') # Auto-downloads if not present
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path))
# Option 2: Using GenomeManager directly
from chorus.utils import GenomeManager
gm = GenomeManager()
genome_path = gm.get_genome('hg38') # Auto-downloads if needed
oracle = chorus.create_oracle('enformer',
use_environment=True,
reference_fasta=str(genome_path))
# Predict using genomic coordinates
predictions = oracle.predict(('chr1', 1000000, 1001000), ['DNase:K562'])
Note: ENCODE track identifiers and cell type descriptions are specific to the Enformer model. Other oracles may use different track naming conventions.
For Enformer:
# Using ENCODE identifier (recommended for reproducibility)
predictions = oracle.predict(sequence, ['ENCFF413AHU']) # Specific DNase:K562 experiment
# Using descriptive name
predictions = oracle.predict(sequence, ['DNase:K562'])
# Using CAGE identifiers
predictions = oracle.predict(sequence, ['CNhs11250'])  # CAGE:K562
For other oracles (Borzoi, ChromBPNet, Sei, etc.), track specifications will vary based on the model's training data.
Predictions can be saved as BedGraph tracks for genome browser visualization:
# Predictions are returned as numpy arrays
# Each bin represents 128 bp for Enformer
# See examples for BedGraph generation code
Oracles are deep learning models that predict genomic regulatory activity. Each oracle implements a common interface while running in isolated environments.
A class that provides a unified interface to the reference genome or sequence. It gives structured access to genomic coordinates and explicitly tracks sequence edits together with their corresponding model predictions, which supports reproducible in silico perturbation workflows and consistent downstream analysis.
Tracks represent genomic signal data (e.g., DNase-seq, ChIP-seq). Enformer predicts 5,313 human tracks covering various assays and cell types.
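The Enformer track identifiers used in this document come in three shapes: ENCODE file accessions (such as ENCFF413AHU), CAGE library IDs (such as CNhs11250), and descriptive names (such as 'DNase:K562'). As a minimal sketch, the helper below tells them apart before a list is passed to `oracle.predict`. The regular expressions are inferred from those examples only, and `classify_track_id` is a hypothetical helper, not part of the Chorus API.

```python
import re

# Patterns inferred from the example identifiers in this README (assumptions).
ENCODE_FILE_RE = re.compile(r"^ENCFF\d{3}[A-Z]{3}$")  # e.g. ENCFF413AHU
CAGE_LIBRARY_RE = re.compile(r"^CNhs\d+$")            # e.g. CNhs11250


def classify_track_id(track: str) -> str:
    """Return 'encode', 'cage', 'descriptive', or 'unknown' for a track string."""
    if ENCODE_FILE_RE.match(track):
        return "encode"
    if CAGE_LIBRARY_RE.match(track):
        return "cage"
    # Descriptive names in this README look like 'Assay:CellType'.
    if ":" in track and all(part.strip() for part in track.split(":", 1)):
        return "descriptive"
    return "unknown"


for track in ["ENCFF413AHU", "CNhs11250", "DNase:K562", "foo"]:
    print(track, "->", classify_track_id(track))
```

A check like this can catch typos in a track list early, before a long-running prediction is started; how Chorus itself resolves identifiers may differ.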
The chorus CLI manages conda environments for each oracle:
# Set up environments
chorus setup --oracle enformer
# Check health
chorus health
# Clean up
chorus remove --oracle enformer
Enformer [6] is a hybrid convolutional–transformer architecture designed for long-range sequence-to-function modeling of regulatory genomics, with the primary goal of predicting transcriptional and epigenomic activity directly from DNA sequence.
- Sequence length: 393,216 bp input, 114,688 bp output window
- Output: 896 bins × 5,313 tracks
- Bin size: 128 bp
- Track types: Gene expression (CAGE), chromatin accessibility (DNase/ATAC), histone modifications (ChIP-seq)
- Track identifiers:
- ENCODE IDs (e.g., ENCFF413AHU for DNase:K562)
- CAGE IDs (e.g., CNhs11250 for CAGE:K562)
- Descriptive names (e.g., 'DNase:K562', 'H3K4me3:HepG2')
- Track metadata: Included in the package (file with all 5,313 human track definitions)
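The numbers above fix how bin indices map to genomic coordinates: 896 bins of 128 bp cover 114,688 bp inside the 393,216 bp input. The sketch below works this out and formats one track as BedGraph lines. It assumes the output window is centred in the input window; `bin_intervals` and `bedgraph_lines` are hypothetical helpers written for illustration, not Chorus functions.

```python
INPUT_LEN = 393_216                     # bp of sequence context given to Enformer
N_BINS = 896                            # output bins per track
BIN_SIZE = 128                          # bp per bin
OUTPUT_LEN = N_BINS * BIN_SIZE          # 114,688 bp covered by predictions
FLANK = (INPUT_LEN - OUTPUT_LEN) // 2   # bp on each side without predictions


def bin_intervals(input_start):
    """Return (start, end) of every output bin, assuming a centred output window."""
    first = input_start + FLANK
    return [(first + i * BIN_SIZE, first + (i + 1) * BIN_SIZE) for i in range(N_BINS)]


def bedgraph_lines(chrom, input_start, values):
    """Format one track (a sequence of N_BINS numbers) as BedGraph text lines."""
    if len(values) != N_BINS:
        raise ValueError(f"expected {N_BINS} values, got {len(values)}")
    intervals = bin_intervals(input_start)
    return [f"{chrom}\t{s}\t{e}\t{v:.4f}" for (s, e), v in zip(intervals, values)]


# Dummy all-zero track for an input window starting at chr11:5,050,892 (0-based).
lines = bedgraph_lines("chr11", 5_050_892, [0.0] * N_BINS)
print(OUTPUT_LEN, FLANK)
print(lines[0])
```

BedGraph intervals are 0-based and half-open, so consecutive bins share a boundary coordinate without overlapping.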
An enhanced successor to Enformer with improved performance and RNA-seq track predictions.
- Sequence length: 524,288 bp input, 196,608 bp output window
- Output: 6,144 bins × 7,610 tracks
- Bin size: 32 bp
- Track types: Gene expression (CAGE, RNA-Seq), chromatin accessibility (DNase/ATAC), histone modifications (ChIP-seq)
- Track identifiers:
- ENCODE IDs (e.g., ENCFF413AHU for DNase:K562)
- CAGE IDs (e.g., CNhs11250 for CAGE:K562)
- Descriptive names (e.g., 'DNase:K562', 'H3K4me3:HepG2')
- Track metadata: Included in the package (file with all 7,610 human track definitions)
Base-pair resolution for chromatin accessibility and TF binding predictions (uses TF-specific tracks)
- Sequence length: 2114 bp input
- Output: 1000 bins
- Bin size: 1 bp
- Track types: DNase accessibility, TF binding (ChIP-seq)
- Track identifiers:
- ENCODE IDs (e.g., ENCFF574YLK for DNase:K562)
Sequence regulatory effect predictions (uses custom track naming for 21,907 profiles)
- Sequence length: 4096 bp input
- Output: 1 bin
- Bin size: 4096 bp
- Track types: DNase accessibility, TF binding (ChIP-seq), histone modifications
- Track identifiers:
- custom Sei track identifiers
- Track metadata: Included in the package (files with all 21,907 human track definitions and 41 Sei-defined classes)
LegNet is a fully convolutional neural network designed for efficient modeling of short regulatory DNA sequences.
- Sequence length: 200 bp input
- Output: 1 bin
- Bin size: 200 bp
- Track types: Element activity in MPRA experiment
- Track identifiers:
- cell line names
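The input lengths listed for the oracles differ by three orders of magnitude, so the same region of interest needs a very different amount of flanking sequence for each model. The sketch below centres a fixed-length window on a region's midpoint, using input lengths quoted in this document. `context_window` is a hypothetical helper; how Chorus actually extracts and pads sequences is not described here, and the sketch does not handle windows that run past chromosome ends.

```python
# Input lengths in bp, as listed for these oracles in this README.
INPUT_LENGTHS = {
    "enformer": 393_216,
    "chrombpnet": 2_114,
    "sei": 4_096,
    "legnet": 200,
}


def context_window(region_start, region_end, input_len):
    """Return (start, end) of an input_len-bp window centred on the region midpoint."""
    center = (region_start + region_end) // 2
    start = center - input_len // 2
    return start, start + input_len


# Region from the prediction example: chr11:5,247,000-5,248,000.
for name, length in INPUT_LENGTHS.items():
    start, end = context_window(5_247_000, 5_248_000, length)
    print(f"{name}: chr11:{start}-{end} ({end - start} bp)")
```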
If you encounter timeout errors on slower systems:
# Increase timeouts
oracle = chorus.create_oracle('enformer',
use_environment=True,
model_load_timeout=1800, # 30 minutes
predict_timeout=900) # 15 minutes
# Or disable timeouts entirely
# export CHORUS_NO_TIMEOUT=1  (run this in your shell, not in Python)
Common timeout scenarios:
- Model loading: First-time downloads can be slow (~1 GB model)
- CPU predictions: GPU is 10-100x faster than CPU
- Network filesystems: Add 50% to timeouts for NFS/shared storage
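As a rough sketch of the last rule of thumb, the helper below pads a pair of base timeouts by 50% when models or genomes live on network storage. `suggest_timeouts` is a hypothetical helper and the default values are simply the ones used in the examples above; only the keyword names `model_load_timeout` and `predict_timeout` come from this document.

```python
def suggest_timeouts(model_load=1200, predict=600, network_fs=False):
    """Return timeout keyword arguments in seconds, padded by 50% for network storage."""
    factor = 1.5 if network_fs else 1.0
    return {
        "model_load_timeout": int(model_load * factor),
        "predict_timeout": int(predict * factor),
    }


timeouts = suggest_timeouts(network_fs=True)
print(timeouts)  # {'model_load_timeout': 1800, 'predict_timeout': 900}
```

The resulting dictionary could then be unpacked into the call shown earlier, for example `chorus.create_oracle('enformer', use_environment=True, **timeouts)`.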
# Check if environment exists
chorus health
# Recreate environment
chorus remove --oracle enformer
chorus setup --oracle enformer
Some oracles require significant memory (~8-16 GB) for predictions. Solutions:
- Force CPU usage: device='cpu'
- Use a different GPU: device='cuda:1'
- Reduce batch size if needed
The isolated environments include GPU support. Ensure CUDA is properly installed on your system.
To check GPU availability:
# In your Python environment
import tensorflow as tf
print(f"GPUs available: {tf.config.list_physical_devices('GPU')}")
To force CPU usage when the GPU causes issues:
oracle = chorus.create_oracle('enformer',
use_environment=True,
device='cpu')
We welcome contributions! Areas needing work:
- Add more examples and tutorials
- Implement batch prediction optimizations
- Add more visualization utilities
- Add more oracles
We've designed Chorus to make it easy to add new genomic prediction models. Each oracle runs in its own isolated conda environment, avoiding dependency conflicts between different frameworks (TensorFlow, PyTorch, JAX, etc.).
For detailed instructions on implementing a new oracle, see our Contributing Guide.
Key steps:
- Inherit from OracleBase and implement required methods
- Define your conda environment configuration
- Use the environment isolation system for model loading and predictions
- Add tests and example notebooks
- Submit a PR with your implementation
The contributing guide includes complete code examples and templates to get you started.
If you use Chorus in your research, please cite:
@software{chorus2026,
title = {Chorus: A unified interface for genomic sequence oracles},
author = {Dmitry Penzar and Lorenzo Ruggeri and Rosalba Giugno and Luca Pinello},
year = {2026},
url = {https://github.com/pinellolab/chorus}
}
This project is licensed under the MIT License - see the LICENSE file for details.
Chorus integrates several groundbreaking models:
- Enformer (Avsec et al., 2021)
- Borzoi (Linder et al., 2023)
- ChromBPNet (Agarwal et al., 2021)
- Sei (Chen et al., 2022)
- LegNet (Penzar et al., 2023)
For visualization tasks we make extensive use of the CoolBox package.