StrainAMR is a learning-based framework for predicting antimicrobial resistance (AMR) from bacterial genomes while exposing the genetic features that drive resistance. The toolkit combines k‑mers, single nucleotide variants (SNVs) and protein clusters to build an interpretable classifier and highlight meaningful feature pairs through attention and SHAP interaction scores.
- Accurate AMR prediction for bacterial strains from raw FASTA assemblies
- Biologically interpretable feature discovery using attention weights and SHAP interaction values
- Parallel genome processing with configurable thread count
- Token-to-feature mapping to translate model inputs back to genes, k‑mers and SNVs
- RGI-informed SNV annotation providing AMR gene family context in SHAP outputs
git clone https://github.com/liaoherui/StrainAMR.git
cd StrainAMR
unzip Test_genomes.zip
unzip localDB.zip
unzip Benchmark_features.zip
Install the helper utility:
pip install gdown
Option 1 (build from `strainamr.yaml`):
conda env create -f strainamr.yaml
conda activate strainamr
Option 2 (use the pre-built environment, recommended):
sh download_env.sh
source strainamr/bin/activate
sh download_ps.sh
python install_rebuild_ps.py
Add PhenotypeSeeker to your `PATH` (replace `/path/to/StrainAMR` with your directory):
echo "export PATH=\$PATH:/path/to/StrainAMR/PhenotypeSeeker/.PSenv/bin" >> ~/.bashrc
source ~/.bashrc
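To confirm that the export took effect in a new shell, a small check along these lines may help. This is a minimal sketch: `is_on_path` is a hypothetical helper (not part of StrainAMR), and the directory shown is the same placeholder as in the export line above.

```python
import os
from pathlib import Path


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    target = Path(directory).expanduser()
    # Skip empty entries, which appear when PATH contains "::" or a trailing separator.
    entries = [Path(p).expanduser() for p in path_value.split(os.pathsep) if p]
    return target in entries


# Placeholder location; substitute the path of your own StrainAMR clone.
ps_bin = "/path/to/StrainAMR/PhenotypeSeeker/.PSenv/bin"
print(is_on_path(ps_bin))
```

If this prints `False` after opening a new terminal, the export line was probably not written to the shell start-up file your shell actually reads.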
Check the command-line interfaces:
python StrainAMR_build_train.py -h
python StrainAMR_build_test.py -h
python StrainAMR_model_train.py -h
python StrainAMR_model_predict.py -h
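The four help checks can also be scripted, for example as a quick post-install smoke test. The sketch below assumes it is run from the StrainAMR directory with the environment activated; `help_command` and `check_cli` are hypothetical helper names introduced here for illustration.

```python
import subprocess
import sys

# The four entry points listed above.
SCRIPTS = [
    "StrainAMR_build_train.py",
    "StrainAMR_build_test.py",
    "StrainAMR_model_train.py",
    "StrainAMR_model_predict.py",
]


def help_command(script):
    """Build the argv list for `<current python> <script> -h`."""
    return [sys.executable, script, "-h"]


def check_cli(script):
    """Run the help command and report whether it exited with status 0."""
    result = subprocess.run(help_command(script), capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    for script in SCRIPTS:
        status = "ok" if check_cli(script) else "FAILED"
        print(f"{script}: {status}")
```

A `FAILED` line usually points to a missing dependency or to running the check outside the repository directory.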
Run the end‑to‑end demo on the bundled test genomes:
sh test_run.sh
Reproduce the three‑fold cross‑validation experiment from the paper:
sh batch_train_3fold_exp.sh

Options for `StrainAMR_build_train.py` (feature extraction for training genomes):

Flag | Default | Description |
---|---|---|
`-i`, `--input_file` | required | Directory containing training genome FASTA files |
`-l`, `--label_file` | required | Path to phenotype label file |
`-d`, `--drug` | required | Drug name to model |
`-p`, `--pc` | `0` | Skip protein-cluster token generation when set to 1 |
`-s`, `--snv` | `0` | Skip SNV token generation when set to 1 |
`-k`, `--kmer` | `0` | Skip k-mer token generation when set to 1 |
`-t`, `--threads` | `1` | Number of parallel worker processes |
`-o`, `--outdir` | `StrainAMR_res` | Output directory for generated features |
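As an illustration of how these flags combine, the sketch below assembles a command line for `StrainAMR_build_train.py`, assuming the table above describes that script. The function `build_train_command` and the example paths and drug name are placeholders introduced here; only the flags themselves come from the table.

```python
import shlex


def build_train_command(input_dir, label_file, drug, threads=1,
                        outdir="StrainAMR_res", skip_pc=False,
                        skip_snv=False, skip_kmer=False):
    """Return an argv list for StrainAMR_build_train.py using the documented flags."""
    cmd = [
        "python", "StrainAMR_build_train.py",
        "-i", input_dir,
        "-l", label_file,
        "-d", drug,
        "-t", str(threads),
        "-o", outdir,
    ]
    # Each feature type is generated by default (flag value 0); passing 1 skips it.
    for flag, skip in (("-p", skip_pc), ("-s", skip_snv), ("-k", skip_kmer)):
        if skip:
            cmd += [flag, "1"]
    return cmd


# Placeholder inputs: replace with your own genome directory, label file and drug.
cmd = build_train_command("genomes/", "labels.txt", "drug_name",
                          threads=8, skip_kmer=True)
print(shlex.join(cmd))
```

Building the argument list first and printing it with `shlex.join` makes it easy to review the exact command before passing it to `subprocess.run`.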

Options for `StrainAMR_build_test.py` (feature extraction for test genomes):

Flag | Default | Description |
---|---|---|
`-i`, `--input_file` | required | Directory containing test genome FASTA files |
`-l`, `--label_file` | required | Path to phenotype label file for the test data |
`-d`, `--drug` | required | Drug name to model; must match training data |
`-p`, `--pc` | `0` | Skip protein-cluster token generation when set to 1 |
`-s`, `--snv` | `0` | Skip SNV token generation when set to 1 |
`-k`, `--kmer` | `0` | Skip k-mer token generation when set to 1 |
`-t`, `--threads` | `1` | Number of parallel worker processes |
`-o`, `--outdir` | required | Output directory; should match training output directory |
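Because the test build expects the same drug and output directory as the training build, a pre-flight comparison can catch mismatches before a long run. The following is a sketch under that assumption; `find_mismatches` and the dictionary layout are hypothetical and not part of StrainAMR.

```python
def find_mismatches(train_args, test_args, keys=("drug", "outdir")):
    """Return the keys whose values differ between the two argument sets."""
    return [key for key in keys if train_args.get(key) != test_args.get(key)]


# Placeholder values for illustration only.
train_args = {"drug": "drug_name", "outdir": "StrainAMR_res"}
test_args = {"drug": "drug_name", "outdir": "StrainAMR_test_res"}

for key in find_mismatches(train_args, test_args):
    print(f"warning: --{key} differs between training "
          f"({train_args[key]!r}) and test ({test_args[key]!r})")
```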

Options for `StrainAMR_model_train.py` (model training):

Flag | Default | Description |
---|---|---|
`-i`, `--input_file` | required | Directory produced by the build scripts, containing token files |
`-f`, `--feature_used` | `all` | Comma-separated list of features to use (`kmer`, `snv`, `pc`) |
`-t`, `--train_mode` | `0` | Set to 1 if only training data are provided |
`-s`, `--save_mode` | `1` | Set to 0 to save the model with the minimum validation loss |
`-a`, `--attention_weight` | `1` | Set to 0 to disable saving attention matrices |
`-o`, `--outdir` | `StrainAMR_fold_res` | Directory for models, logs and SHAP outputs |
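The `--feature_used` value is either `all` or a comma-separated subset of `kmer`, `snv` and `pc`. The sketch below shows one way such a value could be validated before launching a run. It is an illustration only and not StrainAMR's own argument parser; `parse_feature_used` is a hypothetical helper.

```python
VALID_FEATURES = ("kmer", "snv", "pc")


def parse_feature_used(value="all"):
    """Turn a --feature_used string into a list of feature names."""
    if value.strip().lower() == "all":
        return list(VALID_FEATURES)
    features = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [name for name in features if name not in VALID_FEATURES]
    if unknown or not features:
        raise ValueError(
            f"expected 'all' or a comma-separated subset of {VALID_FEATURES}, got {value!r}"
        )
    return features


print(parse_feature_used("kmer,snv"))
```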

Options for `StrainAMR_model_predict.py` (prediction):

Flag | Default | Description |
---|---|---|
`-i`, `--input_file` | required | Directory of feature files for prediction |
`-f`, `--feature_used` | `all` | Feature types to use (`kmer`, `snv`, `pc`) |
`-m`, `--model_PATH` | required | Directory containing pre-trained models |
`-o`, `--outdir` | `StrainAMR_fold_res` | Directory for logs, SHAP results and analysis outputs |

- Feature extraction (`StrainAMR_build_train.py` / `StrainAMR_build_test.py`)
  - Token files such as `strains_*_sentence_fs.txt`, `strains_*_pc_token_fs.txt`, `strains_*_kmer_token.txt`
  - Mapping files (`node_token_match.txt`, `kmer_token_id.txt`) linking token IDs to genomic features
  - SHAP-filtered feature lists (`*_shap_filter.txt`)
  - `shap/`: SHAP value tables with token IDs mapped to genes or SNV positions, including AMR gene family annotations for SNV features
- Model training (`StrainAMR_model_train.py`)
  - Results are grouped into subfolders within the specified `--outdir`
    - `models/`: checkpoints such as `best_model_f1_score.pt`
    - `logs/`: training logs and per-sample probability outputs
    - `shap/`: SHAP interaction pair files (`strains_train_*_interaction.txt`) and the SHAP tables copied from feature extraction
    - `analysis/`: attention-weight graphs and top-token tables
- Prediction (`StrainAMR_model_predict.py`)
  - Results are saved under the specified `--outdir`
    - `logs/`: prediction summaries and per-sample probabilities
    - `shap/`: SHAP value tables and interaction scores for test genomes, with feature names
    - `analysis/`: attention-weight graphs and top-token tables for predictions
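
After a run, the layout described above can be checked programmatically. The sketch below assumes only the subfolder names and the `strains_train_*_interaction.txt` pattern listed above; `missing_subfolders` and `interaction_files` are hypothetical helpers, and the demo uses a temporary directory instead of a real output directory.

```python
import tempfile
from pathlib import Path

# Subfolders listed above for each stage.
EXPECTED_SUBFOLDERS = {
    "train": ("models", "logs", "shap", "analysis"),
    "predict": ("logs", "shap", "analysis"),
}


def missing_subfolders(outdir, stage):
    """Return the expected subfolders that do not exist under `outdir`."""
    root = Path(outdir)
    return [name for name in EXPECTED_SUBFOLDERS[stage] if not (root / name).is_dir()]


def interaction_files(outdir):
    """List SHAP interaction pair files written by model training."""
    return sorted((Path(outdir) / "shap").glob("strains_train_*_interaction.txt"))


# Demo on a throwaway directory; point these calls at your real --outdir instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "logs").mkdir()
    print(missing_subfolders(tmp, "predict"))
```

An empty list from `missing_subfolders` means every expected subfolder is present. It does not verify that the files inside them are complete.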

- `StrainAMR_build_train.py` and `StrainAMR_build_test.py` accept `--threads` to process genomes in parallel
- Model training computes SHAP interaction values and maps token IDs back to genomic features for improved interpretability
- SNV SHAP tables and attention-token reports include AMR gene family annotations derived from RGI outputs
If you use StrainAMR in your research, please cite:
Liao et al. StrainAMR: ... (2024)