Microbiome HiFi Amplicon Sequencing Simulator Preprint
A complete pipeline for simulating PacBio HiFi amplicon sequencing data for microbiome studies.
Note: Installation requires conda, samtools
git clone https://github.com/rhowardstone/MHASS.git
cd MHASS
bash install_dependencies.shAfter installation:
conda activate mhassNote: For generating the ASV FASTA file required for MHASS ('amplicons.fa', below) from a collection of genomes, you may use our other tool: https://github.com/rhowardstone/AmpliconHunter
For an example script using AmpliconHunter in consort with MHASS, please refer to: https://github.com/rhowardstone/MHASS_evaluation
mhass --amplicon-fasta Test-data_ATCC_16S_MSA-1003/amplicons.fa \
--amplicon-genome-labels Test-data_ATCC_16S_MSA-1003/amplicon_labels.txt \
--barcode-file Test-data_ATCC_16S_MSA-1003/barcodes.tsv --np-distribution-type lognormal \
--output-dir testout This will produce a directory at ./testout, containing the output of the simulation:
counts.tsv: Simulated ASV counts per samplecounts_meta.tsv: Sample library sizessample_barcode_map.tsv: Mapping of samples to barcodessequence_file_mapping.tsv: Detailed template mappingscombined_reads.fastq: Final output FASTQ file
mhass --amplicon-fasta amplicons.fa --amplicon-genome-labels labels.tsv --output-dir output/usage: mhass [-h] --amplicon-fasta AMPLICON_FASTA --amplicon-genome-labels AMPLICON_GENOME_LABELS --output-dir OUTPUT_DIR
[--num-samples NUM_SAMPLES] [--num-reads NUM_READS] [--var-intercept VAR_INTERCEPT] [--var-slope VAR_SLOPE]
[--genome-distribution GENOME_DISTRIBUTION] [--barcode-file BARCODE_FILE] [--subread-accuracy SUBREAD_ACCURACY]
[--difference-ratio DIFFERENCE_RATIO] [--np-distribution-type {empirical,lognormal}] [--lognormal-mu LOGNORMAL_MU]
[--lognormal-sigma LOGNORMAL_SIGMA] [--np-min NP_MIN] [--np-max NP_MAX] [--np-distribution NP_DISTRIBUTION] [--threads THREADS]
options:
-h, --help show this help message and exit
--amplicon-fasta AMPLICON_FASTA
Input FASTA file of amplicons (default: None)
--amplicon-genome-labels AMPLICON_GENOME_LABELS
TSV file mapping amplicons to genomes (asvid<TAB>genomeid) (default: None)
--output-dir OUTPUT_DIR
Output directory for all files (default: None)
--num-samples NUM_SAMPLES
Number of samples to simulate (default: 10)
--num-reads NUM_READS
Number of reads per sample (default: 10000)
--var-intercept VAR_INTERCEPT
Intercept for ASV variability model (controls baseline variation between samples)
(default: 1.47565981333483)
--var-slope VAR_SLOPE
Slope for ASV variability model (how variation changes with abundance)
(default: -0.909890963463704)
--genome-distribution GENOME_DISTRIBUTION
Distribution for genome abundances (uniform, lognormal, powerlaw, or empirical:<file>)
(default: uniform)
--barcode-file BARCODE_FILE
TSV file with barcodes (id, forward, reverse) (default: None)
--subread-accuracy SUBREAD_ACCURACY
Mean subread accuracy used in PBSIM (default: 0.65) (default: 0.65)
--difference-ratio DIFFERENCE_RATIO
Difference (error) ratio for PBSIM (substitution:insertion:deletion).
Default 6:55:39 is for PacBio RS II. Use 22:45:33 for PacBio Sequel, 39:24:36 for ONT (default: 6:55:39)
--np-distribution-type {empirical,lognormal}
Type of np distribution: empirical (from file) or lognormal (default: empirical)
--lognormal-mu LOGNORMAL_MU
mu parameter for lognormal distribution of Number of Passes (default from empirical fit) (default: 3.88)
--lognormal-sigma LOGNORMAL_SIGMA
sigma parameter for lognormal distribution of Num Passes (default from empirical fit) (default: 1.22)
--np-min NP_MIN Minimum np value when using lognormal distribution (default: 2)
--np-max NP_MAX Maximum np value when using lognormal distribution (default: 59)
--np-distribution NP_DISTRIBUTION
TSV file with empirical num-passes distribution (default: None)
--threads THREADS Number of threads to use for parallel processing (default: 190)--amplicon-fasta: Input FASTA file of amplicon reference sequences.--amplicon-genome-labels: TSV file mapping each ASV to a genome ID. Format:asvid<TAB>genomeid.--output-dir: Path to the directory where all output files will be written.
--num-samples: Number of synthetic samples to generate (default:10).--num-reads: Number of reads per sample (default:10000).--var-intercept: Intercept of the abundance variability model across samples (default:1.47565981333483).--var-slope: Slope of the abundance variability model as a function of ASV abundance (default:-0.909890963463704).--genome-distribution: Relative genome abundance distribution model. Options:uniform,lognormal,powerlaw, orempirical:<file.tsv>(default:uniform).--barcode-file: Optional TSV file of barcodes with columns:id,forward,reverse.--subread-accuracy: Mean subread accuracy used for PBSIM3 simulation (default:0.65).--difference-ratio: Difference (error) ratio for PBSIM in formatsubstitution:insertion:deletion. Each value must be 0-1000. Default6:55:39is for PacBio RS II. Use22:45:33for PacBio Sequel,39:24:36for ONT (default:6:55:39).--np-distribution-type: Distribution type for number of passes. Options:empiricalorlognormal(default:empirical).--lognormal-mu:muparameter for lognormal number-of-passes distribution (default:3.88).--lognormal-sigma:sigmaparameter for lognormal number-of-passes distribution (default:1.22).--np-min: Minimum number of passes for lognormal distribution (default:2).--np-max: Maximum number of passes for lognormal distribution (default:59).--np-distribution: TSV file specifying empirical number-of-passes distribution. Required if--np-distribution-typeisempirical.--threads: Number of parallel threads to use for simulation (default:190or all CPUs if unspecified).
>ASV001
ATCGATCGATCGATCG...
asvid genomeid
ASV001 Genome1
ASV002 Genome2
BarcodeID ForwardBarcode ReverseBarcode
BC01 CACATATCAGAGTGCG TGGCGTGCATGATTCGA
num_passes count
2 19
3 23159
...
genomeid proportion
Genome1 0.45
Genome2 0.30
- Count Simulation: Uses
metaSPARSimto simulate ASV abundance across samples based on genome distribution. - Template Creation: Templates are created per ASV×?Sample with barcodes and sampled np values.
- Read Simulation: PBSIM3 simulates PacBio subreads per template with user-defined
--subread-accuracy. - CCS Processing: Subreads are collapsed into CCS reads using
ccs. - Read Relabeling: All CCS reads are relabeled and merged into
combined_reads.fastq. - Cleanup: Intermediate files are cleaned up, leaving only final outputs.
mhass --amplicon-fasta test/amplicons.fa \
--amplicon-genome-labels test/labels.tsv \
--output-dir output/ \
--num-samples 5 \
--num-reads 1000 \
--threads 4 \
--subread-accuracy 0.9mhass --amplicon-fasta test/amplicons.fa \
--amplicon-genome-labels test/labels.tsv \
--output-dir output_sequel/ \
--num-samples 5 \
--num-reads 1000 \
--difference-ratio 22:45:33 \
--threads 4mhass --amplicon-fasta test/amplicons.fa \
--amplicon-genome-labels test/labels.tsv \
--output-dir output_ont/ \
--num-samples 5 \
--num-reads 1000 \
--difference-ratio 39:24:36 \
--threads 4For more usage information, please see our separate evaluation repository: https://github.com/rhowardstone/MHASS_evaluation
TBD ?? please cite appropriately once the tool is published.