A comprehensive Docker‑friendly pipeline for bulk ATAC‑seq analysis using Chromap. It produces Y‑chromosome readnames, Tn5‑shifted bigWigs, coordinate‑sorted BAMs (+BAI), MACS2/MACS3 peaks, and ChromBPNet-compatible preprocessing outputs.
What it does:
- Auto‑pairs R1/R2 FASTQs
- Extracts Y‑chromosome evidence (RNAME/RNEXT/SA:Z) → `<sample>.y_readnames.txt`
- Generates bigWigs from BAM with ATAC Tn5 shifts (+4/−5) and bounded memory
- Writes coordinate‑sorted `<sample>.bam` (+ `.bai`) and a per‑sample summary
- Calls peaks with MACS2 and MACS3 (`*.narrowPeak`, logs, xls, summits)
- Optionally produces TLEN‑derived fragments (`--generate-TLEN-fragments`)
- Single‑pass mapping, parallel consumers (BAM sort + Y names)
- Bounded memory sorts (`SORT_MEM`, `SAMTOOLS_SORT_MEM`) and disk spill (`TMPDIR`)
- Parallel per‑sample processing (`JOBS`) and per‑sample threads (`THREADS`)
JOBS) and per‑sample threads (THREADS) - Docker image includes bedtools, samtools, MACS2/3, bedGraphToBigWig
- Manifest TSV tracks all outputs
Requirements:
- Docker (or run natively with bedtools, samtools, macs2, macs3, bedGraphToBigWig)
- Reference FASTA, Chromap index, and chrom.sizes with consistent contig naming
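Because the bigWigs, peaks, and BAMs all inherit contig names from the index and chrom.sizes, a quick pre-flight check can catch naming mismatches early. A minimal sketch (file paths are illustrative):

```bash
# Compare contig names in the reference FASTA against chrom.sizes; any diff
# output indicates a mismatch that would break the bigWig/peak steps later.
diff \
  <(grep '^>' genome.fa | sed 's/^>//; s/[[:space:]].*//' | sort) \
  <(cut -f1 chrNameLength.txt | sort)
```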
Quick start:

```bash
# Build image (one time)
./build_chromap_container.sh

# Real run: 4 concurrent samples × 4 Chromap threads
FASTQ_DIR=/storage/ATAC-Seq \
OUT_DIR=/storage/ATAC-Seq/output \
OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs

# Optional: also write TLEN fragments
FASTQ_DIR=/storage/ATAC-Seq OUT_DIR=/storage/ATAC-Seq/output OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 ./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --generate-TLEN-fragments

# Optional: blacklist post‑filtering of MACS peaks
BLACKLIST_BED=/path/to/blacklist.bed \
FASTQ_DIR=/storage/ATAC-Seq OUT_DIR=/storage/ATAC-Seq/output OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 ./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs
```

Environment Variables:

| Variable | Default | Description |
|---|---|---|
| `FASTQ_DIR` | `/storage/ATAC-Seq` | Input FASTQs |
| `OUT_DIR` | `/storage/ATAC-Seq/output` | Staging + manifest |
| `OUTPUT_BAM_DIR` | `/mnt/pikachu/ATAC-Seq-output` | Final outputs (moved after MACS) |
| `REF_FASTA` | `/storage/scRNAseq_output/indices-110-44/chromap/genome.fa` | Reference FASTA |
| `INDEX_FILE` | `/storage/scRNAseq_output/indices-110-44/chromap/index` | Chromap index |
| `CHROM_SIZES` | `/storage/scRNAseq_output/indices-110-44/chromap/chrNameLength.txt` | Chrom sizes |
| `THREADS` | 4 | Chromap threads per sample |
| `JOBS` | 4 | Concurrent samples |
| `DRYRUN` | 0 | If 1, print heavy steps only |
| `GENOME_SIZE` | hs | MACS genome size (e.g., hs, mm) |
| `SORT_MEM` | 512M | Memory per GNU sort stage |
| `SAMTOOLS_SORT_MEM` | 512M | Memory per thread for samtools sort |
| `SORT_MEM_YNAMES` | 64M | Memory for `sort -u` when deduplicating Y readnames |
| `SAVE_BEDGRAPH` | 0 | If 1, keep `<sample>_unstranded.bedGraph` |
| `BLACKLIST_BED` | unset | Path to blacklist BED for post‑filtering |
Core flags:
- `--samples=SID1,SID2,...`: Process a subset of samples
- `--number_of_jobs=N`: Concurrency (alternative to the `JOBS` env var)
- `--deterministic-bam`: Single‑threaded BAM sort for bitwise stability
- `--macs-gsize=STR`: Override `GENOME_SIZE`
- `--generate-TLEN-fragments`: Write TLEN‑derived fragments (3‑column BED; see the sketch below)
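For reference, the TLEN-to-fragment conversion boils down to a small AWK filter. The snippet below is a hedged sketch of the idea, not the pipeline's exact code; it assumes `samtools` is on `PATH` and the BAM contains properly paired reads:

```bash
# Emit one 3-column BED fragment per proper pair: take the leftmost mate
# (TLEN > 0, primary alignments only) and span POS-1 .. POS-1+TLEN (0-based).
samtools view -f 0x2 -F 0x900 sample.bam \
  | awk 'BEGIN{OFS="\t"} $9 > 0 {print $3, $4 - 1, $4 - 1 + $9}' \
  > sample.fragments.bed
```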
Y‑chromosome:
- `--y-contig-labels=chrY,Y,...`: Override Y contigs (auto‑detected otherwise)
- `--y-names-only` / `--y-only`: Enable Y‑names extraction (default: on; see the sketch below)
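Conceptually, the Y-evidence extraction is an AWK filter over the SAM stream. The following is an illustration that assumes the Y contig is named `chrY`; it is not the pipeline's literal AWK program:

```bash
# Keep read names whose RNAME ($3), RNEXT ($7), or SA:Z tag mentions chrY,
# then deduplicate with a bounded-memory sort (mirrors SORT_MEM_YNAMES).
samtools view sample.bam \
  | awk '$3 == "chrY" || $7 == "chrY" || $0 ~ /SA:Z:[^\t]*chrY/ {print $1}' \
  | sort -u -S "${SORT_MEM_YNAMES:-64M}" \
  > sample.y_readnames.txt
```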
BigWigs:
- Generated from the BAM via bedtools/awk/genomecov with +4/−5 Tn5 shifts, as sketched below. Set `SAVE_BEDGRAPH=1` to keep the intermediate bedGraph.
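The shift-and-coverage step looks roughly like the pipeline below; treat it as a sketch of the approach rather than the exact command line, and note that `chrom.sizes` must use the same contig names as the BAM:

```bash
# Shift + strand starts by +4 and - strand ends by -5, compute unstranded
# coverage, sort the bedGraph as bedGraphToBigWig requires, then convert.
bedtools bamtobed -i sample.bam \
  | awk 'BEGIN{OFS="\t"} { if ($6 == "+") $2 += 4; else $3 -= 5; if ($3 > $2) print $1, $2, $3 }' \
  | sort -k1,1 -k2,2n -S "${SORT_MEM:-512M}" \
  | bedtools genomecov -i stdin -g chrom.sizes -bga \
  | LC_ALL=C sort -k1,1 -k2,2n -S "${SORT_MEM:-512M}" \
  > sample_unstranded.bedGraph
bedGraphToBigWig sample_unstranded.bedGraph chrom.sizes sample_unstranded.bw
```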
Legacy/deprecated:
- ChromBPNet helper flags (`--ChromBPNet`, `--per-sample-shift`, `--no-bw-fallback`) are no longer used.
- `--remove-Y` is deprecated in Y‑only mode.
Primary pipeline output layout:

```
OUTPUT_BAM_DIR/<target>/<timepoint>/<sample_id>/
├── <sample>.bam # Coordinate‑sorted BAM
├── <sample>.bam.bai # BAM index
├── <sample>_unstranded.bw # BigWig (Tn5 +4/−5)
├── <sample>.y_readnames.txt # Y readnames
├── <sample>.fragments.bed # Optional (with --generate-TLEN-fragments)
├── <sample>.macs2_peaks.narrowPeak # MACS2 peaks
├── <sample>.macs3_peaks.narrowPeak # MACS3 peaks
├── <sample>.macs2.log # MACS2 log
├── <sample>.macs3.log # MACS3 log
├── <sample>.macs2_peaks.xls # MACS2 xls
├── <sample>.macs3_peaks.xls # MACS3 xls
├── <sample>.macs2_summits.bed # MACS2 summits
├── <sample>.macs3_summits.bed # MACS3 summits
└── <sample>.summary.txt # Chromap summary
```
When using `create_helper_files.sh`:

```
BAM_DIRECTORY/
├── <sample>.bam # Input BAM file
├── <sample>.bw # Generated BigWig (Tn5-shifted)
└── <sample>.fragments.bed.gz # Generated fragments file (gzipped)
```
When using `generate_peaks.sh`:

```
BAM_DIRECTORY/
├── <sample>.bam # Input BAM file
├── <sample>.macs2_peaks.narrowPeak # MACS2 narrow peaks
├── <sample>.macs2_summits.bed # MACS2 peak summits
├── <sample>.macs2_peaks.xls # MACS2 detailed results
├── <sample>.macs2.log # MACS2 execution log
├── <sample>.macs3_peaks.narrowPeak # MACS3 narrow peaks (if enabled)
├── <sample>.macs3_summits.bed # MACS3 peak summits (if enabled)
├── <sample>.macs3_peaks.xls # MACS3 detailed results (if enabled)
├── <sample>.macs3.log # MACS3 execution log (if enabled)
├── <sample>.macs2_peaks.filtered.narrowPeak # Blacklist-filtered MACS2 peaks (if blacklist provided)
└── <sample>.macs3_peaks.filtered.narrowPeak # Blacklist-filtered MACS3 peaks (if blacklist provided)
```
Notes:
- Primary Pipeline: Files are moved to `OUTPUT_BAM_DIR` only after the bigWig and MACS2/3 steps complete (or error out)
- Additional Scripts: Output files are created in place within the input BAM directory
- Blacklist Filtering: Applied automatically if a blacklist BED file is provided (see the sketch after this list)
- Optional Files: If `SAVE_BEDGRAPH=1`, `<sample>_unstranded.bedGraph` is also moved
- Log Files: All processing scripts generate comprehensive logs for debugging and monitoring
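For reference, blacklist post-filtering of a narrowPeak file amounts to removing peaks that overlap blacklist intervals; a minimal `bedtools` sketch (the pipeline's own invocation may differ in detail):

```bash
# -v keeps only peaks with no overlap against the blacklist; contig names in
# the blacklist BED must match the peak files.
bedtools intersect -v \
  -a sample.macs2_peaks.narrowPeak \
  -b /path/to/blacklist.bed \
  > sample.macs2_peaks.filtered.narrowPeak
```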
`manifest.tsv` columns:
`sample_id`, `target`, `timepoint`, `replicate`, `r1`, `r2`, `out_dir`, `fragments` (blank unless `--generate-TLEN-fragments`), `bigwig`, `y_readnames`, `bam`
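As an example of downstream use, the manifest can be queried with standard tools. The snippet assumes a header row and the column order listed above; the sample names are placeholders:

```bash
# Print sample_id and BAM path (the last column in the order above) for two samples.
awk -F'\t' 'NR > 1 && ($1 == "SAMPLE1" || $1 == "SAMPLE2") {print $1, $NF}' \
  "$OUT_DIR/manifest.tsv"
```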
Examples:

```bash
# Small single-sample test run
FASTQ_DIR=/storage/ATAC-short-test \
OUT_DIR=/storage/ATAC-short-test/output-yonly \
THREADS=2 JOBS=1 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --samples=ATAC-CHD4-0h-2

# Bulk run restricted to selected samples
FASTQ_DIR=/storage/ATAC-bulk OUT_DIR=/storage/ATAC-bulk/output \
THREADS=16 JOBS=4 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs \
  --samples=SAMPLE1,SAMPLE2,SAMPLE3,SAMPLE4

# Single sample with default paths
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --samples=TEST_SAMPLE

# Also write TLEN-derived fragments
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs \
  --generate-TLEN-fragments --samples=ATAC-CHD4-0h-2

# Custom Y contig labels
./runDocker.sh --y-contig-labels=chrY,Y,chr24 --y-names-only --deterministic-bam --macs-gsize=hs \
  --samples=SAMPLE_WITH_CUSTOM_Y
```

`create_helper_files.sh`: A parallel helper for generating ChromBPNet-compatible BigWig and fragments files from BAM inputs. By default it runs inside the Chromap Docker image for a consistent toolchain; use `--native` to execute entirely on the host.
Usage:
```bash
./create_helper_files.sh [--native|--no-docker] [--docker-image IMAGE] \
                         [--docker-arg ARG] [--docker-volume SPEC] \
                         [--overwrite] [--scan_until_file FILE] \
                         [--scan_interval SECONDS] BAM_DIRECTORY CHROM_SIZES_FILE
```

Key Features:
- Docker-backed execution with optional native fallback (`--native`)
- Parallel processing with configurable worker count (`MAX_JOBS`)
- Continuous monitoring for streaming workflows (`--scan_until_file`)
- Smart overwrite handling with per-file skip logic (`--overwrite`)
- Single-pass BAM streaming to emit BigWig and fragments simultaneously
Important Options:
- `--native`, `--no-docker`: run directly on the host without Docker
- `--docker-image IMAGE`: override the default `biodepot/chromap:0.3.2` container
- `--docker-arg ARG`: forward extra flags to `docker run` (repeatable)
- `--docker-volume SPEC`: mount additional host paths (`host:container`, repeatable)
- `--overwrite`: replace existing `.bw` and `.fragments.bed.gz`
- `--scan_until_file FILE` / `--scan_interval SEC`: enable continuous polling mode
Environment Variables:
- `MAX_JOBS`, `SORT_MEM`, `OVERWRITE`, `SCAN_UNTIL_FILE`, `SCAN_INTERVAL`
- `CREATE_HELPER_DOCKER_IMAGE`: override the default Docker image without CLI flags
Examples:
```bash
# Default Docker-backed execution
./create_helper_files.sh /data/bams /ref/hg38.chrom.sizes

# Run natively in the current environment
./create_helper_files.sh --native /data/bams /ref/hg38.chrom.sizes

# Continuous monitoring until the done file appears
MAX_JOBS=8 ./create_helper_files.sh --scan_until_file /workflow/done.txt \
  /data/bams /ref/chrom.sizes

# Provide an extra volume and a custom Docker image
./create_helper_files.sh --docker-image my/image:tag \
  --docker-volume /scratch:/scratch \
  /data/bams /ref/hg38.chrom.sizes

# Force overwrite of existing outputs
./create_helper_files.sh --overwrite /data/bams /ref/chrom.sizes
```

`generate_peaks.sh`: A comprehensive peak calling script supporting MACS2 and/or MACS3 with advanced workflow features.
Usage:
```bash
./generate_peaks.sh [OPTIONS] BAM_DIRECTORY GENOME_SIZE [BLACKLIST_BED]
```

Key Features:
- Flexible Peak Callers: Choose MACS2, MACS3, or both (`--peak-caller`)
- Parallel Processing: Configurable concurrent jobs with progress tracking
- Continuous Monitoring: Process BAM files as they appear in workflows
- Blacklist Filtering: Optional post-processing peak filtering
- Overwrite Control: Smart detection of existing outputs
- Tool Validation: Automatic checking of MACS2/MACS3 availability
Options:
- `--peak-caller {macs2|macs3|macs23}`: Peak caller selection (default: macs23)
- `--overwrite`: Force overwrite of existing peak files
- `--scan_until_file FILE`: Enable continuous monitoring mode
- `--scan_interval SECONDS`: Scan interval for continuous mode (default: 60s)
Arguments:
- `BAM_DIRECTORY`: Directory containing BAM files to process
- `GENOME_SIZE`: Genome size for MACS (e.g., "hs", "mm", or a numeric value)
- `BLACKLIST_BED`: Optional blacklist BED file for filtering peaks
Environment Variables:
- `MAX_JOBS`: Maximum parallel jobs (default: number of CPU cores)
- `PEAK_CALLER`: Peak caller selection (default: macs23)
- `OVERWRITE`: Set to 1 to overwrite existing files (default: 0)
Examples:
```bash
# Basic peak calling with both MACS2 and MACS3
./generate_peaks.sh /data/bams hs

# Use only MACS2 with blacklist filtering
./generate_peaks.sh --peak-caller macs2 /data/bams hs /ref/blacklist.bed

# Continuous processing for workflow integration
./generate_peaks.sh --scan_until_file /workflow/done.txt /data/bams mm

# Parallel processing with custom settings
MAX_JOBS=8 ./generate_peaks.sh --overwrite --peak-caller macs23 /data/bams hs
```

Output Files:
- `${sample_id}.macs2_peaks.narrowPeak` - MACS2 narrow peaks
- `${sample_id}.macs2_summits.bed` - MACS2 peak summits
- `${sample_id}.macs3_peaks.narrowPeak` - MACS3 narrow peaks (if enabled)
- `${sample_id}.macs3_summits.bed` - MACS3 peak summits (if enabled)
- `${sample_id}.macs2_peaks.filtered.narrowPeak` - Blacklist-filtered peaks (if blacklist provided)
- `${prefix}.macs2.log`, `${prefix}.macs3.log` - Peak calling logs
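Under the hood, paired-end peak calling with MACS2 typically looks like the command below. The exact flags `generate_peaks.sh` passes are not documented here, so treat this as an assumed sketch rather than the script's literal call (MACS3 exposes an equivalent `callpeak` command):

```bash
# Hypothetical MACS2 invocation for a paired-end ATAC BAM; BAMPE mode builds
# pileups from the full fragments of properly paired reads.
macs2 callpeak \
  -t sample.bam -f BAMPE -g hs \
  -n sample.macs2 --outdir /data/bams \
  --keep-dup all 2> sample.macs2.log
```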
Best practices:
- Primary Pipeline: Use the Tn5‑shifted bigWig (`<sample>_unstranded.bw`) and a peak set (MACS2 or MACS3 `*.narrowPeak`)
- Helper File Generation: Use `create_helper_files.sh` to build BigWig and fragments files (Docker by default, `--native` to stay on the host)
- Peak Calling: Use `generate_peaks.sh` for flexible MACS2/MACS3 peak calling with blacklist filtering
- Blacklist Filtering: Apply by setting `BLACKLIST_BED` during the pipeline run or by using `--blacklist` with the processing scripts
- Compatibility: Ensure contig names match across bigWig, peaks, and reference files
Performance and troubleshooting:
- Parallelism: `JOBS` = concurrent samples; `THREADS` = Chromap threads per sample.
- RAM caps: set `SORT_MEM` (GNU sort) and `SAMTOOLS_SORT_MEM` (samtools sort), and point `TMPDIR` at a fast disk (see the sketch after this list).
- Y readnames memory: deduplication is done by `sort -u` with `SORT_MEM_YNAMES` (default 64M). Increase it if the disk is slow; decrease it if RAM is tight. Ensure `TMPDIR` has space for spill files.
- Missing bigWigs: confirm `bedGraphToBigWig` is available (it is in the Docker image), set `SAVE_BEDGRAPH=1` to debug, and check the sort order (`sort -c -k1,1 -k2,2n`).
- Y contigs: auto‑detected from `CHROM_SIZES`; override with `--y-contig-labels` as needed.
- If a run appears to hang, check disk space and I/O performance
- Monitor Docker container resources
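To make the memory caps concrete, this is roughly how they map onto the underlying tools (an illustration with placeholder file names, not the pipeline's exact commands):

```bash
# GNU sort: cap RAM with -S and spill to TMPDIR with -T. samtools sort: -m is
# a per-thread cap, so total RAM is roughly threads x SAMTOOLS_SORT_MEM.
export TMPDIR=/fast/scratch
sort -k1,1 -k2,2n -S "${SORT_MEM:-512M}" -T "$TMPDIR" input.bedGraph > sorted.bedGraph
samtools sort -@ "${THREADS:-4}" -m "${SAMTOOLS_SORT_MEM:-512M}" \
  -T "$TMPDIR/sample" -o sample.bam sample.unsorted.bam
```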
How it works:
- FASTQ Discovery: Scans `FASTQ_DIR` for paired-end files
- Sample Pairing: Auto-matches R1/R2 files by naming patterns
- Chromap Alignment: Runs bulk paired-end alignment with Tn5 settings
- Stream Processing: Single-pass SAM processing with `tee` (see the sketch after this list) for:
  - Y-chromosome read extraction (AWK)
  - Fragment generation (AWK + TLEN)
  - BAM coordinate sorting (samtools)
  - BigWig generation (ChromBPNet helper)
- Output Organization: Moves files to structured directory hierarchy
- Manifest Generation: Creates comprehensive tracking TSV
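Putting the pieces together, the single-pass fan-out looks conceptually like the pipeline below. This is an illustration only: the `y_names.awk` and `tlen_fragments.awk` filter files are hypothetical stand-ins for the AWK programs sketched earlier, the shell variables are placeholders, and the real script's Chromap flags may differ:

```bash
# One Chromap pass emits SAM; tee feeds the Y-readname and fragment filters
# while samtools sort consumes the main stream into the coordinate-sorted BAM.
chromap --preset atac -x "$INDEX_FILE" -r "$REF_FASTA" \
        -1 "$R1" -2 "$R2" --SAM -o /dev/stdout -t "$THREADS" \
  | tee >(awk -f y_names.awk | sort -u -S "$SORT_MEM_YNAMES" > "$SAMPLE.y_readnames.txt") \
        >(awk -f tlen_fragments.awk > "$SAMPLE.fragments.bed") \
  | samtools sort -m "$SAMTOOLS_SORT_MEM" -T "$TMPDIR/$SAMPLE" -o "$SAMPLE.bam" -
samtools index "$SAMPLE.bam"
```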
When `--ChromBPNet` is enabled (legacy behavior):
- Fragments stream directly to helper via stdin
- Applies ATAC Tn5 shifts: +4 (forward), -5 (reverse)
- Uses `--per-sample-shift` for adaptive shift estimation
- Falls back to bedGraphToBigWig if the helper fails (unless `--no-bw-fallback`)
- Streaming architecture minimizes memory footprint
- Concurrent job processing with configurable limits
- Single-pass SAM processing reduces I/O overhead
- In-memory sorting with configurable thread counts
The pipeline includes comprehensive test suites for all major components to ensure reliability and compatibility.
All test scripts are located in the tests/ directory and can be run independently or as part of a comprehensive validation suite.
`tests/test_overwrite_flag.sh`
- Tests the `--overwrite` functionality in `runChromap.sh`
- Validates skipping existing files vs. forced reprocessing
- Ensures proper file handling and directory structure
`tests/run_serial_parallel_compare.sh`
- Compares serial vs. parallel processing results
- Validates output consistency across different concurrency levels
- Performance benchmarking and correctness verification
`tests/test_chrombpnet_native.sh`
- Tests `create_helper_files.sh` in native mode (`--native`)
- Validates BigWig and fragments file generation
- Tests single-threaded and multi-threaded execution
- Verifies BAM sorting and paired-end processing
`tests/test_chrombpnet_docker.sh`
- Tests `create_helper_files.sh` inside Docker
- Validates Docker volume mounting and tool availability
- Tests container-specific environment variables
- Ensures compatibility with `biodepot/chromap:0.3.2`
`tests/test_chrombpnet_overwrite.sh`
- Tests overwrite functionality in `create_helper_files.sh`
- Validates file existence checking and skip behavior
- Tests both flag-based and environment variable overrides
`tests/test_chrombpnet_continuous.sh`
- Tests continuous monitoring functionality (`--scan_until_file`) in `create_helper_files.sh`
- Validates file detection, processing queues, and done file handling
- Tests custom scan intervals and interrupt handling
- Simulates workflow automation scenarios
`tests/test_generate_peaks_basic.sh`
- Tests basic `generate_peaks.sh` functionality
- Validates MACS2/MACS3 peak calling with different configurations
- Tests overwrite functionality and parallel processing
- Validates tool availability checking and error handling
`tests/test_generate_peaks_continuous.sh`
- Tests continuous monitoring features in `generate_peaks.sh`
- Validates file detection and done file handling
- Tests custom scan intervals and interrupt handling
- Simulates automated peak calling workflows
`tests/test_generate_peaks_docker.sh`
- Tests `generate_peaks.sh` within a Docker environment
- Validates container-based peak calling with volume mounting
- Tests parallel processing and tool availability in containers
- Ensures Docker compatibility and output file generation
Individual Test Execution:
```bash
# Run specific test suites
./tests/test_chrombpnet_native.sh
./tests/test_generate_peaks_basic.sh
./tests/test_chrombpnet_docker.sh

# Run with a timeout for continuous tests
timeout 60 ./tests/test_chrombpnet_continuous.sh
timeout 120 ./tests/test_generate_peaks_docker.sh
```

Batch Test Execution:
```bash
# Run all ChromBPNet tests
for test in tests/test_chrombpnet_*.sh; do
  echo "Running $test..."
  timeout 60 "$test" || echo "Test $test failed or timed out"
done

# Run all peak calling tests
for test in tests/test_generate_peaks_*.sh; do
  echo "Running $test..."
  timeout 120 "$test" || echo "Test $test failed or timed out"
done
```

Tests use sample data from the `/storage/ATAC-short-test/` directory:
- Small BAM files for quick validation
- Representative ATAC-seq data structure
- Compatible with both native and Docker environments
Each test script validates:
- Output File Generation: Checks for expected output files
- File Integrity: Validates file sizes and formats
- Process Behavior: Tests skip/overwrite logic and error handling
- Tool Compatibility: Ensures proper tool availability and execution
- Docker Integration: Validates container mounting and execution
- Continuous Monitoring: Tests workflow automation features
Common Issues and Solutions:
```bash
# Check Docker availability
docker info

# Verify test data exists
ls -la /storage/ATAC-short-test/

# Check tool availability
command -v macs2 macs3 bedtools samtools

# Monitor test execution with verbose output
bash -x ./tests/test_name.sh

# Check test logs and temporary directories
ls -la /tmp/test-*
```

Test Environment Cleanup:
```bash
# Clean up test directories
rm -rf /tmp/test-*
rm -rf /tmp/*-test*

# Stop any hanging Docker containers
docker ps -q --filter ancestor="biodepot/chromap:0.3.2" | xargs -r docker kill
```

When adding new features:
- Create corresponding test scripts in the `tests/` directory
- Follow existing naming conventions: `test_component_feature.sh`
- Include comprehensive validation: file existence, content verification, error handling
- Test both native and Docker environments when applicable
- Add timeout handling for long-running or continuous tests
- Document test requirements and expected behavior
This pipeline is designed for bulk ATAC-seq processing with Y-chromosome focus and ChromBPNet compatibility. For modifications:
- Test thoroughly: Run relevant test suites before and after changes
- Use DRYRUN mode: Test pipeline changes with `DRYRUN=1` first
- Validate outputs: Ensure output structure and manifest consistency
- Docker compatibility: Test changes in Docker environment
- Update documentation: Document new flags, features, and test cases
- Add tests: Create test cases for new functionality
MIT License - see LICENSE file for details.