
Morphic ATAC‑seq Pipeline

A comprehensive, Docker‑friendly pipeline for bulk ATAC‑seq analysis using Chromap. It produces Y‑chromosome read‑name lists, Tn5‑shifted bigWigs, coordinate‑sorted BAMs (+BAI), MACS2/MACS3 peaks, and ChromBPNet‑compatible preprocessing outputs.

Overview

What it does:

  • Auto‑pairs R1/R2 FASTQs
  • Extracts Y‑chromosome evidence (RNAME/RNEXT/SA:Z) → <sample>.y_readnames.txt
  • Generates bigWigs from BAM with ATAC Tn5 shifts (+4/−5) and bounded memory
  • Writes coordinate‑sorted <sample>.bam (+ .bai) and a per‑sample summary
  • Calls peaks with MACS2 and MACS3 (*.narrowPeak, logs, xls, summits)
  • Optionally produces TLEN‑derived fragments (--generate-TLEN-fragments)

Key Features

  • Single‑pass mapping with parallel consumers (BAM sort + Y‑name extraction; see the sketch below)
  • Bounded memory sorts (SORT_MEM, SAMTOOLS_SORT_MEM) and disk spill (TMPDIR)
  • Parallel per‑sample processing (JOBS) and per‑sample threads (THREADS)
  • Docker image includes bedtools, samtools, MACS2/3, bedGraphToBigWig
  • Manifest TSV tracks all outputs
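
To make the single‑pass design concrete, here is a minimal sketch of the fan‑out pattern, assuming Chromap can write SAM to stdout; the pipeline's actual command line and Y‑evidence rules (including SA:Z supplementary tags) are more involved:

# Illustrative only: one mapping pass feeds two consumers via tee.
# Sample names and FASTQ paths are placeholders.
chromap --preset atac -x "$INDEX_FILE" -r "$REF_FASTA" \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
        --SAM -o /dev/stdout -t "$THREADS" |
  tee >(samtools sort -m "$SAMTOOLS_SORT_MEM" -o sample.bam -) |
  awk '$1 !~ /^@/ && ($3 == "chrY" || $7 == "chrY") { print $1 }' |
  sort -u -S "${SORT_MEM_YNAMES:-64M}" > sample.y_readnames.txt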

Quick Start

Prerequisites

  • Docker (or run natively with bedtools, samtools, macs2, macs3, bedGraphToBigWig)
  • Reference FASTA, Chromap index, chrom.sizes (consistent contig naming)

Basic Usage (Docker)

# Build image (one time)
./build_chromap_container.sh

# Real run: 4 concurrent samples × 4 Chromap threads
FASTQ_DIR=/storage/ATAC-Seq \
OUT_DIR=/storage/ATAC-Seq/output \
OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs

# Optional: also write TLEN fragments
FASTQ_DIR=/storage/ATAC-Seq OUT_DIR=/storage/ATAC-Seq/output OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 ./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --generate-TLEN-fragments

# Optional: blacklist post‑filtering of MACS peaks
BLACKLIST_BED=/path/to/blacklist.bed \
FASTQ_DIR=/storage/ATAC-Seq OUT_DIR=/storage/ATAC-Seq/output OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 ./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs

Environment Variables

Variable | Default | Description
FASTQ_DIR | /storage/ATAC-Seq | Input FASTQs
OUT_DIR | /storage/ATAC-Seq/output | Staging + manifest
OUTPUT_BAM_DIR | /mnt/pikachu/ATAC-Seq-output | Final outputs (moved after MACS)
REF_FASTA | /storage/scRNAseq_output/indices-110-44/chromap/genome.fa | Reference FASTA
INDEX_FILE | /storage/scRNAseq_output/indices-110-44/chromap/index | Chromap index
CHROM_SIZES | /storage/scRNAseq_output/indices-110-44/chromap/chrNameLength.txt | Chrom sizes
THREADS | 4 | Chromap threads per sample
JOBS | 4 | Concurrent samples
DRYRUN | 0 | If 1, print heavy steps instead of running them
GENOME_SIZE | hs | MACS genome size (e.g., hs, mm)
SORT_MEM | 512M | Memory per GNU sort stage
SAMTOOLS_SORT_MEM | 512M | Memory per thread for samtools sort
SORT_MEM_YNAMES | 64M | Memory for sort -u when deduplicating Y readnames
SAVE_BEDGRAPH | 0 | If 1, keep <sample>_unstranded.bedGraph
BLACKLIST_BED | unset | Path to blacklist BED for peak post‑filtering

Command Line Flags

Core flags:

  • --samples=SID1,SID2,...: Process a subset
  • --number_of_jobs=N: Concurrency (alternative to the JOBS environment variable)
  • --deterministic-bam: Single‑threaded BAM sort for bitwise stability
  • --macs-gsize=STR: Override GENOME_SIZE
  • --generate-TLEN-fragments: Write TLEN‑derived fragments (3‑column BED)

Y‑chromosome:

  • --y-contig-labels=chrY,Y,...: Override Y contigs (auto‑detected otherwise; see the check below)
  • --y-names-only / --y-only: Enable Y‑names extraction (default on)
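
If auto‑detection picks up the wrong contigs, a quick way to see which Y‑like labels your chrom.sizes file actually contains (an illustrative check; the pipeline's own detection logic may differ):

# List candidate Y contig labels present in CHROM_SIZES.
cut -f1 "$CHROM_SIZES" | grep -iE '^(chr)?Y$'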

BigWigs:

  • Generated from the BAM via bedtools/awk/genomecov with +4/−5 Tn5 shifts (see the sketch below). Set SAVE_BEDGRAPH=1 to keep the intermediate bedGraph.
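
A hedged sketch of that path using standard tools (the pipeline's actual awk program and options may differ): each + strand read contributes an insertion site at start+4, each − strand read at end−5.

# BAM -> Tn5-shifted insertion sites -> bedGraph -> bigWig (illustrative).
bedtools bamtobed -i sample.bam |
  awk 'BEGIN{OFS="\t"}
       $6 == "+" { print $1, $2 + 4, $2 + 5 }
       $6 == "-" && $3 >= 5 { print $1, $3 - 5, $3 - 4 }' |
  sort -k1,1 -k2,2n -S "${SORT_MEM:-512M}" |
  bedtools genomecov -bg -i stdin -g "$CHROM_SIZES" > sample_unstranded.bedGraph
bedGraphToBigWig sample_unstranded.bedGraph "$CHROM_SIZES" sample_unstranded.bw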

Legacy/deprecated:

  • ChromBPNet helper flags (--ChromBPNet, --per-sample-shift, --no-bw-fallback) are no longer used.
  • --remove-Y deprecated in Y‑only mode.

Output Structure

Primary Pipeline Output

OUTPUT_BAM_DIR/<target>/<timepoint>/<sample_id>/
├── <sample>.bam                       # Coordinate‑sorted BAM
├── <sample>.bam.bai                   # BAM index
├── <sample>_unstranded.bw             # BigWig (Tn5 +4/−5)
├── <sample>.y_readnames.txt           # Y readnames
├── <sample>.fragments.bed             # Optional (with --generate-TLEN-fragments)
├── <sample>.macs2_peaks.narrowPeak    # MACS2 peaks
├── <sample>.macs3_peaks.narrowPeak    # MACS3 peaks
├── <sample>.macs2.log                 # MACS2 log
├── <sample>.macs3.log                 # MACS3 log
├── <sample>.macs2_peaks.xls           # MACS2 xls
├── <sample>.macs3_peaks.xls           # MACS3 xls
├── <sample>.macs2_summits.bed         # MACS2 summits
├── <sample>.macs3_summits.bed         # MACS3 summits
└── <sample>.summary.txt               # Chromap summary

Helper File Generation Output

When using create_helper_files.sh:

BAM_DIRECTORY/
├── <sample>.bam                      # Input BAM file
├── <sample>.bw                       # Generated BigWig (Tn5-shifted)
└── <sample>.fragments.bed.gz         # Generated fragments file (gzipped)

Peak Calling Output

When using generate_peaks.sh:

BAM_DIRECTORY/
├── <sample>.bam                           # Input BAM file
├── <sample>.macs2_peaks.narrowPeak       # MACS2 narrow peaks
├── <sample>.macs2_summits.bed            # MACS2 peak summits
├── <sample>.macs2_peaks.xls              # MACS2 detailed results
├── <sample>.macs2.log                    # MACS2 execution log
├── <sample>.macs3_peaks.narrowPeak       # MACS3 narrow peaks (if enabled)
├── <sample>.macs3_summits.bed            # MACS3 peak summits (if enabled)
├── <sample>.macs3_peaks.xls              # MACS3 detailed results (if enabled)
├── <sample>.macs3.log                    # MACS3 execution log (if enabled)
├── <sample>.macs2_peaks.filtered.narrowPeak  # Blacklist-filtered MACS2 peaks (if blacklist provided)
└── <sample>.macs3_peaks.filtered.narrowPeak  # Blacklist-filtered MACS3 peaks (if blacklist provided)

Processing Notes

  • Primary Pipeline: Files are moved to OUTPUT_BAM_DIR only after the bigWig and MACS2/3 steps complete (or error out)
  • Additional Scripts: Output files are created in place within the input BAM directory
  • Blacklist Filtering: Applied automatically if a blacklist BED file is provided (see the sketch after this list)
  • Optional Files: If SAVE_BEDGRAPH=1, <sample>_unstranded.bedGraph is also moved
  • Log Files: All processing scripts write logs for debugging and monitoring
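
The blacklist post‑filter amounts to a simple overlap exclusion, roughly as follows (an assumption about the implementation, not a copy of it):

# Drop any peak that overlaps a blacklist interval.
bedtools intersect -v -a sample.macs2_peaks.narrowPeak -b "$BLACKLIST_BED" \
  > sample.macs2_peaks.filtered.narrowPeak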

Manifest

manifest.tsv columns:

  • sample_id, target, timepoint, replicate, r1, r2, out_dir
  • fragments (blank unless --generate-TLEN-fragments), bigwig, y_readnames, bam
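
Because manifest.tsv is plain TSV, it is easy to query. For example, to list the bigWig paths for one target (assuming the manifest carries a header row naming the columns above; "CHD4" is just an example target):

awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
            $(col["target"]) == "CHD4" { print $(col["bigwig"]) }' manifest.tsv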

Examples

1) Single‑sample (Docker)

FASTQ_DIR=/storage/ATAC-short-test \
OUT_DIR=/storage/ATAC-short-test/output-yonly \
THREADS=2 JOBS=1 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --samples=ATAC-CHD4-0h-2

2) Multiple samples with custom threading

FASTQ_DIR=/storage/ATAC-bulk OUT_DIR=/storage/ATAC-bulk/output \
THREADS=16 JOBS=4 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs \
  --samples=SAMPLE1,SAMPLE2,SAMPLE3,SAMPLE4

3) Deterministic BAMs

./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --samples=TEST_SAMPLE

4) Include fragments

./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs \
  --generate-TLEN-fragments --samples=ATAC-CHD4-0h-2

5) Custom Y contigs

./runDocker.sh --y-contig-labels=chrY,Y,chr24 --y-names-only --deterministic-bam --macs-gsize=hs \
  --samples=SAMPLE_WITH_CUSTOM_Y

Additional Processing Scripts

Helper File Generation (create_helper_files.sh)

A parallel helper for generating ChromBPNet-compatible BigWig and fragments files from BAM inputs. By default it runs inside the Chromap Docker image for a consistent toolchain; use --native to execute entirely on the host.

Usage:

./create_helper_files.sh [--native|--no-docker] [--docker-image IMAGE] \
                         [--docker-arg ARG] [--docker-volume SPEC] \
                         [--overwrite] [--scan_until_file FILE] \
                         [--scan_interval SECONDS] BAM_DIRECTORY CHROM_SIZES_FILE

Key Features:

  • Docker-backed execution with optional native fallback (--native)
  • Parallel processing with configurable worker count (MAX_JOBS)
  • Continuous monitoring for streaming workflows (--scan_until_file)
  • Smart overwrite handling with per-file skip logic (--overwrite)
  • Single-pass BAM streaming to emit BigWig and fragments simultaneously

Important Options:

  • --native, --no-docker — run directly on the host without Docker
  • --docker-image IMAGE — override the default biodepot/chromap:0.3.2 container
  • --docker-arg ARG — forward extra flags to docker run (repeatable)
  • --docker-volume SPEC — mount additional host paths (host:container, repeatable)
  • --overwrite — replace existing .bw and .fragments.bed.gz
  • --scan_until_file FILE / --scan_interval SEC — enable continuous polling mode
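
Conceptually, --scan_until_file turns the script into a polling loop along these lines (a sketch of the behavior, not the script's internals; process_one is a hypothetical stand‑in for the per‑BAM work):

# Poll for new BAMs until the done file appears.
while [ ! -e "$SCAN_UNTIL_FILE" ]; do
  for bam in "$BAM_DIRECTORY"/*.bam; do
    [ -e "$bam" ] || continue                      # glob matched nothing yet
    [ -e "${bam%.bam}.bw" ] || process_one "$bam"  # skip completed samples
  done
  sleep "${SCAN_INTERVAL:-60}"
done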

Environment Variables:

  • MAX_JOBS, SORT_MEM, OVERWRITE, SCAN_UNTIL_FILE, SCAN_INTERVAL
  • CREATE_HELPER_DOCKER_IMAGE — override the default Docker image without CLI flags

Examples:

# Default Docker-backed execution
./create_helper_files.sh /data/bams /ref/hg38.chrom.sizes

# Run natively in the current environment
./create_helper_files.sh --native /data/bams /ref/hg38.chrom.sizes

# Continuous monitoring until done file appears
MAX_JOBS=8 ./create_helper_files.sh --scan_until_file /workflow/done.txt \
                                    /data/bams /ref/chrom.sizes

# Provide an extra volume and custom Docker image
./create_helper_files.sh --docker-image my/image:tag \
                         --docker-volume /scratch:/scratch \
                         /data/bams /ref/hg38.chrom.sizes

# Force overwrite existing outputs
./create_helper_files.sh --overwrite /data/bams /ref/chrom.sizes

Peak Calling (generate_peaks.sh)

A comprehensive peak calling script supporting MACS2 and/or MACS3 with advanced workflow features.

Usage:

./generate_peaks.sh [OPTIONS] BAM_DIRECTORY GENOME_SIZE [BLACKLIST_BED]

Key Features:

  • Flexible Peak Callers: Choose MACS2, MACS3, or both (--peak-caller)
  • Parallel Processing: Configurable concurrent jobs with progress tracking
  • Continuous Monitoring: Process BAM files as they appear in workflows
  • Blacklist Filtering: Optional post-processing peak filtering
  • Overwrite Control: Smart detection of existing outputs
  • Tool Validation: Automatic checking of MACS2/MACS3 availability

Options:

  • --peak-caller {macs2|macs3|macs23}: Peak caller selection (default: macs23)
  • --overwrite: Force overwrite of existing peak files
  • --scan_until_file FILE: Enable continuous monitoring mode
  • --scan_interval SECONDS: Scan interval for continuous mode (default: 60s)

Arguments:

  • BAM_DIRECTORY: Directory containing BAM files to process
  • GENOME_SIZE: Genome size for MACS (e.g., "hs", "mm", or numeric value)
  • BLACKLIST_BED: Optional blacklist BED file for filtering peaks
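
For orientation, a single‑sample MACS2 call consistent with the documented output names might look like this (a hedged sketch; the exact flags used inside generate_peaks.sh may differ):

# -n sample.macs2 yields sample.macs2_peaks.narrowPeak, sample.macs2_summits.bed,
# and sample.macs2_peaks.xls, matching the output list below.
macs2 callpeak -t sample.bam -f BAMPE -g hs -n sample.macs2 \
      --outdir "$BAM_DIRECTORY" --keep-dup all 2> sample.macs2.log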

Environment Variables:

  • MAX_JOBS: Maximum parallel jobs (default: number of CPU cores)
  • PEAK_CALLER: Peak caller selection (default: macs23)
  • OVERWRITE: Set to 1 to overwrite existing files (default: 0)

Examples:

# Basic peak calling with both MACS2 and MACS3
./generate_peaks.sh /data/bams hs

# Use only MACS2 with blacklist filtering
./generate_peaks.sh --peak-caller macs2 /data/bams hs /ref/blacklist.bed

# Continuous processing for workflow integration
./generate_peaks.sh --scan_until_file /workflow/done.txt /data/bams mm

# Parallel processing with custom settings
MAX_JOBS=8 ./generate_peaks.sh --overwrite --peak-caller macs23 /data/bams hs

Output Files:

  • ${sample_id}.macs2_peaks.narrowPeak - MACS2 narrow peaks
  • ${sample_id}.macs2_summits.bed - MACS2 peak summits
  • ${sample_id}.macs3_peaks.narrowPeak - MACS3 narrow peaks (if enabled)
  • ${sample_id}.macs3_summits.bed - MACS3 peak summits (if enabled)
  • ${sample_id}.macs2_peaks.filtered.narrowPeak, ${sample_id}.macs3_peaks.filtered.narrowPeak - Blacklist-filtered peaks (if blacklist provided)
  • ${sample_id}.macs2.log, ${sample_id}.macs3.log - Peak calling logs

Using ChromBPNet with These Outputs

  • Primary Pipeline: Use Tn5‑shifted bigWig (<sample>_unstranded.bw) and peak set (MACS2 or MACS3 *.narrowPeak)
  • Helper File Generation: Use create_helper_files.sh to build BigWig and fragments files (Docker by default, --native to stay on host)
  • Peak Calling: Use generate_peaks.sh for flexible MACS2/MACS3 peak calling with blacklist filtering
  • Blacklist Filtering: Apply by setting BLACKLIST_BED during the pipeline run, or pass the optional BLACKLIST_BED argument to generate_peaks.sh
  • Compatibility: Ensure contig names match across bigWig, peaks, and reference files

Troubleshooting & Performance

  • Parallelism: JOBS = concurrent samples; THREADS = Chromap threads per sample.
  • RAM caps: set SORT_MEM (GNU sort) and SAMTOOLS_SORT_MEM (samtools sort). Point TMPDIR at a fast disk.
  • Y readnames memory: deduplication is done by sort -u with SORT_MEM_YNAMES (default 64M). Increase it if disk is slow; decrease it if RAM is tight. Ensure TMPDIR has space for spill files.
  • Missing bigWigs: confirm bedGraphToBigWig is available (it is included in the Docker image); set SAVE_BEDGRAPH=1 to keep the intermediate bedGraph and verify its sort order with sort -c -k1,1 -k2,2n (see the checklist below).
  • Y contigs: auto‑detected from CHROM_SIZES; override with --y-contig-labels as needed.
  • Hangs: check disk space and I/O performance, and monitor Docker container resource usage.
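
A quick checklist for the missing‑bigWig case (illustrative commands):

command -v bedGraphToBigWig                      # tool on PATH (bundled in the Docker image)?
sort -c -k1,1 -k2,2n sample_unstranded.bedGraph  # bedGraph sorted? (requires SAVE_BEDGRAPH=1)
df -h "${TMPDIR:-/tmp}"                          # room for sort spill files?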

Technical Details

Processing Workflow

  1. FASTQ Discovery: Scans FASTQ_DIR for paired-end files
  2. Sample Pairing: Auto-matches R1/R2 files by naming patterns
  3. Chromap Alignment: Runs bulk paired-end alignment with Tn5 settings
  4. Stream Processing: Single-pass SAM processing with tee for:
    • Y-chromosome read extraction (AWK)
    • Fragment generation (AWK + TLEN; see the sketch after this list)
    • BAM coordinate sorting (samtools)
    • BigWig generation (ChromBPNet helper)
  5. Output Organization: Moves files to structured directory hierarchy
  6. Manifest Generation: Creates comprehensive tracking TSV
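
For the fragment branch, a minimal sketch of TLEN‑derived fragments (an assumption about the awk involved; whether fragment ends receive additional Tn5 shifts is not specified here). Each properly paired leftmost mate carries a positive TLEN, so the fragment spans [POS−1, POS−1+TLEN) in 0‑based BED coordinates:

# Properly paired reads only; keep the mate with positive TLEN.
samtools view -f 0x2 sample.bam |
  awk 'BEGIN{OFS="\t"} $9 > 0 { print $3, $4 - 1, $4 - 1 + $9 }' \
  > sample.fragments.bed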

ChromBPNet Integration

When the legacy --ChromBPNet flag is enabled (now deprecated; see Legacy/deprecated under Command Line Flags):

  • Fragments stream directly to helper via stdin
  • Applies ATAC Tn5 shifts: +4 (forward), -5 (reverse)
  • Uses --per-sample-shift for adaptive shift estimation
  • Falls back to bedGraphToBigWig if helper fails (unless --no-bw-fallback)

Memory and Performance

  • Streaming architecture minimizes memory footprint
  • Concurrent job processing with configurable limits
  • Single-pass SAM processing reduces I/O overhead
  • In-memory sorting with configurable thread counts

Testing Framework

The pipeline includes comprehensive test suites for all major components to ensure reliability and compatibility.

Test Scripts Overview

All test scripts are located in the tests/ directory and can be run independently or as part of a comprehensive validation suite.

Core Pipeline Tests

tests/test_overwrite_flag.sh

  • Tests the --overwrite functionality in runChromap.sh
  • Validates skipping existing files vs. forced reprocessing
  • Ensures proper file handling and directory structure

tests/run_serial_parallel_compare.sh

  • Compares serial vs. parallel processing results
  • Validates output consistency across different concurrency levels
  • Performance benchmarking and correctness verification

Helper File Generation Tests

tests/test_chrombpnet_native.sh

  • Tests create_helper_files.sh in native mode (--native)
  • Validates BigWig and fragments file generation
  • Tests single-threaded and multi-threaded execution
  • Verifies BAM sorting and paired-end processing

tests/test_chrombpnet_docker.sh

  • Tests create_helper_files.sh inside Docker
  • Validates Docker volume mounting and tool availability
  • Tests container-specific environment variables
  • Ensures compatibility with biodepot/chromap:0.3.2

tests/test_chrombpnet_overwrite.sh

  • Tests overwrite functionality in create_helper_files.sh
  • Validates file existence checking and skip behavior
  • Tests both flag-based and environment variable overrides

tests/test_chrombpnet_continuous.sh

  • Tests continuous monitoring functionality (--scan_until_file) in create_helper_files.sh
  • Validates file detection, processing queues, and done file handling
  • Tests custom scan intervals and interrupt handling
  • Simulates workflow automation scenarios

Peak Calling Tests

tests/test_generate_peaks_basic.sh

  • Tests basic generate_peaks.sh functionality
  • Validates MACS2/MACS3 peak calling with different configurations
  • Tests overwrite functionality and parallel processing
  • Validates tool availability checking and error handling

tests/test_generate_peaks_continuous.sh

  • Tests continuous monitoring features in generate_peaks.sh
  • Validates file detection and done file handling
  • Tests custom scan intervals and interrupt handling
  • Simulates automated peak calling workflows

tests/test_generate_peaks_docker.sh

  • Tests generate_peaks.sh within Docker environment
  • Validates container-based peak calling with volume mounting
  • Tests parallel processing and tool availability in containers
  • Ensures Docker compatibility and output file generation

Running Tests

Individual Test Execution:

# Run specific test suite
./tests/test_chrombpnet_native.sh
./tests/test_generate_peaks_basic.sh
./tests/test_chrombpnet_docker.sh

# Run with timeout for continuous tests
timeout 60 ./tests/test_chrombpnet_continuous.sh
timeout 120 ./tests/test_generate_peaks_docker.sh

Batch Test Execution:

# Run all ChromBPNet tests
for test in tests/test_chrombpnet_*.sh; do
    echo "Running $test..."
    timeout 60 "$test" || echo "Test $test failed or timed out"
done

# Run all peak calling tests
for test in tests/test_generate_peaks_*.sh; do
    echo "Running $test..."
    timeout 120 "$test" || echo "Test $test failed or timed out"
done

Test Data Requirements

Tests use sample data from the /storage/ATAC-short-test/ directory:

  • Small BAM files for quick validation
  • Representative ATAC-seq data structure
  • Compatible with both native and Docker environments

Test Output Validation

Each test script validates:

  • Output File Generation: Checks for expected output files
  • File Integrity: Validates file sizes and formats
  • Process Behavior: Tests skip/overwrite logic and error handling
  • Tool Compatibility: Ensures proper tool availability and execution
  • Docker Integration: Validates container mounting and execution
  • Continuous Monitoring: Tests workflow automation features

Debugging Failed Tests

Common Issues and Solutions:

# Check Docker availability
docker info

# Verify test data exists
ls -la /storage/ATAC-short-test/

# Check tool availability
command -v macs2 macs3 bedtools samtools

# Monitor test execution with verbose output
bash -x ./tests/test_name.sh

# Check test logs and temporary directories
ls -la /tmp/test-*

Test Environment Cleanup:

# Clean up test directories
rm -rf /tmp/test-*
rm -rf /tmp/*-test*

# Stop any hanging Docker containers
docker ps -q --filter ancestor="biodepot/chromap:0.3.2" | xargs -r docker kill

Contributing Test Cases

When adding new features:

  1. Create corresponding test scripts in tests/ directory
  2. Follow existing naming conventions: test_component_feature.sh
  3. Include comprehensive validation: file existence, content verification, error handling
  4. Test both native and Docker environments when applicable
  5. Add timeout handling for long-running or continuous tests
  6. Document test requirements and expected behavior
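
A minimal skeleton consistent with these conventions (hypothetical throughout; adapt the input BAM name, chrom.sizes path, and assertions to the feature under test):

#!/usr/bin/env bash
# tests/test_example_feature.sh (skeleton only)
set -euo pipefail
tmp=$(mktemp -d /tmp/test-example.XXXXXX)
trap 'rm -rf "$tmp"' EXIT

cp /storage/ATAC-short-test/sample.bam "$tmp"/   # placeholder test BAM
./create_helper_files.sh --native "$tmp" /ref/chrom.sizes

[ -s "$tmp/sample.bw" ] || { echo "FAIL: missing bigWig"; exit 1; }
echo "PASS"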

Contributing

This pipeline is designed for bulk ATAC-seq processing with a Y-chromosome focus and ChromBPNet compatibility. For modifications:

  1. Test thoroughly: Run relevant test suites before and after changes
  2. Use DRYRUN mode: Test pipeline changes with DRYRUN=1 first
  3. Validate outputs: Ensure output structure and manifest consistency
  4. Docker compatibility: Test changes in Docker environment
  5. Update documentation: Document new flags, features, and test cases
  6. Add tests: Create test cases for new functionality

License

MIT License - see LICENSE file for details.
