
Morphic ATAC‑seq Pipeline

A comprehensive, Docker‑friendly pipeline for bulk ATAC‑seq analysis using Chromap. It produces Y‑chromosome read‑name lists, Tn5‑shifted bigWigs, coordinate‑sorted BAMs (+BAI), MACS2/MACS3 peaks, and ChromBPNet‑compatible preprocessing outputs.

Overview

What it does:

  • Auto‑pairs R1/R2 FASTQs
  • Extracts Y‑chromosome evidence (RNAME/RNEXT/SA:Z) → <sample>.y_readnames.txt
  • Generates bigWigs from BAM with ATAC Tn5 shifts (+4/−5) and bounded memory
  • Writes coordinate‑sorted <sample>.bam (+ .bai) and a per‑sample summary
  • Calls peaks with MACS2 and MACS3 (*.narrowPeak, logs, xls, summits)
  • Optionally produces TLEN‑derived fragments (--generate-TLEN-fragments)

Key Features

  • Single‑pass mapping with parallel consumers (BAM sort + Y‑name extraction; see the sketch below)
  • Bounded memory sorts (SORT_MEM, SAMTOOLS_SORT_MEM) and disk spill (TMPDIR)
  • Parallel per‑sample processing (JOBS) and per‑sample threads (THREADS)
  • Docker image includes bedtools, samtools, MACS2/3, bedGraphToBigWig
  • Manifest TSV tracks all outputs
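
To make the single‑pass design concrete, here is a minimal sketch of the fan‑out pattern, assuming Chromap can write SAM to stdout; the pipeline's actual command line and Y‑evidence rules (including SA:Z supplementary tags) are more involved:

# Illustrative only: one mapping pass feeds two consumers via tee.
# Sample names and FASTQ paths are placeholders.
chromap --preset atac -x "$INDEX_FILE" -r "$REF_FASTA" \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
        --SAM -o /dev/stdout -t "$THREADS" |
  tee >(samtools sort -m "$SAMTOOLS_SORT_MEM" -o sample.bam -) |
  awk '$1 !~ /^@/ && ($3 == "chrY" || $7 == "chrY") { print $1 }' |
  sort -u -S "${SORT_MEM_YNAMES:-64M}" > sample.y_readnames.txt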

Quick Start

Prerequisites

  • Docker (or run natively with bedtools, samtools, macs2, macs3, bedGraphToBigWig)
  • Reference FASTA, Chromap index, chrom.sizes (consistent contig naming)

Basic Usage (Docker)

# Build image (one time)
./build_chromap_container.sh

# Real run: 4 concurrent samples × 4 Chromap threads
FASTQ_DIR=/storage/ATAC-Seq \
OUT_DIR=/storage/ATAC-Seq/output \
OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs

# Optional: also write TLEN fragments
FASTQ_DIR=/storage/ATAC-Seq OUT_DIR=/storage/ATAC-Seq/output OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 ./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --generate-TLEN-fragments

# Optional: blacklist post‑filtering of MACS peaks
BLACKLIST_BED=/path/to/blacklist.bed \
FASTQ_DIR=/storage/ATAC-Seq OUT_DIR=/storage/ATAC-Seq/output OUTPUT_BAM_DIR=/mnt/pikachu/ATAC-results \
THREADS=4 JOBS=4 ./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs

Environment Variables

Variable | Default | Description
FASTQ_DIR | /storage/ATAC-Seq | Input FASTQs
OUT_DIR | /storage/ATAC-Seq/output | Staging + manifest
OUTPUT_BAM_DIR | /mnt/pikachu/ATAC-Seq-output | Final outputs (moved after MACS)
REF_FASTA | /storage/scRNAseq_output/indices-110-44/chromap/genome.fa | Reference FASTA
INDEX_FILE | /storage/scRNAseq_output/indices-110-44/chromap/index | Chromap index
CHROM_SIZES | /storage/scRNAseq_output/indices-110-44/chromap/chrNameLength.txt | Chrom sizes
THREADS | 4 | Chromap threads per sample
JOBS | 4 | Concurrent samples
DRYRUN | 0 | If 1, print heavy steps instead of running them
GENOME_SIZE | hs | MACS genome size (e.g., hs, mm)
SORT_MEM | 512M | Memory per GNU sort stage
SAMTOOLS_SORT_MEM | 512M | Memory per thread for samtools sort
SORT_MEM_YNAMES | 64M | Memory for sort -u when deduplicating Y readnames
SAVE_BEDGRAPH | 0 | If 1, keep <sample>_unstranded.bedGraph
BLACKLIST_BED | unset | Path to blacklist BED for peak post‑filtering

Command Line Flags

Core flags:

  • --samples=SID1,SID2,...: Process a subset
  • --number_of_jobs=N: Concurrency (alternative to the JOBS environment variable)
  • --deterministic-bam: Single‑threaded BAM sort for bitwise stability
  • --macs-gsize=STR: Override GENOME_SIZE
  • --generate-TLEN-fragments: Write TLEN‑derived fragments (3‑column BED)

Y‑chromosome:

  • --y-contig-labels=chrY,Y,...: Override Y contigs (auto‑detected otherwise; see the check below)
  • --y-names-only / --y-only: Enable Y‑names extraction (default on)
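
If auto‑detection picks up the wrong contigs, a quick way to see which Y‑like labels your chrom.sizes file actually contains (an illustrative check; the pipeline's own detection logic may differ):

# List candidate Y contig labels present in CHROM_SIZES.
cut -f1 "$CHROM_SIZES" | grep -iE '^(chr)?Y$'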

BigWigs:

  • Generated from the BAM via bedtools/awk/genomecov with +4/−5 Tn5 shifts (see the sketch below). Set SAVE_BEDGRAPH=1 to keep the intermediate bedGraph.
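
A hedged sketch of that path using standard tools (the pipeline's actual awk program and options may differ): each + strand read contributes an insertion site at start+4, each − strand read at end−5.

# BAM -> Tn5-shifted insertion sites -> bedGraph -> bigWig (illustrative).
bedtools bamtobed -i sample.bam |
  awk 'BEGIN{OFS="\t"}
       $6 == "+" { print $1, $2 + 4, $2 + 5 }
       $6 == "-" && $3 >= 5 { print $1, $3 - 5, $3 - 4 }' |
  sort -k1,1 -k2,2n -S "${SORT_MEM:-512M}" |
  bedtools genomecov -bg -i stdin -g "$CHROM_SIZES" > sample_unstranded.bedGraph
bedGraphToBigWig sample_unstranded.bedGraph "$CHROM_SIZES" sample_unstranded.bw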

Legacy/deprecated:

  • ChromBPNet helper flags (--ChromBPNet, --per-sample-shift, --no-bw-fallback) are no longer used.
  • --remove-Y deprecated in Y‑only mode.

Output Structure

Primary Pipeline Output

OUTPUT_BAM_DIR/<target>/<timepoint>/<sample_id>/
├── <sample>.bam                       # Coordinate‑sorted BAM
├── <sample>.bam.bai                   # BAM index
├── <sample>_unstranded.bw             # BigWig (Tn5 +4/−5)
├── <sample>.y_readnames.txt           # Y readnames
├── <sample>.fragments.bed             # Optional (with --generate-TLEN-fragments)
├── <sample>.macs2_peaks.narrowPeak    # MACS2 peaks
├── <sample>.macs3_peaks.narrowPeak    # MACS3 peaks
├── <sample>.macs2.log                 # MACS2 log
├── <sample>.macs3.log                 # MACS3 log
├── <sample>.macs2_peaks.xls           # MACS2 xls
├── <sample>.macs3_peaks.xls           # MACS3 xls
├── <sample>.macs2_summits.bed         # MACS2 summits
├── <sample>.macs3_summits.bed         # MACS3 summits
└── <sample>.summary.txt               # Chromap summary

Helper File Generation Output

When using create_helper_files.sh:

BAM_DIRECTORY/
├── <sample>.bam                      # Input BAM file
├── <sample>.bw                       # Generated BigWig (Tn5-shifted)
└── <sample>.fragments.bed.gz         # Generated fragments file (gzipped)

Peak Calling Output

When using generate_peaks.sh:

BAM_DIRECTORY/
├── <sample>.bam                           # Input BAM file
├── <sample>.macs2_peaks.narrowPeak       # MACS2 narrow peaks
├── <sample>.macs2_summits.bed            # MACS2 peak summits
├── <sample>.macs2_peaks.xls              # MACS2 detailed results
├── <sample>.macs2.log                    # MACS2 execution log
├── <sample>.macs3_peaks.narrowPeak       # MACS3 narrow peaks (if enabled)
├── <sample>.macs3_summits.bed            # MACS3 peak summits (if enabled)
├── <sample>.macs3_peaks.xls              # MACS3 detailed results (if enabled)
├── <sample>.macs3.log                    # MACS3 execution log (if enabled)
├── <sample>.macs2_peaks.filtered.narrowPeak  # Blacklist-filtered MACS2 peaks (if blacklist provided)
└── <sample>.macs3_peaks.filtered.narrowPeak  # Blacklist-filtered MACS3 peaks (if blacklist provided)

Processing Notes

  • Primary Pipeline: Files are moved to OUTPUT_BAM_DIR only after the bigWig and MACS2/3 steps complete (or error out)
  • Additional Scripts: Output files are created in place within the input BAM directory
  • Blacklist Filtering: Applied automatically if a blacklist BED file is provided (see the sketch after this list)
  • Optional Files: If SAVE_BEDGRAPH=1, <sample>_unstranded.bedGraph is also moved
  • Log Files: All processing scripts write logs for debugging and monitoring
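
The blacklist post‑filter amounts to a simple overlap exclusion, roughly as follows (an assumption about the implementation, not a copy of it):

# Drop any peak that overlaps a blacklist interval.
bedtools intersect -v -a sample.macs2_peaks.narrowPeak -b "$BLACKLIST_BED" \
  > sample.macs2_peaks.filtered.narrowPeak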

Manifest

manifest.tsv columns:

  • sample_id, target, timepoint, replicate, r1, r2, out_dir
  • fragments (blank unless --generate-TLEN-fragments), bigwig, y_readnames, bam
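
Because manifest.tsv is plain TSV, it is easy to query. For example, to list the bigWig paths for one target (assuming the manifest carries a header row naming the columns above; "CHD4" is just an example target):

awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
            $(col["target"]) == "CHD4" { print $(col["bigwig"]) }' manifest.tsv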

Examples

1) Single‑sample (Docker)

FASTQ_DIR=/storage/ATAC-short-test \
OUT_DIR=/storage/ATAC-short-test/output-yonly \
THREADS=2 JOBS=1 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --samples=ATAC-CHD4-0h-2

2) Multiple samples with custom threading

FASTQ_DIR=/storage/ATAC-bulk OUT_DIR=/storage/ATAC-bulk/output \
THREADS=16 JOBS=4 \
./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs \
  --samples=SAMPLE1,SAMPLE2,SAMPLE3,SAMPLE4

3) Deterministic BAMs

./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs --samples=TEST_SAMPLE

4) Include fragments

./runDocker.sh --y-names-only --deterministic-bam --macs-gsize=hs \
  --generate-TLEN-fragments --samples=ATAC-CHD4-0h-2

5) Custom Y contigs

./runDocker.sh --y-contig-labels=chrY,Y,chr24 --y-names-only --deterministic-bam --macs-gsize=hs \
  --samples=SAMPLE_WITH_CUSTOM_Y

Additional Processing Scripts

Helper File Generation (create_helper_files.sh)

A parallel helper for generating ChromBPNet-compatible BigWig and fragments files from BAM inputs. By default it runs inside the Chromap Docker image for a consistent toolchain; use --native to execute entirely on the host.

Usage:

./create_helper_files.sh [--native|--no-docker] [--docker-image IMAGE] \
                         [--docker-arg ARG] [--docker-volume SPEC] \
                         [--overwrite] [--scan_until_file FILE] \
                         [--scan_interval SECONDS] BAM_DIRECTORY CHROM_SIZES_FILE

Key Features:

  • Docker-backed execution with optional native fallback (--native)
  • Parallel processing with configurable worker count (MAX_JOBS)
  • Continuous monitoring for streaming workflows (--scan_until_file)
  • Smart overwrite handling with per-file skip logic (--overwrite)
  • Single-pass BAM streaming to emit BigWig and fragments simultaneously

Important Options:

  • --native, --no-docker — run directly on the host without Docker
  • --docker-image IMAGE — override the default biodepot/chromap:0.3.2 container
  • --docker-arg ARG — forward extra flags to docker run (repeatable)
  • --docker-volume SPEC — mount additional host paths (host:container, repeatable)
  • --overwrite — replace existing .bw and .fragments.bed.gz
  • --scan_until_file FILE / --scan_interval SEC — enable continuous polling mode
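
Conceptually, --scan_until_file turns the script into a polling loop along these lines (a sketch of the behavior, not the script's internals; process_one is a hypothetical stand‑in for the per‑BAM work):

# Poll for new BAMs until the done file appears.
while [ ! -e "$SCAN_UNTIL_FILE" ]; do
  for bam in "$BAM_DIRECTORY"/*.bam; do
    [ -e "$bam" ] || continue                      # glob matched nothing yet
    [ -e "${bam%.bam}.bw" ] || process_one "$bam"  # skip completed samples
  done
  sleep "${SCAN_INTERVAL:-60}"
done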

Environment Variables:

  • MAX_JOBS, SORT_MEM, OVERWRITE, SCAN_UNTIL_FILE, SCAN_INTERVAL
  • CREATE_HELPER_DOCKER_IMAGE — override the default Docker image without CLI flags

Examples:

# Default Docker-backed execution
./create_helper_files.sh /data/bams /ref/hg38.chrom.sizes

# Run natively in the current environment
./create_helper_files.sh --native /data/bams /ref/hg38.chrom.sizes

# Continuous monitoring until done file appears
MAX_JOBS=8 ./create_helper_files.sh --scan_until_file /workflow/done.txt \
                                    /data/bams /ref/chrom.sizes

# Provide an extra volume and custom Docker image
./create_helper_files.sh --docker-image my/image:tag \
                         --docker-volume /scratch:/scratch \
                         /data/bams /ref/hg38.chrom.sizes

# Force overwrite existing outputs
./create_helper_files.sh --overwrite /data/bams /ref/chrom.sizes

Peak Calling (generate_peaks.sh)

A comprehensive peak calling script supporting MACS2 and/or MACS3 with advanced workflow features.

Usage:

./generate_peaks.sh [OPTIONS] BAM_DIRECTORY GENOME_SIZE [BLACKLIST_BED]

Key Features:

  • Flexible Peak Callers: Choose MACS2, MACS3, or both (--peak-caller)
  • Parallel Processing: Configurable concurrent jobs with progress tracking
  • Continuous Monitoring: Process BAM files as they appear in workflows
  • Blacklist Filtering: Optional post-processing peak filtering
  • Overwrite Control: Smart detection of existing outputs
  • Tool Validation: Automatic checking of MACS2/MACS3 availability

Options:

  • --peak-caller {macs2|macs3|macs23}: Peak caller selection (default: macs23)
  • --overwrite: Force overwrite of existing peak files
  • --scan_until_file FILE: Enable continuous monitoring mode
  • --scan_interval SECONDS: Scan interval for continuous mode (default: 60s)

Arguments:

  • BAM_DIRECTORY: Directory containing BAM files to process
  • GENOME_SIZE: Genome size for MACS (e.g., "hs", "mm", or numeric value)
  • BLACKLIST_BED: Optional blacklist BED file for filtering peaks
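
For orientation, a single‑sample MACS2 call consistent with the documented output names might look like this (a hedged sketch; the exact flags used inside generate_peaks.sh may differ):

# -n sample.macs2 yields sample.macs2_peaks.narrowPeak, sample.macs2_summits.bed,
# and sample.macs2_peaks.xls, matching the output list below.
macs2 callpeak -t sample.bam -f BAMPE -g hs -n sample.macs2 \
      --outdir "$BAM_DIRECTORY" --keep-dup all 2> sample.macs2.log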

Environment Variables:

  • MAX_JOBS: Maximum parallel jobs (default: number of CPU cores)
  • PEAK_CALLER: Peak caller selection (default: macs23)
  • OVERWRITE: Set to 1 to overwrite existing files (default: 0)

Examples:

# Basic peak calling with both MACS2 and MACS3
./generate_peaks.sh /data/bams hs

# Use only MACS2 with blacklist filtering
./generate_peaks.sh --peak-caller macs2 /data/bams hs /ref/blacklist.bed

# Continuous processing for workflow integration
./generate_peaks.sh --scan_until_file /workflow/done.txt /data/bams mm

# Parallel processing with custom settings
MAX_JOBS=8 ./generate_peaks.sh --overwrite --peak-caller macs23 /data/bams hs

Output Files:

  • ${sample_id}.macs2_peaks.narrowPeak - MACS2 narrow peaks
  • ${sample_id}.macs2_summits.bed - MACS2 peak summits
  • ${sample_id}.macs3_peaks.narrowPeak - MACS3 narrow peaks (if enabled)
  • ${sample_id}.macs3_summits.bed - MACS3 peak summits (if enabled)
  • ${sample_id}.macs2_peaks.filtered.narrowPeak, ${sample_id}.macs3_peaks.filtered.narrowPeak - Blacklist-filtered peaks (if blacklist provided)
  • ${sample_id}.macs2.log, ${sample_id}.macs3.log - Peak calling logs

Using ChromBPNet with These Outputs

  • Primary Pipeline: Use Tn5‑shifted bigWig (<sample>_unstranded.bw) and peak set (MACS2 or MACS3 *.narrowPeak)
  • Helper File Generation: Use create_helper_files.sh to build BigWig and fragments files (Docker by default, --native to stay on host)
  • Peak Calling: Use generate_peaks.sh for flexible MACS2/MACS3 peak calling with blacklist filtering
  • Blacklist Filtering: Apply by setting BLACKLIST_BED during the pipeline run, or pass the optional BLACKLIST_BED argument to generate_peaks.sh
  • Compatibility: Ensure contig names match across bigWig, peaks, and reference files

Troubleshooting & Performance

  • Parallelism: JOBS = concurrent samples; THREADS = Chromap threads per sample.
  • RAM caps: set SORT_MEM (GNU sort) and SAMTOOLS_SORT_MEM (samtools sort). Point TMPDIR at a fast disk.
  • Y readnames memory: deduplication is done by sort -u with SORT_MEM_YNAMES (default 64M). Increase it if disk is slow; decrease it if RAM is tight. Ensure TMPDIR has space for spill files.
  • Missing bigWigs: confirm bedGraphToBigWig is available (it is included in the Docker image); set SAVE_BEDGRAPH=1 to keep the intermediate bedGraph and verify its sort order with sort -c -k1,1 -k2,2n (see the checklist below).
  • Y contigs: auto‑detected from CHROM_SIZES; override with --y-contig-labels as needed.
  • Hangs: check disk space and I/O performance, and monitor Docker container resource usage.
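
A quick checklist for the missing‑bigWig case (illustrative commands):

command -v bedGraphToBigWig                      # tool on PATH (bundled in the Docker image)?
sort -c -k1,1 -k2,2n sample_unstranded.bedGraph  # bedGraph sorted? (requires SAVE_BEDGRAPH=1)
df -h "${TMPDIR:-/tmp}"                          # room for sort spill files?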

Technical Details

Processing Workflow

  1. FASTQ Discovery: Scans FASTQ_DIR for paired-end files
  2. Sample Pairing: Auto-matches R1/R2 files by naming patterns
  3. Chromap Alignment: Runs bulk paired-end alignment with Tn5 settings
  4. Stream Processing: Single-pass SAM processing with tee for:
    • Y-chromosome read extraction (AWK)
    • Fragment generation (AWK + TLEN; see the sketch after this list)
    • BAM coordinate sorting (samtools)
    • BigWig generation (ChromBPNet helper)
  5. Output Organization: Moves files to structured directory hierarchy
  6. Manifest Generation: Creates comprehensive tracking TSV
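
For the fragment branch, a minimal sketch of TLEN‑derived fragments (an assumption about the awk involved; whether fragment ends receive additional Tn5 shifts is not specified here). Each properly paired leftmost mate carries a positive TLEN, so the fragment spans [POS−1, POS−1+TLEN) in 0‑based BED coordinates:

# Properly paired reads only; keep the mate with positive TLEN.
samtools view -f 0x2 sample.bam |
  awk 'BEGIN{OFS="\t"} $9 > 0 { print $3, $4 - 1, $4 - 1 + $9 }' \
  > sample.fragments.bed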

ChromBPNet Integration

When the legacy --ChromBPNet flag is enabled (now deprecated; see Legacy/deprecated under Command Line Flags):

  • Fragments stream directly to helper via stdin
  • Applies ATAC Tn5 shifts: +4 (forward), -5 (reverse)
  • Uses --per-sample-shift for adaptive shift estimation
  • Falls back to bedGraphToBigWig if helper fails (unless --no-bw-fallback)

Memory and Performance

  • Streaming architecture minimizes memory footprint
  • Concurrent job processing with configurable limits
  • Single-pass SAM processing reduces I/O overhead
  • In-memory sorting with configurable thread counts

Testing Framework

The pipeline includes comprehensive test suites for all major components to ensure reliability and compatibility.

Test Scripts Overview

All test scripts are located in the tests/ directory and can be run independently or as part of a comprehensive validation suite.

Core Pipeline Tests

tests/test_overwrite_flag.sh

  • Tests the --overwrite functionality in runChromap.sh
  • Validates skipping existing files vs. forced reprocessing
  • Ensures proper file handling and directory structure

tests/run_serial_parallel_compare.sh

  • Compares serial vs. parallel processing results
  • Validates output consistency across different concurrency levels
  • Performance benchmarking and correctness verification

Helper File Generation Tests

tests/test_chrombpnet_native.sh

  • Tests create_helper_files.sh in native mode (--native)
  • Validates BigWig and fragments file generation
  • Tests single-threaded and multi-threaded execution
  • Verifies BAM sorting and paired-end processing

tests/test_chrombpnet_docker.sh

  • Tests create_helper_files.sh inside Docker
  • Validates Docker volume mounting and tool availability
  • Tests container-specific environment variables
  • Ensures compatibility with biodepot/chromap:0.3.2

tests/test_chrombpnet_overwrite.sh

  • Tests overwrite functionality in create_helper_files.sh
  • Validates file existence checking and skip behavior
  • Tests both flag-based and environment variable overrides

tests/test_chrombpnet_continuous.sh

  • Tests continuous monitoring functionality (--scan_until_file) in create_helper_files.sh
  • Validates file detection, processing queues, and done file handling
  • Tests custom scan intervals and interrupt handling
  • Simulates workflow automation scenarios

Peak Calling Tests

tests/test_generate_peaks_basic.sh

  • Tests basic generate_peaks.sh functionality
  • Validates MACS2/MACS3 peak calling with different configurations
  • Tests overwrite functionality and parallel processing
  • Validates tool availability checking and error handling

tests/test_generate_peaks_continuous.sh

  • Tests continuous monitoring features in generate_peaks.sh
  • Validates file detection and done file handling
  • Tests custom scan intervals and interrupt handling
  • Simulates automated peak calling workflows

tests/test_generate_peaks_docker.sh

  • Tests generate_peaks.sh within Docker environment
  • Validates container-based peak calling with volume mounting
  • Tests parallel processing and tool availability in containers
  • Ensures Docker compatibility and output file generation

Running Tests

Individual Test Execution:

# Run specific test suite
./tests/test_chrombpnet_native.sh
./tests/test_generate_peaks_basic.sh
./tests/test_chrombpnet_docker.sh

# Run with timeout for continuous tests
timeout 60 ./tests/test_chrombpnet_continuous.sh
timeout 120 ./tests/test_generate_peaks_docker.sh

Batch Test Execution:

# Run all ChromBPNet tests
for test in tests/test_chrombpnet_*.sh; do
    echo "Running $test..."
    timeout 60 "$test" || echo "Test $test failed or timed out"
done

# Run all peak calling tests
for test in tests/test_generate_peaks_*.sh; do
    echo "Running $test..."
    timeout 120 "$test" || echo "Test $test failed or timed out"
done

Test Data Requirements

Tests use sample data from the /storage/ATAC-short-test/ directory:

  • Small BAM files for quick validation
  • Representative ATAC-seq data structure
  • Compatible with both native and Docker environments

Test Output Validation

Each test script validates:

  • Output File Generation: Checks for expected output files
  • File Integrity: Validates file sizes and formats
  • Process Behavior: Tests skip/overwrite logic and error handling
  • Tool Compatibility: Ensures proper tool availability and execution
  • Docker Integration: Validates container mounting and execution
  • Continuous Monitoring: Tests workflow automation features

Debugging Failed Tests

Common Issues and Solutions:

# Check Docker availability
docker info

# Verify test data exists
ls -la /storage/ATAC-short-test/

# Check tool availability
command -v macs2 macs3 bedtools samtools

# Monitor test execution with verbose output
bash -x ./tests/test_name.sh

# Check test logs and temporary directories
ls -la /tmp/test-*

Test Environment Cleanup:

# Clean up test directories
rm -rf /tmp/test-*
rm -rf /tmp/*-test*

# Stop any hanging Docker containers
docker ps -q --filter ancestor="biodepot/chromap:0.3.2" | xargs -r docker kill

Contributing Test Cases

When adding new features:

  1. Create corresponding test scripts in tests/ directory
  2. Follow existing naming conventions: test_component_feature.sh
  3. Include comprehensive validation: file existence, content verification, error handling
  4. Test both native and Docker environments when applicable
  5. Add timeout handling for long-running or continuous tests
  6. Document test requirements and expected behavior
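
A minimal skeleton consistent with these conventions (hypothetical throughout; adapt the input BAM name, chrom.sizes path, and assertions to the feature under test):

#!/usr/bin/env bash
# tests/test_example_feature.sh (skeleton only)
set -euo pipefail
tmp=$(mktemp -d /tmp/test-example.XXXXXX)
trap 'rm -rf "$tmp"' EXIT

cp /storage/ATAC-short-test/sample.bam "$tmp"/   # placeholder test BAM
./create_helper_files.sh --native "$tmp" /ref/chrom.sizes

[ -s "$tmp/sample.bw" ] || { echo "FAIL: missing bigWig"; exit 1; }
echo "PASS"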

Contributing

This pipeline is designed for bulk ATAC-seq processing with a Y-chromosome focus and ChromBPNet compatibility. For modifications:

  1. Test thoroughly: Run relevant test suites before and after changes
  2. Use DRYRUN mode: Test pipeline changes with DRYRUN=1 first
  3. Validate outputs: Ensure output structure and manifest consistency
  4. Docker compatibility: Test changes in Docker environment
  5. Update documentation: Document new flags, features, and test cases
  6. Add tests: Create test cases for new functionality

License

MIT License - see LICENSE file for details.
