Efficient random access to subsequences in FASTA files using byte-level seeking.
    pip install fastaccess

From source (includes C++ backend for better performance):

    pip install -e .

The C++ backend requires a C++17 compiler and CMake 3.15+. If either is unavailable, the package falls back to pure Python.
    from fastaccess import FastaStore

    fa = FastaStore("genome.fa")  # Builds index, caches for next time
    seq = fa.fetch("chr1", 1000, 2000)  # 1-based inclusive coordinates

- `path`: Path to FASTA file (plain or gzip-compressed `.fa.gz`)
- `use_cache`: Save/load index from `.fidx` cache file
- `cache_dir`: Custom directory for cache file (useful for read-only FASTA directories)

| Method | Description |
|---|---|
| `fetch(name, start, stop, reverse_complement=False)` | Fetch subsequence (1-based inclusive) |
| `fetch_many(queries)` | Batch fetch list of `(name, start, stop)` tuples |
| `list_sequences()` | Get all sequence names |
| `get_length(name)` | Get sequence length |
| `get_description(name)` | Get FASTA header description |
| `get_info(name)` | Get dict with name, description, length |
| `rebuild_index()` | Force rebuild index and update cache |
| `is_cached()` | Check if loaded from cache |
| `cache_exists()` | Check if cache file exists |
| `get_cache_path()` | Get cache file path |
| `delete_cache()` | Delete cache file |

- `KeyError`: Sequence name not found
- `ValueError`: Invalid coordinates (start < 1, stop < start, stop > length)

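The error rules above can be sketched as a small standalone check. This is only an illustration of the documented conditions, not the library's own code: `validate_region` and the `lengths` mapping are hypothetical names introduced here.

```python
def validate_region(lengths, name, start, stop):
    """Check a 1-based inclusive query against known sequence lengths.

    `lengths` is a hypothetical dict mapping sequence name -> length.
    Returns the equivalent 0-based half-open (begin, end) pair.
    """
    if name not in lengths:
        raise KeyError(f"sequence name not found: {name!r}")
    length = lengths[name]
    if start < 1 or stop < start or stop > length:
        raise ValueError(
            f"invalid coordinates {start}-{stop} for {name!r} (length {length})"
        )
    # 1-based inclusive [start, stop] equals 0-based half-open [start - 1, stop).
    return start - 1, stop
```

Returning the 0-based half-open pair makes the result directly usable as a Python slice, so a 1-based query of `1..100` covers the same bases as `seq[0:100]`.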
- Random access: Uses `seek()` to fetch only required bytes
- Index caching: 7-40x faster reloading via `.fidx` cache files
- Gzip support: Reads `.fa.gz` files directly
- 1-based inclusive coordinates: Standard bioinformatics convention
- Format support: Wrapped/unwrapped sequences, Unix/Windows line endings
- Uppercase output: All sequences returned uppercase

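To make the byte-level seeking idea concrete, here is a minimal pure-Python sketch in the style of a `faidx` index. It is not the fastaccess implementation: `build_offset_index` and `fetch_range` are hypothetical helpers, and the sketch assumes an uncompressed file where every line of a sequence except the last has the same width.

```python
def build_offset_index(path):
    """Scan the FASTA file once, recording per sequence its length, the byte
    offset of its first base, bases per line, and bytes per line."""
    index = {}
    name = None
    pos = 0
    with open(path, "rb") as fh:
        for raw in fh:
            line = raw.rstrip(b"\r\n")
            if raw.startswith(b">"):
                # Sequence name is the first word of the header line.
                name = line[1:].split()[0].decode("ascii")
                index[name] = {"length": 0, "offset": pos + len(raw),
                               "line_bases": 0, "line_bytes": 0}
            elif name is not None and line:
                entry = index[name]
                if entry["line_bases"] == 0:
                    # The first data line defines the wrapping width;
                    # len(raw) also counts the "\n" or "\r\n" terminator.
                    entry["line_bases"] = len(line)
                    entry["line_bytes"] = len(raw)
                entry["length"] += len(line)
            pos += len(raw)
    return index


def fetch_range(path, index, name, start, stop):
    """Read only the bytes covering the 1-based inclusive range [start, stop]."""
    entry = index[name]
    bases, width = entry["line_bases"], entry["line_bytes"]

    def byte_offset(i):
        # i is a 0-based base index within the sequence.
        return entry["offset"] + (i // bases) * width + (i % bases)

    first = byte_offset(start - 1)
    last = byte_offset(stop - 1)
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    # Drop line terminators that fall inside the range, then uppercase.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii").upper()
```

Because the byte offset is computed arithmetically from the line width, the cost of a fetch depends on the size of the requested range rather than on the size of the file.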
| Operation | Python | C++ | Speedup |
|---|---|---|---|
| Index build (10MB) | 70 ms | 5 ms | 13x |
| Reverse complement (8 KB) | 0.21 ms | 0.015 ms | 14x |
| Small fetch (100 bp) | 0.017 ms | 0.017 ms | 1x |
| Large fetch (100 KB) | 0.36 ms | 0.35 ms | 1x |
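For reference, the reverse-complement operation benchmarked above can be written in a few lines of pure Python. This is a sketch of the general technique using `str.translate`, assuming DNA input with `N` as the only ambiguity code; it is not necessarily how either backend implements it.

```python
# Translation table for uppercase DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Uppercase the sequence, complement each base, then reverse the result."""
    return seq.upper().translate(_COMPLEMENT)[::-1]
```

Characters outside the table pass through `translate` unchanged, so a fuller version would need to decide how to handle IUPAC ambiguity codes and RNA input.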
Check if C++ backend is active:

    from fastaccess import using_cpp_backend
    print(using_cpp_backend())  # True if available

Human genome (3 GB):

- First load: ~2 seconds (builds index)
- With cache: 0.05 seconds (40x faster)

Cache is automatically invalidated when the FASTA file changes.
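How the library detects a changed file is not specified here. One common approach, sketched below under that caveat, is to store the FASTA file's size and modification time alongside the index and treat the cache as stale when either differs. The JSON layout and all function names are assumptions made for illustration, not the `.fidx` format.

```python
import json
import os


def file_signature(path):
    """Cheap fingerprint of a file: its size and modification time."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def save_cached_index(fasta_path, cache_path, index):
    """Write the index together with the FASTA file's current signature."""
    payload = {"signature": file_signature(fasta_path), "index": index}
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_cached_index(fasta_path, cache_path):
    """Return the cached index, or None if the cache is missing, unreadable,
    or was written for a different version of the FASTA file."""
    try:
        with open(cache_path, "r", encoding="utf-8") as fh:
            cached = json.load(fh)
    except (OSError, ValueError):
        return None
    if not isinstance(cached, dict):
        return None
    if cached.get("signature") != file_signature(fasta_path):
        return None  # FASTA changed since the cache was written
    return cached.get("index")
```

A caller would try `load_cached_index` first and fall back to rebuilding the index and calling `save_cached_index` when it returns `None`.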
    from fastaccess import FastaStore

    fa = FastaStore("hg38.fa")

    # Get sequence info
    print(fa.list_sequences())  # ["chr1", "chr2", ...]
    print(fa.get_length("chr1"))  # 248956422

    # Fetch regions
    seq = fa.fetch("chr1", 1000, 2000)
    rc = fa.fetch("chr1", 1000, 2000, reverse_complement=True)

    # Batch fetch
    regions = [("chr1", 1, 100), ("chr2", 500, 600)]
    sequences = fa.fetch_many(regions)

- Python 3.8+
- No runtime dependencies (pure Python fallback always works)

C++ backend (optional):

- C++17 compiler
- CMake 3.15+

Limitations:

- ASCII sequences only (DNA/RNA)
- Gzip files require full decompression (no random access within compressed data)

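The gzip limitation follows from how the format works: a gzip stream can only be decoded from its beginning. The sketch below uses Python's standard `gzip` module to show what "seeking" into compressed data amounts to. `read_from_gzip` is a hypothetical helper for illustration, not part of fastaccess.

```python
import gzip


def read_from_gzip(path, offset, size):
    """Read `size` bytes starting at uncompressed position `offset`."""
    with gzip.open(path, "rb") as fh:
        # seek() on a gzip file is emulated: everything before `offset`
        # is decompressed and discarded, so the cost grows with the offset.
        fh.seek(offset)
        return fh.read(size)
```

The call returns the right bytes, but reaching a position near the end of a large compressed genome means decompressing nearly the whole file first.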
License: MIT