Skip to content

gersteinlab/ASTRO

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ASTRO

The usage instructions for the ASTRO Python package are as follows.

1. Functional Overview

Demultiplexing: Adapter trimming, UMI, and Barcode splitting.
Genome Mapping: Uses STAR to align reads to the genome and optionally removes duplicate reads using either samtools markdup or a custom deduplication module.
Feature Counting: Calculates gene expression from alignment results based on GTF annotation files and outputs an expression matrix.
Feature Filtering: Filters out low-quality or abnormal genes/barcodes based on user-defined thresholds.

2. Installation Guide

External dependencies: ASTRO requires the following external tools to function - STAR, bedtools, samtools, and cutadapt. If you have Python 3.6+ installed locally, follow these steps to set it up. You can install from source repository or compressed package:

2.1 Clone the repository:

git clone [email protected]:gersteinlab/ASTRO.git

2.2 Enter the directory named "python":

cd python

2.3 Install dependencies and build/install:

pip install .

2.4 Check if the installation was successful:

If you see the help documentation, the installation is complete:

ASTRO --help

Alternative: Docker Installation

The Docker image is hosted on Docker Hub and can be downloaded using the following command:

docker pull yc774/astro:v1.0

Alternative: Apptainer/Singularity

You can also run ASTRO using Apptainer (formerly Singularity). Download the pre-built .sif container directly from Zenodo (https://doi.org/10.5281/zenodo.17329668) or Gerstein lab website (http://archive2.gersteinlab.org/proj/ASTRO/).

3.Parameter Description

The ASTRO script accepts parameters via the command line or a JSON file. The main parameters are listed below. Certain parameters are only required for specific steps. If these steps are executed (--steps control) but their parameters are missing, the program will perform a runtime check and exit with an error.

Parameter Name Required Relevant Steps Default Description
barcode_read (R1) Yes Step 1 None FASTQ file containing spatial barcode and UMI information
transcript_read (R2) Yes Step 1 None FASTQ file containing main RNA sequences
barcode_file Yes Step 1 None Text file with barcode information; must have at least three columns: barcode, tX, tY
outputfolder Yes All Steps - Output directory for results
starref Yes Step 2 - STAR genome index directory
gtffile Yes Steps 2,3 None Path to the GTF file
PrimerStructure Yes Step 1 - Primer structure for transcript_read (R2), e.g. AAGCAGTGGTATCAACGCAGAGTGAATGGG_b_A{10}N{150}
StructureUMI Yes Step 1 - UMI structure definition, e.g. CAAGCGTTGGCTTCTCGCATCT_10
StructureBarcode Yes Step 1 - When manually_set_barcode_details is false, you must specify the positions of all spatial barcodes in order and join them with colon ":" in the option. You can describe each position in two ways:
1. Barcode before/after a linker: 8_ATCCACGTGCTT or AACCAAGATCG_8 means the barcode is 8 bp before or after the linker, respectively. 2. Barcode between two linkers: GAGGCCAAGATCG_8_GTGGCCGATGTTTCGC means the barcode is 8 bp long and lies between the two linkers. (Here, 8bp is the expected barcode length.).
When manually_set_barcode_details is true, you still specify the positions for ASTRO to search spatial barcodes in order and join them with colon ":".
Then, 1. Barcode before/after a linker (search window), 20_ATCCACGTGCTT or AACCAAGATCG_20 indicates ASTRO will search within a 20-bp window before or after the linker, respectively. (The value 20 is not the expected barcode length.) 2. Barcode between two linkers (no explicit length): TTGAGAGGCCAAGATC...GTGGCCGATGTTTC omits the length; ASTRO will search for the barcode between the two linkers.
threadnum No 16 Number of threads for parallel tasks like cutadapt, STAR, samtools, etc.
options No "" Enables extra modes, including:
H: Hard mode
M: Use samtools markup for deduplication
steps No 7 Specifies steps to execute; uses bitwise integers:
1: Step 1, Demultiplexing
2: Step 2, Genome Mapping (Interm. files req: combine.fq)
4: Step 3, Feature Counting and Filtering (Interm. files req: STAR/tempfiltered.bam)
For example, 7 = 1+2+4
STARparamfile4genome No NA File containing extra STAR parameters; specify the file path for custom STAR parameters
qualityfilter No "25:0.75" Quality filter threshold; default "25:0.75" means filtering if AS ≤ 25 and ≤ 0.75 of gene length. Set to 0:0 or NA to disable
addlowqreads No False Whether to remove genes/barcodes with abnormal row/column variance. Set to False to skip
filterlogratio No 2 Filters based on log2 variance differences; default is log2 > 2 (4x difference) for removal
workflow No new Specifies which pipeline to execute. If set to "old", the older workflow will run. If not specified or left as default, it will run the latest workflow.
barcodemode No Step 1 "spatial" When set to "singlecell", enables single-cell mode. If kept as "spatial", the pipeline runs in the conventional spatial mode.
barcode_threshold No Step 1 100 In single-cell mode, when extracting barcodes automatically, any barcode whose occurrence is below this threshold will be discarded.
barcodelength No Step 1 0 In single-cell mode, if greater than 0, only barcodes of length barcodelength will be retained. If set to 0, no length filtering is applied.
barcodeposition No Step 1 "NA" Specifies the position for extracting barcodes from R2 barcode reads (index_fq after singleCutadapt processing). Two formats:
1) <start>_<len>b: Extract from the 5' end, where start is the starting position (1-based) and len is the length.
   Example: 5_24b extracts 24 bp starting from position 5 (positions 5-28).
2) b<len>: Extract the last len bases from the 3' end.
   Example: b16 extracts the last 16 bp.
Note: The b<len1>_<len2> format has limitations and is only valid when len2 < len1. It extracts len2 bases from the len1-th position from the 3' end to the (len1-len2)-th position from the end.
When set to "NA", if manually_set_barcode_details=false (default), the program will automatically infer this parameter based on StructureBarcode (via the auto_set_barcodes() function).
barcode_file No (for single-cell mode)
Yes (for spatial mode)
Step 1 None or "notavailable" In spatial mode, this parameter must be provided by the user and must include barcode coordinate information. In single-cell mode, if a ready-made file is not available, set this to "notavailable", so ASTRO can generate a three-column barcode file automatically.
manually_set_barcode_details No Step 1 False If false, ASTRO will automatically set the barcode structures, and only needs neccssary input for the barcode information. If true, ASTRO needs details of how to extract barcodes. See details in StructureBarcode.
genes2check No Step 3 False If provided (a text file listing suspect genes, e.g. piRNAs/miRNAs), ASTRO runs an advanced check before feature counting. Genes failing this check are removed from the GTF annotation, effectively excluding them from the final expression matrix.
not_organize_result No Step 3 False ASTRO automatically delete temp files and compress intermediate files, this option will disable this step and preserve intermediate and temp files without organization.

3.1 Parameter Priority

ASTRO can receive parameters from the command line or a JSON file. JSON parameters can be specified as positional arguments (json_file_path1) or via --json_file_path. Priority is as follows:

1st: Command Line Parameters

Overrides JSON values if explicitly provided.

Example:

ASTRO --R1 R1.fq --R2 R2.fq \
--barcode_file spatial_barcodes.txt \
--gtffile hsa.no_piRNA.gtf --starref StarIndex/ \
--PrimerStructure AAGCAGTGGTATCAACGCAGAGTGAATGGG_b_A{10}N{150} \
--StructureUMI CAAGCGTTGGCTTCTCGCATCT_10 \
--StructureBarcode 20_ATCCACGTGCTTGAGAGGCCAGAGCATTCG:ATCCACGTGCTTGAGAGGCCAGAGCATTCG...GTGGCCGATGTTTCGCATCGGCGTACGACT \
--threadnum 16 \
--steps 7 \
--outputfolder output/

To enable hard mode (H) and samtools markdup deduplication (M), add:

--options HM

2nd: JSON File

  • --json_file_path: If specified, the program reads this file first.
    ASTRO --json_file_path myparams.json
  • json_file_path1: If --json_file_path is not provided, the program reads the positional JSON file.
    ASTRO parameter.json

3rd: Default Values

Used if neither command-line parameters nor JSON specify the value.

3.2 Key Modes Explanation

3.2.1 hard mode (H)

If options include H, it means that when a single read aligns to multiple genes, it will no longer only take the first gene but will record all aligned genes as a multi-gene form, separated by a hyphen.

3.2.2 M decides which way is used for remove depulicate reads.

If options include M, samtools markdup will be used to mark and remove duplicate reads. If M is not included, the built-in ASTRO deduplication logic will be used: This logic relies on UMIs and barcodes to determine duplicates and uses alignment score for filtering.

4.A Simple Example

4.1 Assume the following files are prepared:

  • example.barcodeRead.fastq.gz, example.transRead.fastq.gz: Input sequencing reads
  • spatial_barcodes.txt: Records coordinates and barcode sequences.
  • mouseIndex/: a STAR genome indexes for mouse genome.
  • mmu.mod.gtf: Gene annotation file

4.2 Create the following JSON file (example/test.json):

{
"barcode_read": "example.barcodeRead.fastq.gz",
"transcript_read": "example.transRead.fastq.gz",
"barcode_file": "spatial_barcodes.txt",
"PrimerStructure": "AAGCAGTGGTATCAACGCAGAGTGAATGGG_b_A{10}N{150}",
"StructureUMI": "CAAGCGTTGGCTTCTCGCATCT_10",
"StructureBarcode": "20_ATCCACGTGCTTGAGAGGCCAGAGCATTCG:ATCCACGTGCTTGAGAGGCCAGAGCATTCG...GTGGCCGATGTTTCGCATCGGCGTACGACT",
"threadnum": 1,
"steps": 7,
"outputfolder": "output/",
"gtffile": "mmu.mod.gtf",
"starref": "mouseIndex/",
"options": "H",
"barcodeposition": "b16",
"barcodelengthrange": "15_40"
}


## 4.3 Run the command:
```bash
ASTRO test.json

5. Single-Cell Mode

In the conventional spatial transcriptomics (spatial) mode, ASTRO requires a barcode_file that contains at least three columns (barcode sequence, X coordinate, and Y coordinate).

However, in single-cell mode (barcodemode="singlecell"), if the user does not provide an existing barcode file (or sets it to "notavailable"), ASTRO will, during Step 1 (Demultiplexing), automatically enumerate all possible barcodes from R2 (which typically contains cell barcode information) based on structurebarcode, and then filter them according to barcode_threshold (the minimum count of occurrences required for each barcode) and barcodelength (if greater than 0, only barcodes of that specific length are retained). It then generates an “auto-generated” three-column barcode file (in the form of barcode, i, i) for subsequent steps.

In the feature counting output, ASTRO will replace the temporary column names (ixi) with the actual single-cell barcodes, allowing these barcodes to be used directly as column names in the expression matrix. This mode is particularly suitable for single-cell sequencing scenarios where the precise barcodes are not known in advance, or when they need to be automatically extracted and filtered from raw sequence data.

If a barcode file is provided in single-cell mode, ASTRO will treat that file as a “whitelist”—after automatically enumerating barcodes, it retains only those that are both present in the whitelist and meet the threshold/length requirements. A three-column barcode file is then generated for the downstream workflow.

In single-cell mode, Step 4 (Feature Filtering) by default only creates the basic expression matrix and does not perform further row/column variance filtering. If needed, you can run that manually afterward or configure the relevant parameters.

6. Some further explantation

6.1 Symbol meanings in StructureBarcode / StructureUMI

A colon (:) means concatenating multiple segments.

“20_CGTTGGCTTCT”Means that, on the 3' end, we recognize the fixed adapter CGTTGGCTTCT.Then we take the 20 nucleotides to the left of that fixed sequence.In other words, find CGTTGGCTTCT at the read’s 3' end and extract the preceding 20 bases.

“CGTTGGCTTCT_20”Means that, on the 5' end, we recognize the fixed adapter CGTTGGCTTCT.Then we take the 20 nucleotides to the right of that fixed sequence.In other words, find CGTTGGCTTCT at the read’s 5' end and keep the next 20 bases.

“TTCTCGCATCT...ATCCACGTGCTTGA”Means we take what lies between two stable (known) sequences.Here, the left boundary is TTCTCGCATCT and the right boundary is ATCCACGTGCTTGA, so we extract whatever is in between them.

6.2 Two ways to write (or “point to”) barcode/UMI locations in R2

Method A: Use explicit numeric positions

When you already know the barcode is strictly located at positions (23–30, 61–68, 99–106) and the UMI is at (137–146), you can specify:

"StructureBarcode": "22_8:60_8:98_8"

"StructureUMI": "136_10"

For 22_8, it means: skip the first 22 bases, then keep the next 8 bases (covering R2 positions 23–30). For 60_8, it means: skip the first 60 bases, then keep the next 8 (positions 61–68). For 98_8, it means skip the first 98 bases, keep the next 8 (positions 99–106).

Similarly, 136_10 means skip 136 bases, then keep the next 10 (positions 137–146) for the UMI.

Caution: If the read does not strictly match these exact positions (for instance, if some reads have shorter length or shifted inserts), then using hard-coded positions can cause errors or incorrect trimming.

Method B: Rely on stable flanking sequences

In many protocols, the positions can shift a bit, but each barcode or UMI region is still bounded by stable/fixed adapter sequences. In that scenario, you can specify the left and right flanking adapters that sandwich the barcode or UMI.

For example, assume we know the read has the following layout of R2 (showing only some parts for illustration):

XX AGCGTTGGCTTCTCGCATCT BBBBBBBB ATCCACGTGCTTGAGCGCGCTGCATACTTG BBBBBBBB CCCATGATCGTCCGAAGGCCAGAGCATTCG BBBBBBBB GTGGCCGATGTTTCGCATCGGCGTACGACT UUUUUUUUUU XXXXX

Where:

  • X stands for arbitrary nucleotides (non-barcode)
  • B is the actual barcode
  • U stands for the UMI region
  • The bold segments are stable, known sequences that mark the boundaries before and after each barcode/UMI

By observation:

  • The segment at positions 23–30 is bounded by AGCGTTGGCTTCTCGCATCT (left) and ATCCACGTGCTTGAGCGCGCTGCATACTTG (right). The 8 bp sandwiched between these two adapters is the barcode.
  • Similarly, for positions 61–68, the stable adapters around that 8 bp region are ATCCACGTGCTTGAGCGCGCTGCATACTTG (left) and CCCATGATCGTCCGAAGGCCAGAGCATTCG (right).
  • For positions 99–106, the bounding adapters are CCCATGATCGTCCGAAGGCCAGAGCATTCG and GTGGCCGATGTTTCGCATCGGCGTACGACT.

Hence, you can write:

"StructureBarcode": "AGCGTTGGCTTCTCGCATCT...ATCCACGTGCTTGAGCGCGCTGCATACTTG : ATCCACGTGCTTGAGCGCGCTGCATACTTG...CCCATGATCGTCCGAAGGCCAGAGCATTCG : CCCATGATCGTCCGAAGGCCAGAGCATTCG...GTGGCCGATGTTTCGCATCGGCGTACGACT"

(where each colon : indicates concatenating those three sub-barcodes in order).

If the UMI is at positions 137–146, and it always follows the stable prefix GTGGCCGATGTTTCGCATCGGCGTACGACT, then you can define:

"StructureUMI": "GTGGCCGATGTTTCGCATCGGCGTACGACT_10"

meaning we look for that fixed adapter on the 5' end, and once found, we keep the next 10 bases as the UMI.

This approach gives flexibility in cases where the read length or positions vary slightly, as long as the bounding sequences remain identifiable.

6.3 pre-generated GTF annotation files

ASTRO requires specialized GTF files for its operation. As the construction of these files can be complex, we have pre-generated GTF files for the human (GRCh38) and mouse (mm39) genomes to facilitate efficient analysis. The pre-built files are provided in the Built_GTFs/ directory.

7. Genes2Check Advanced Filtering

If your experiment includes some suspicious genes (for example, piRNA or miRNA), you can list their gene IDs or gene names in a text file (e.g., genes2check.txt) and specify this file in the ASTRO parameters using --genes2check genes2check.txt (or by adding "genes2check": "genes2check.txt" in the JSON). When running Step 4 (Feature Counting), ASTRO will call its built-in advanced detection logic (getvalidedgtf_parallel) to determine whether to discard these suspicious genes based on their over-enrichment in corresponding control regions. If anomalous enrichment is detected, those entries will be removed before formal counting, thereby reducing false positives and yielding a more accurate gene expression matrix.

This feature relies on the BAM file (STAR/tempfiltered.bam) and its index (which will be created automatically if absent) generated in the previous step. It then checks each gene interval in a multithreaded manner. The detection algorithm uses statistical tests (such as Poisson tests), and if it concludes that a gene interval is significantly higher than the background, it deems the interval likely to be junk or a spurious mapping. You can customize the list of genes in genes2check.txt according to your needs (e.g., including lncRNA, miRNA, or piRNA).

Note: The gene names (or IDs) in genes2check.txt should match the entries in the ninth column of your GTF file.

8. legecy bash version

For previous users who want to continue using the bash version of ASTRO, we have put the old bash version in bash folder. However, the legcay bash version is deprecated and we strongly suggest to move to the python version.

9. ASTRO across different technologies

Although ASTRO was designed for spatial whole-transcriptome profiling of FFPE samples, it is a flexible pipeline that can be applied across multiple platforms and technologies. We have tried ASTRO on several datasets to demonstrate this compatibility. Scripts summarizing the usage of ASTRO on these datasets are provided in the compatibility/ folder.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 6