nf-scsajr

A pipeline to quantify alternative splicing in single-cell data. It uses segments (the parts of a gene between two adjacent splice sites) as features. For each segment it counts, in each barcode, the inclusion UMIs (which confirm that the segment is included in the RNA) and the exclusion UMIs (which confirm that the segment is excluded from the RNA). The resulting matrices are extremely sparse, so the pipeline calculates per sample*celltype pseudobulks and fits a GLM with a quasibinomial distribution to look for differences between cell types. Afterwards it performs GO and InterPro domain enrichment analyses and generates an HTML report, coverage plots for up to 100 best events, and rds files with counts and p-values (pv).

It uses a Java read counter for read counting and SAJR for the statistical analysis of alternative splicing.
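
As a rough illustration of the pseudobulk idea described above, the Python sketch below sums per-barcode inclusion and exclusion UMI counts for one segment into sample*celltype groups and computes a naive inclusion fraction. This is not the pipeline's code (the pipeline itself uses a Java counter and SAJR); the function names `pseudobulk` and `inclusion_fraction`, the dictionary layout, and the toy counts are all assumptions made for the example, and the fraction ignores any normalization SAJR may apply.

```python
from collections import defaultdict


def pseudobulk(counts, annotation):
    """Sum (inclusion, exclusion) UMI counts per (sample, celltype).

    counts:     {(sample, barcode): (inclusion, exclusion)} for ONE segment
                (hypothetical layout, chosen for this example)
    annotation: {(sample, barcode): celltype}
    """
    totals = defaultdict(lambda: [0, 0])
    for key, (inc, exc) in counts.items():
        celltype = annotation.get(key)
        if celltype is None:
            continue  # barcodes without a cell type are skipped in this sketch
        group = (key[0], celltype)
        totals[group][0] += inc
        totals[group][1] += exc
    return {group: tuple(pair) for group, pair in totals.items()}


def inclusion_fraction(inc, exc):
    """Naive fraction of UMIs supporting inclusion; None without coverage."""
    total = inc + exc
    return inc / total if total else None


# Toy data: most barcodes carry zero or one UMI for a given segment,
# which is why per-cell matrices are so sparse.
counts = {
    ("s1", "AAAC"): (1, 0),
    ("s1", "AAAG"): (0, 1),
    ("s1", "AAAT"): (2, 0),
    ("s1", "CCCC"): (1, 1),  # not present in the annotation
}
annotation = {
    ("s1", "AAAC"): "T cell",
    ("s1", "AAAG"): "T cell",
    ("s1", "AAAT"): "B cell",
}

pb = pseudobulk(counts, annotation)
for group, (inc, exc) in sorted(pb.items()):
    print(group, inc, exc, inclusion_fraction(inc, exc))
```

Pooling counts this way trades single-cell resolution for enough UMIs per group to estimate inclusion at all, which is what makes a per-celltype comparison feasible.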

Input data

The pipeline requires BAM files and a cell type annotation. The BAM files are listed in a TSV file with two columns (sample ID and path to the BAM file) and no header:

sample1 /path/to/bam1
sample2 /path/to/bam2

The cell type annotation is specified in another TSV file with the following columns:

sample barcode celltype
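
Before launching a run it can help to check that the two input files agree with each other. The sketch below is one possible pre-flight check, not part of the pipeline: the helper names `read_samples` and `missing_samples` are hypothetical, and it assumes that the annotation file, like the sample file, is tab-separated and has no header line (the text above does not say whether a header is expected).

```python
import csv
import io


def read_samples(handle):
    """Parse the sample file: two tab-separated columns, no header."""
    samples = {}
    for n, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2:
            raise ValueError(f"line {n}: expected 2 columns, got {len(row)}")
        sample, bam = row
        samples[sample] = bam
    return samples


def missing_samples(handle, samples):
    """Return sample IDs used in the annotation but absent from the sample file.

    Assumes three tab-separated columns (sample, barcode, celltype), no header.
    """
    missing = set()
    for n, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue
        if len(row) != 3:
            raise ValueError(f"line {n}: expected 3 columns, got {len(row)}")
        if row[0] not in samples:
            missing.add(row[0])
    return sorted(missing)


# In-memory stand-ins for samples.tsv and barcodes.tsv
samples_tsv = "sample1\t/path/to/bam1\nsample2\t/path/to/bam2\n"
barcodes_tsv = "sample1\tAAAC-1\tT cell\nsample3\tAAAG-1\tB cell\n"

samples = read_samples(io.StringIO(samples_tsv))
missing = missing_samples(io.StringIO(barcodes_tsv), samples)
print(missing)
```

With real files, replace the `io.StringIO(...)` objects with `open("samples.tsv", newline="")` and `open("barcodes.tsv", newline="")`.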

Create reference

nf-scSAJR is distributed with a pre-built human 2020A reference, which includes gene descriptions and protein domain annotation. If you are working with a non-human species, or if you want to use another annotation version, you can build the reference from a GTF file using:

nextflow main.nf \
 -entry reference \
 --gtf annotation.gtf \
 --outdir <path2reference> \
 -resume

Please keep in mind that a reference built this way has no domain annotation or gene descriptions, so the output of a pipeline run that uses it will not include domain enrichment analyses.

Run

nextflow main.nf \
 --SAMPLEFILE samples.tsv \
 --BARCODEFILE barcodes.tsv \
 --outdir sajr_out \
 --ref ref/human_2020A_chr \
 -resume

The pipeline relies on junction reads. So, while it can technically run on any BAM file with the cell barcode (CB BAM tag) set, it will hardly detect anything in 3' short-read data, and single-nuclei data can be very noisy. It seems to work reasonably well on single-cell 5' short reads and on 3' long reads.

TODO

Autodetect strand using part of the data (not all of it, as now)
Autodetect chr/not-chr annotation?
Fail smartly if filtering leaves no segments
Speed up quantification (rewrite it entirely? make it per-gene? parallelize?)
