Jonas Ibn-Salem edited this page Dec 23, 2020 · 8 revisions

Welcome to the EasyFuse wiki!

EasyFuse is a pipeline to efficiently detect fusion transcripts from RNA-seq data with high accuracy. EasyFuse uses five fusion gene detection tools (STAR-Fusion, InFusion, MapSplice2, Fusioncatcher, and SoapFuse) along with powerful read filtering, stringent re-quantification of candidates, and machine learning for highly specific and sensitive fusion gene prediction.

A manuscript describing the method and performance evaluations has been submitted for peer review and publication.

Installing EasyFuse

Follow this link for instructions on Installing EasyFuse

Running EasyFuse

Using paired-end FASTQ files, run EasyFuse in the following way:

processing.py -i /path/to/directory/of/fastq/files \
              -o /path/to/output/directory \
              -c /path/to/config/file \
              -u USERNAME \
              -p SLURM-Partition
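
When EasyFuse is started from another script, the call above can be assembled as an argument list. The following is a minimal sketch, not part of EasyFuse: the helper name `build_easyfuse_command` is hypothetical, and only the five flags shown above are used.

```python
import shlex


def build_easyfuse_command(fastq_dir, output_dir, config_file, username, partition):
    """Return the processing.py call from above as an argument list.

    Hypothetical helper for illustration; it only mirrors the flags
    -i, -o, -c, -u and -p documented on this page.
    """
    return [
        "processing.py",
        "-i", fastq_dir,
        "-o", output_dir,
        "-c", config_file,
        "-u", username,
        "-p", partition,
    ]


# Placeholder values copied from the command shown above.
cmd = build_easyfuse_command(
    "/path/to/directory/of/fastq/files",
    "/path/to/output/directory",
    "/path/to/config/file",
    "USERNAME",
    "SLURM-Partition",
)
# Print a shell-quoted version for inspection before running it.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`, which avoids manual shell quoting of paths that contain spaces.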

Processing multiple samples with paired FASTQ files is possible. In any case, the FASTQ files need to

  • be in the same folder
  • have the same base name for each pair
  • have unique names for each sample
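
The naming requirements can be checked before a run. The sketch below is an illustration only: this page does not state which suffix marks the two mates, so the pattern `_R1`/`_R2` (or `_1`/`_2`) before `.fastq`/`.fq` with an optional `.gz` is an assumption, and `group_fastq_pairs` is a hypothetical helper rather than an EasyFuse function.

```python
import re
from collections import defaultdict

# ASSUMPTION: mates are marked by _R1/_R2 or _1/_2 directly before the
# extension. EasyFuse itself may recognize a different convention.
MATE_PATTERN = re.compile(r"^(?P<base>.+?)_R?(?P<mate>[12])\.(fastq|fq)(\.gz)?$")


def group_fastq_pairs(file_names):
    """Group file names by base name.

    Returns (complete, incomplete, unmatched):
      complete   - dict mapping base name to (mate 1 file, mate 2 file)
      incomplete - sorted base names that lack one of the two mates
      unmatched  - file names that do not fit the pattern or duplicate a mate
    """
    mates_by_base = defaultdict(dict)
    unmatched = []
    for name in file_names:
        match = MATE_PATTERN.match(name)
        if match is None:
            unmatched.append(name)
            continue
        mates = mates_by_base[match.group("base")]
        if match.group("mate") in mates:
            # Two files claim the same mate of the same sample.
            unmatched.append(name)
        else:
            mates[match.group("mate")] = name
    complete = {
        base: (mates["1"], mates["2"])
        for base, mates in mates_by_base.items()
        if set(mates) == {"1", "2"}
    }
    incomplete = sorted(base for base in mates_by_base if base not in complete)
    return complete, incomplete, unmatched
```

Calling it with `os.listdir("/path/to/directory/of/fastq/files")` lists samples with a missing mate in `incomplete`, so they can be fixed before the pipeline is started.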


EasyFuse will automatically create a folder structure

Restarting EasyFuse

EasyFuse will generate a database samples.db, saving the progress and successfully completed steps. If this database exists, EasyFuse can be restarted using the same command as above and will resume processing from the last successfully completed step.

EasyFuse Config

To change the tool configuration, adjust paths to indices, or remove certain steps from the pipeline, edit the config file.

[general]
tools=QC,Readfilter,Fusioncatcher,Star,Starfusion,Infusion,Mapsplice,Soapfuse,Fetchdata,Summary
fusiontools=Fusioncatcher,Starfusion,Infusion,Mapsplice,Soapfuse
fd_tools=Fusiongrep,Contextseq,Starindex,ReadFilter2,ReadFilter2b,StaralignBest,BamindexBest,RequantifyBest
cis_near_distance=1000000
model_pred_threshold=0.75
tsl_filter=4,5,NA
requant_mode=best
context_seq_len=400
ref_genome_build=hg38
ref_trans_version=ensembl
queueing_system=slurm

tools

QC: FASTQ-QC and trimming to ensure high quality reads using FastQC and skewer
Readfilter: Read filtering step to remove normal mapping reads
Fusioncatcher,Star,Starfusion,Infusion,Mapsplice,Soapfuse: Alignment and fusion detection tools (STAR and at least one fusion detection tool are required to run the pipeline)
Fetchdata: Module that parses all tool outputs into a common format, calculates context sequences around the breakpoints and their translated peptide sequences, and realigns reads to those sequences for quantification
Summary: Summarizes the data into the final output; requires Fetchdata to finish successfully

fusiontools

Fusion detection tools used in the pipeline run (must match the fusion tools listed in tools above)
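
The constraints on tools and fusiontools can be checked with Python's standard `configparser`. This is a minimal sketch under stated assumptions: `check_tool_selection` is a hypothetical helper, the set of fusion tool names is taken from the example config above, and EasyFuse's own validation may differ.

```python
import configparser

# Fusion detection tool names as they appear in the example config above.
KNOWN_FUSION_TOOLS = {"Fusioncatcher", "Starfusion", "Infusion", "Mapsplice", "Soapfuse"}


def _split(value):
    """Split a comma-separated config value into a list of names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_tool_selection(config_text):
    """Return a list of problems found in the [general] tool selection."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    general = parser["general"]
    tools = _split(general.get("tools", ""))
    fusiontools = _split(general.get("fusiontools", ""))

    problems = []
    # STAR and at least one fusion detection tool are required.
    if "Star" not in tools:
        problems.append("Star is missing from tools")
    if not fusiontools:
        problems.append("fusiontools lists no fusion detection tool")
    # fusiontools has to match the fusion tools listed in tools.
    selected = {tool for tool in tools if tool in KNOWN_FUSION_TOOLS}
    if selected != set(fusiontools):
        problems.append("fusiontools does not match the fusion tools in tools")
    # Summary needs Fetchdata to finish successfully.
    if "Summary" in tools and "Fetchdata" not in tools:
        problems.append("Summary requires Fetchdata")
    return problems
```

An empty list means the selection is consistent with the rules described on this page. To check a config file, pass the text of the file given to `-c`.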

fd_tools

Changing this line is not recommended; EasyFuse will not work as intended!
Fusiongrep: Parses output from detection tools into Detected_Fusions.csv
Contextseq: Calculates context_sequences and annotates fusion genes into Context_Seq.csv
Starindex: Generates STAR-index from context sequences
ReadFilter2: Generates alignment from
ReadFilter2b: Generates FASTQs from bam file
StaralignBest: Aligns FASTQs to context sequence star index
BamindexBest: Indexes resulting bam file from StaralignBest
RequantifyBest: Calculates mapping reads per 100 million reads

other

Changing these entries is not recommended; EasyFuse will not work as intended!
cis_near_distance: Distance between neighboring genes to be qualified as "cis_near" when detected as fusion
model_pred_threshold:
tsl_filter: Threshold for transcripts to be filtered out by tsl_level
requant_mode:
context_seq_len: Length of context sequence from breakpoint in either direction
ref_genome_build: Version of reference genome (e.g. hg38, hg19 etc.). It is strongly recommended to use hg38.
ref_trans_version: Transcript version (only Ensembl supported)
queueing_system: Which queueing system to use (only SLURM supported)
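
The entries above hold different value types (integers, a float, a comma-separated list, plain strings). The sketch below shows one way to read them with `configparser`; it is not EasyFuse's own parser. `GeneralSettings` and `read_general_settings` are hypothetical names, and the two checks only mirror the statements that Ensembl and SLURM are the only supported values.

```python
import configparser
from dataclasses import dataclass


@dataclass
class GeneralSettings:
    cis_near_distance: int
    model_pred_threshold: float
    tsl_filter: list
    requant_mode: str
    context_seq_len: int
    ref_genome_build: str
    ref_trans_version: str
    queueing_system: str


def read_general_settings(config_text):
    """Parse the non-tool entries of [general] into typed values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    general = parser["general"]
    settings = GeneralSettings(
        cis_near_distance=general.getint("cis_near_distance"),
        model_pred_threshold=general.getfloat("model_pred_threshold"),
        # Kept as strings because the list mixes numbers and "NA".
        tsl_filter=[item.strip() for item in general["tsl_filter"].split(",")],
        requant_mode=general["requant_mode"],
        context_seq_len=general.getint("context_seq_len"),
        ref_genome_build=general["ref_genome_build"],
        ref_trans_version=general["ref_trans_version"],
        queueing_system=general["queueing_system"],
    )
    # Only Ensembl transcripts and SLURM are supported according to this page.
    if settings.ref_trans_version.lower() != "ensembl":
        raise ValueError("ref_trans_version: only ensembl is supported")
    if settings.queueing_system.lower() != "slurm":
        raise ValueError("queueing_system: only slurm is supported")
    return settings
```

With the example config above, `read_general_settings` yields `cis_near_distance` as the integer 1000000 and `tsl_filter` as the list `["4", "5", "NA"]`.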

[references]
ensembl_genome_fasta_hg38=
ensembl_genome_fastadir_hg38=
ensembl_genome_sizes_hg38=
ensembl_genes_fasta_hg38=
ensembl_genes_gtf_hg38=
ensembl_genes_adb_hg38=
ensembl_genes_tsl_hg38=

[indices]
ensembl_star_hg38_sjdb49=/projects/data/human/ensembl/GRCh38.86/STAR_idx/
ensembl_bowtie_hg38=/projects/data/human/ensembl/GRCh38.86/bowtie_index/hg38
ensembl_starfusion_hg38=/projects/data/human/ensembl/GRCh38.86/starfusion_index/
ensembl_fusioncatcher_hg38=/projects/data/human/ensembl/GRCh38.86/fusioncatcher_index/


[otherFiles]
ensembl_infusion_cfg_hg38=/projects/data/human/ensembl/GRCh38.86/infusion_index/infusion.cfg
ensembl_soapfuse_cfg_hg38=/code/SOAPfuse/1.27/config/config_h86.txt
easyfuse_model=/code/easyfuse/1.3.0/data/model/Fusion_modeling_IVAC_BNT_v16.model.requant_and_boundary_org.randomForest.rds

references

All paths to the references must be absolute paths. Examples are given for Ensembl86
ensembl_genome_fasta_hg38: Fasta file containing complete genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa)
ensembl_genome_fastadir_hg38: Directory with single fasta from each chromosome
ensembl_genome_sizes_hg38: Genome sizes calculated by STAR (chNameLength)
ensembl_genes_fasta_hg38: cDNA file containing all ensembl transcripts (Homo_sapiens.GRCh38.cdna.all.fa)
ensembl_genes_gtf_hg38: Ensembl gtf file (Homo_sapiens.GRCh38.86.gtf)
ensembl_genes_adb_hg38: gff in database form (Homo_sapiens.GRCh38.86.gff3)
ensembl_genes_tsl_hg38: blacklist of known low tsl level transcripts, based on the Ensembl gtf (Homo_sapiens.GRCh38.86.gtf)
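
Because every reference path has to be absolute, empty or relative entries can be flagged before starting a run. The following is a minimal sketch: `find_path_problems` is a hypothetical helper, and it only inspects the path strings without testing whether the files exist.

```python
import configparser
import posixpath


def find_path_problems(config_text, sections=("references",)):
    """Return (section, key, reason) for each empty or relative path entry."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep keys exactly as written instead of lowercasing them.
    parser.optionxform = str
    parser.read_string(config_text)
    problems = []
    for section in sections:
        if not parser.has_section(section):
            continue
        for key, value in parser.items(section):
            if not value:
                problems.append((section, key, "empty"))
            # posixpath is used because the paths on this page are POSIX-style.
            elif not posixpath.isabs(value):
                problems.append((section, key, "not absolute"))
    return problems
```

Passing `sections=("references", "indices", "otherFiles")` applies the same check to the other path sections of the config.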

indices

Absolute paths to complete indices, based on Ensembl86
ensembl_star_hg38_sjdb49: Path to STAR index
ensembl_bowtie_hg38: Path to bowtie index
ensembl_starfusion_hg38: Path to STAR-Fusion index
ensembl_fusioncatcher_hg38: Path to Fusioncatcher index

otherfiles

Single config files containing options/indices for certain tools/modules
ensembl_infusion_cfg_hg38: Path to InFusion config
ensembl_soapfuse_cfg_hg38: Path to SoapFuse config
easyfuse_model: Path to .rds file for EasyFuse model
