Skip to content

ConesaLab/courses-SQANTI_verse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

30 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

SQANTI verse tutorial

git clone https://github.com/ConesaLab/courses-SQANTI_verse.git

Introduction to SQANTI3

SQANTI3 is a comprehensive bioinformatics tool designed for the quality control, curation, and annotation of long-read transcriptomes. It integrates multiple data sources to enhance the accuracy of transcriptome characterization, making it an essential component of the Functional IsoTranscriptomics framework, alongside tools like IsoAnnot and tappAS. (GitHub Repository)

Key Features

  • Integration of Diverse Data Types: SQANTI3 supports the incorporation of various data types, including:

    • Short-read sequencing data
    • CAGE (Cap Analysis Gene Expression) peak sequencing information
    • PolyA motifs and Quant-seq
    • Expression matrices derived from tools like Kallisto
    • Full-length counts from PacBio data
  • Modular Architecture: The tool is structured into three main modules:

    1. QC (Quality Control): Classifies isoforms based on structural categories.
    2. Filter: Applies user-defined criteria to remove potential artifacts.
    3. Rescue: Recovers isoforms that may have been erroneously filtered out but have supporting evidence.

Structural Categories in QC Module

The QC module categorizes isoforms into well-defined structural categories, facilitating a nuanced understanding of transcriptome composition:

πŸ—οΈ SQANTI3 Structural Categories
  1. Full-Splice Match (FSM): Isoforms with splice junctions that perfectly match a reference transcript.
  2. Incomplete-Splice Match (ISM): Isoforms matching a consecutive subset of splice junctions from a reference transcript.
  3. Novel In Catalog (NIC): Isoforms with novel combinations of known splice sites.
  4. Novel Not In Catalog (NNC): Isoforms containing at least one novel splice site not present in the reference annotation.
  5. Antisense: Isoforms aligning to a gene locus but transcribed from the opposite strand.
  6. Fusion: Isoforms comprising exons from two or more distinct gene loci.
  7. Genic Genomic: Isoforms located within a gene locus but not reconstructing any known or novel splicing pattern.
  8. Intergenic: Isoforms aligning to genomic regions outside of any annotated gene.

Course Objectives

Throughout this course, you will gain hands-on experience with SQANTI3's three main modules:

  • QC: Learn how to classify isoforms and interpret quality descriptors.
  • Filter: Understand how to apply filtering criteria to refine your transcriptome data.
  • Rescue: Explore methods to recover valid isoforms that may have been inadvertently excluded.

By the end of the course, you will be equipped to effectively utilize SQANTI3 for comprehensive transcriptome analysis, integrating various data types to achieve accurate and reliable results.

⚠️ This course has been done using the latest release of SQANTI3 (v5.4).

0. Pre-requisites

Before starting the tutorial, it is key to have a clean and organized working environment. The initial step, even before processing any data is to prepare the working environment. In bioinformatics, an organized workspace is vital, so when you come after some time to your project, you can find and understand what you were doing, rather than spend hours searching through weirdly named directories. It is important to always create three directories:

  • scripts: all the scripts will be stored here, with meaningful names
  • data: Raw data will go in here and, if you want and need, databases
  • results: Create a sub directory for every different process you do. If you run a process multiple times with different parameters, include them in the directory name, so you will differentiate them in the future.
mkdir scripts
mkdir data

Software installation

SQANTI3 can be directly downloaded from GitHub using the following command:

wget https://github.com/ConesaLab/SQANTI3/releases/download/v5.4/SQANTI3_v5.4.zip
mkdir -p tools/sqanti3
unzip SQANTI3_v5.4.zip -d tools/sqanti3

Once it is downloaded and unzipped, you can either add it to your PATH or call the programs using the full path.

πŸ“ƒ Adding SQANTI3 to your PATH

In the terminal, you have to find the file .bashrc or .bash_profile in your home directory. Use your favorite text editor to open it and add the following line

export PATH=$PATH:/path/to/SQANTI3_v5.4

Then, save the file and SQANTI3 will be in your path for the next terminal you open. As well, you can make this changes instantaneous by running source ~/.bashrc or source ~/.bash_profile in the terminal.



The final step to have a functional sqanti3 installation is to install the dependencies. SQANTI3 has a few dependencies that need to be installed before running the program, but that is all handled by anaconda or mamba, depending on which one you prefer (mamba is highly recommended for better environment solving and consistency).

mamba env create -f tools/sqanti3/SQANTI3.conda.env.yml

Data downloading

All the data needed for the tutorial can be found in the data directory of this repository. It contains the long-read defined transcriptome, the reference genome, annotation and the orthogonal data used in the tutorial. For the sake of simplicity and time, only the isoforms that are part of the chromosome 22 of humans will be used, which are more than enough to go through all the SQANTI3 functionalities. If you wish to learn more about the data origin, you can check it out in the SQANTI3 paper.

1. SQANTI3 QC

The first step of the suite, and where the SQANTI verse begins in the Quality Control (QC) module. This module is designed to assess the quality of a transcriptome, and integrate multiple kinds of orthogonal data that might help to understand and determine what are the true isoforms. As an end result, SQANTI3 QC will take as input the target transcriptome and the reference genome and annotation. The user can optionally add other data sources, such as short-reads RNA-seq data or CAGE peaks, to include more parameters that will be used in downstream analysis. The QC module will parse all of this information and produce a report and a classification on the given isoforms based on the structural categories defined in the SQANTI3 paper.

As well, SQANTI3 QC is able to determine CDS regions, using TransDecoder2 as predictor for these parts of the transcriptome, or even receive the isoforms in fasta format. SQANTI will map them against the reference genome and produce a gtf file to run with.

1.1. Basic run

Firstly, to get familiar with SQANTI3 QC, we will run it with the most basic parameters. The only required parameters are the input transcriptome, the reference genome and the reference annotation. The rest of the parameters are optional, but they will be explained in the next sections. The way the inputs have to be given is as follows:

  1. --isoforms: The input transcriptome. This can be a fasta file or a gtf/gff3 file. If you are using a fasta file, the --fasta flag has to be included. This will allow SQANTI3 to parse the input file correctly and map the reads against the genome to produce the gtf.

  2. --refGTF: Reference annotation in gtf format.

  3. --refFasta: Reference genome in fasta format.

⚠️ Special considerations

SQANTI3 was developed to work with PacBio long-reads transcriptomes. Thus, when parsing transcriptomes it will parse PacBio like transcripts and gene IDs. If your data does not come from a PacBio experiment, you will need to add the --ignore flag to allow the usage of transcript IDs that do not follow the PacBio format (PB.X.Y)



sqanti3_qc.py \
    --isoforms data/raw_transcriptome.gtf \
    --refGTF data/chr22_hg38.gtf \
    --refFasta data/chr22_hg38.fasta \
    --dir results/basic_sqanti3 --output course 

In this run, SQANTI3 will be carried out in its most basic functions. First, it will parse all inputs and load them. Then, it will predict the Open Reading Frames within the contigs, to determine if the isoforms are coding or not. It will also look at the Retrotranscriptase Switching and the percentage of A content after the TTS. Finally, it will produce the classification of the isoforms in the SQANTI3 categories. Now, lets, dive into the output files.

πŸ“€ Output files

The output files are stored in the directory results/basic_sqanti3/course. In this directory, you will find a few files and directories. The most important ones are:

  • course_corrected.gtf: The corrected GTF file. This file contains the parsed input isoforms, eliminating malformed lines from the GTF and correcting possible errors from the isoforms if they were given as a fasta files.

  • course_corrected.fasta: The corrected fasta file. This file contains the parsed input isoforms, eliminating malformed lines from the GTF and correcting possible errors from the isoforms if they were given as a fasta files. The sequence for the isoforms is directly taken from the reference genome (thus eliminating possible SNPs). These files will be the ones used by other SQANTI3 modules, rather than the original input files.

  • course_corrected.genePred: The corrected transcriptome in genePred format, since some steps of SQANTI3 require this format for compatibility with the orthogonal data.

  • course_corrected.cds.gff3: This file is a version of the corrected gtf that includes the predicted CDS regions. This file will be created if the option --skipORF is not included.

  • course_classification.txt: The classification file. This file contains the classification of the isoforms in the SQANTI3 categories. This file is the most important output of SQANTI3 QC, as it contains the information about the quality of the transcriptome and the classification of the isoforms.

  • course_junctions.txt: A tab-separated file with information at the junction level for all transcriptomes included in the classification file. Each row represents a specific junction and includes details such as genomic coordinates, whether it is a canonical junction (e.g., GT-AG, GC-AG, AT-AC) or non-canonical, and whether it is known (present in the reference annotation) or novel.

  • course_SQANTI3_report.html: The SQANTI3 report. This file contains a summary of the results and the classification of the isoforms. It is an HTML file that can be opened in any web browser. The report contains a summary of the results, including the number of isoforms, the number of genes, the number of junctions, and the classification of the isoforms. It also contains plots and figures that help to visualize the results.

  • course.qc_params.txt: The QC parameters file. This file contains the parameters used in the run, including the input files, the reference genome and annotation, and the options used in the run. This file is useful to keep track of the parameters used in the run and to reproduce the results.



πŸ“Š SQANTI3 concepts

Canonical and Non-Canonical Junctions

  • Canonical junctions are the dinucleotide pairs most commonly found at the ends of introns and are efficiently recognized by the splicing machinery. The most common pairs considered canonical are GT-AG, GC-AG, and AT-AC. The GT-AG combination is the most abundant in the human genome, representing around 98.9% of introns. Together, these three canonical combinations are found in over 99.9% of human introns. By default, SQANTI3 considers these three pairs as canonical but allows users to define their own set of canonical junctions using the --sites parameter.

  • Non-canonical junctions include all other dinucleotide combinations at intron boundaries that are not considered canonical. These junctions are much less frequent and are often associated with splicing errors or artifacts.

  • Detection by SQANTI: SQANTI3 analyzes transcript sequences and compares the dinucleotides present at the ends of each intron (defined by donor and acceptor splice sites) with the set of canonical junctions.

    • In the classification file (classification.txt), the all_canonical column indicates whether all junctions in an isoform have canonical splice sites.
    • In the junctions file (junctions.txt), the splice_site column shows the specific splicing motif of each junction. The start_site_category and end_site_category columns indicate whether the start and end sites of the junction are annotated as "known" in the reference annotation file. The junction_category column shows whether the donor-acceptor combination is "known" or "novel".
    • SQANTI3 calculates and reports the proportion of canonical and non-canonical junctions across different transcript categories to help identify possible artifacts. For instance, a high proportion of non-canonical junctions in certain novel transcript categories (such as NNC) may suggest a higher likelihood of being artifacts.

RT Switching (Reverse Transcriptase Template Switching)

  • RT switching is a phenomenon that occurs during reverse transcription, where the reverse transcriptase (RT) may prematurely switch RNA templates during cDNA synthesis. This can happen due to secondary structures in the messenger RNA or the presence of direct repeats in the RNA sequences. RT switching can generate spurious cDNA that is misinterpreted as splicing events, often resulting in non-canonical junctions. RT switching events have been observed to be enriched in low-abundance transcripts from highly expressed genes.

  • Detection by SQANTI3: SQANTI3 implements an algorithm to predict potential RT switching artifacts. The algorithm scans all junctions (both canonical and non-canonical) for direct repeat patterns at defined sequence locations characteristic of RT switching.

    • In the classification file (classification.txt), the RTS_stage column is set to TRUE if one of the isoform's junctions may be an RT switching artifact.
    • In the junctions file (junctions.txt), the RTS_junction column is set to TRUE if the junction is predicted to be a template-switching artifact.
    • SQANTI3 analyzes the frequency of RT switching predictions across different transcript categories and junction types to help users identify potential artifacts introduced during cDNA library preparation. For example, a higher proportion of RT switching predictions in NNC transcripts may indicate that these are more likely to be artifacts.

Nonsense-Mediated Decay (NMD)

  • NMD is a cellular surveillance mechanism that degrades mRNA transcripts containing premature stop codons (nonsense mutations) to prevent the production of truncated and potentially harmful proteins. NMD can also target long 3' UTRs or aberrant splicing events.
  • Detection by SQANTI3: SQANTI3 does not directly predict NMD events but provides information on the potential for NMD in the classification file. The 'predicted_NMD' column will be flagged as TRUE if the isoform has a predicted end of the CDS at least 50bp before the last junction.

By identifying and characterizing canonical and non-canonical junctions, as well as potential RT switching artifacts or possible Nonsense-mediated decay, SQANTI3 supports quality control and curation of long-read transcriptomes, enabling more accurate identification of real isoforms.



For now, we will focus on the main output from SQANTI, the classification file (you can get more information about the other output on the tab above). This file contains 48 columns with information about the isoforms and the results from the run. Two important columns are the structural_category, which indicates the structural category that each isoform belongs to, and associated_gene, which indicates the reference gene that a certain isoform is associated with (if any).

❓Trivia: Take a look at SQANTI3 report (open it in Google Chrome or Firefox). How many isoforms are classified as FSM? How many are NNC?

πŸ₯³ You have completed a basic run of SQANTI3 QC, and you have learned how to run it and what are the main output files. In the next sections, we will explore the additional inputs that SQANTI3 QC is able to integrate, and how to use them to improve the results of the classification.

1.2 Run with additional inputs

The most common extra information that you will use in combination with lr RNA-seq is short reads RNA data. While long-reads are fabulous for determining the structure of an isoform, since they are able to capture full transcripts without the need of an assembly, they tend to be quite noisy and with high error rates, especially if they come from 'old' chemistries. That is why, combining them with short reads is such a good idea. There reads will map to the transcriptome and can be used for multiple purposes, such as identifying isoform expression, fixing indels or giving support to junctions.

The other two main types of orthogonal data that SQANTI3 QC can use are CAGE peaks and polyA sites. CAGE peaks are used to identify the transcription start sites (TSS) of the isoforms, while polyA sites are used to identify the transcription termination sites (TTS) of the isoforms. These two types of data can be used to improve the classification of the isoforms, and to distinguish between true isoforms and artifacts caused by degradation of the RNA, for instance.

sqanti3_qc.py \
    --isoforms data/raw_transcriptome.gtf \
    --refGTF data/chr22_hg38.gtf \
    --refFasta data/chr22_hg38.fasta \
    --short_reads data/short_reads.fofn \
    --CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed \
    --polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt \
    --dir results/complete_sqanti3 --output course

When it comes to the output files, they won't change much form a SQANTI3 run without the extra information. The main difference will be within the report, where some of the columns that were NAs before, now will be filled with the information from the short-reads, CAGE peaks and polyA motifs.

πŸ₯³ You have completed a full run of SQANTI3 QC, the first and central module of the SQANTI-verse. You have learned how to run it, what are the main output files and how to use additional information to improve the classification of the isoforms. Now, it is your time to show all you have learned about SQANTI3 QC. Try to complete the qc worksheet using your knowledge, the data you just generated and a little help from the SQANTI3 wiki and report πŸ˜‰.

In the next sections, we will explore the other modules of SQANTI3, which are designed to curate and filter the transcriptome based on the results from SQANTI3 QC.

alt text

2. SQANTI3 filter

The SQANTI3 filter module provides two primary technical approaches for curating long-read transcriptomes by removing potential artifacts: a rules-based filter and a machine learning-based filter. The rules filter operates by applying user-defined criteria, expressed in a JSON format, to the SQANTI3 QC classification file (*_classification.txt). These rules specify characteristics that reliable isoforms should possess, considering various QC attributes. The machine learning filter, on the other hand, employs a random forest classifier trained on true positive (TP) and true negative (TN) isoform sets, leveraging SQANTI3 QC attributes to predict the probability of a transcript being a genuine isoform or an artifact. Both filtering methods can generate a filtered classification file and optionally filter associated FASTA/FASTQ, GTF, SAM, and FAA files based on the identified high-quality isoforms.

For the sake of this tutorial, we will focus on the rules-based filter, which allows for more flexibility and customization from the user. The first step here is to understand how the filter works.

2.1 Rules

The rules filter is based on a set of rules (surprise!) that are defined in a JSON file. SQANTI3 filter comes by default with its own set of rules, which do a really basic filtering of the isoforms. The way the rules work is the following:

  1. Create a set of rules for a specific structural category (or for all under the tag 'rest').
  2. The rules are treated as OR conditions, meaning that if an isoform passes any of the rules, it will be kept.
  3. Within a rule, the user can set multiple conditions, which are based on the attributes of the SQANTI3 QC classification file. These conditions are treated as AND conditions, meaning that all conditions within a rule have to be met for the isoform to be kept.

The conditions included must have the same name as the columns in the classification file, and you can use 3 possible value types:

  1. A single integer to denote the minimum value
  2. A range of integers, using the format [min,max] to denote the range of values.
  3. A string or a list of strings to denote a specific value or a list of values.
πŸ“œ Example of rules
  • FSM: Keep if the isoform is:
    • NOT a potential intrapriming product (less than 60% in the perc_A_downstream_TTS column). OR
    • It is larger than 2kb AND has more than 1 exon
  • ISM: Keep if:
    • It is larger than 2kb and shorter than 15kb AND
    • It is cataloged within the "3prime_fragment", "5prime_fragment" or "internal_fragment" subcategories.
{
    "full-splice_match": [
        {
            "perc_A_downstream_TTS":[0,59]
        },
        {
            "length": 2001,
            "exon_count": 2
        }
    ],
    "incomplete-splice_match":[
        {
            "length":[2001,14999],
            "subcategory": ["3prime_fragment", "5prime_fragment", "internal_fragment"]
        }
    ]
}

Note: In case of a condition being added but the column being an NA (not having information in the classification file), the rule check will automatically fail. This is important to keep in mind when creating the rules, as you might accidentally add a condition that you did not analyze and cause all of your isoforms to fail because of it.

❓ Trivia: Now, try to create your own set of rules for the following conditions:

  • FSM: Keep if the isoform:

    • Has a perc_A_downstream_TTS lower than 60% (not likely intrapriming) AND
    • Has no RT Switching
  • ISM: Keep if:

    • Has no RT Switching AND
    • It belongs to subcategories "3prime_fragment" or "5prime_fragment".
  • NIC: Keep if:

    • The junctions are all canonical OR
    • The minimum coverage (min_cov) is greater than or equal to 5.
  • NNC: Keep if:

    • The junctions are all canonical AND
    • Distance to the closest TSS (diff_to_gene_TSS) and TTS (diff_to_gene_TTS) is within Β±50 bp.
  • Rest of the categories: Keep if:

    • The transcript is coding AND
    • It is not an intrapriming artifact AND
    • It has at least 2 exons AND
    • The splice junctions are canonical AND
    • It is not an RTS artifact.
πŸ€” Solution
{
  "full-splice_match": [
    {
      "perc_A_downstream_TTS": [0, 59],
      "RTS_stage": false
    }
  ],
  "incomplete-splice_match": [
    {
      "length": [2000, 15000],
      "subcategory": ["3prime_fragment", "5prime_fragment"]
    }
  ],
  "novel_in_catalog": [
    {
      "all_canonical": "canonical"
    },
    {
      "min_cov": 5
    }
  ],
  "novel_not_in_catalog": [
    {
      "all_canonical": "canonical",
      "diff_to_gene_TSS": [-50, 50],
      "diff_to_gene_TTS": [-50, 50]
    }
  ],
  "rest": [
    {
      "coding": "coding",
      "perc_A_downstream_TTS": [0, 59],
      "exons": 2,
      "all_canonical": "canonical",
      "RTS_stage": false
    }
  ]
}

Now, that you know the basics of the rules filter, lets dive into running it. The rules module comes with two different subparsers, one for the rules strategy and another for the machine learning strategy. Most of the inputs given are common, and in our case, the only special input will be the JSON file with the actual rules. For the sake of this tutorial, we will use one rule file for the two classifications we did, the one without the orthogonal data and the one with it.

sqanti3_filter.py rules \
    --sqanti3_class <path_to_sqanti3_dir>/course_classification.tsv \
    --filter_gtf <path_to_sqanti3_dir>/course_corrected.gtf \
    --json_filter data/rules.json \
    --dir results/rules_filter --output course 

You will have to run this command twice, once for each classification file. Remember to also change the output directory, so one run won't overwrite the other. Once both runs are finished, move to the questionnaire filter_worksheet.md and try to answer the questions. You can use any programming language or tool to answer the questions. The questions are designed to help you understand the output of SQANTI3 filter and the classification file.

πŸ₯³ You successfully completed your first runs of SQANTI3 filter. Congrats! You are now one step closer to become an SQANTI3 expert!

3. SQANTI3 rescue

The final module of the main SQANTI3 pipeline is rescue. The aim of this module is to prevent the loss of transcripts from long-read data that might be biologically relevant but could be discarded due to not being accurately processed. In the end, because of degradation form an isoforms or issues during sequencing, we might not fully capture and during filtering consider the transcript to be an artifact. SQANTI3 rescue offers two different approaches towards the isoform recovery:

  1. Automatic rescue: This first approach is run always, and by default. It focuses on recovering high-confidence reference isoforms by identifying reference transcripts for which all corresponding Full Splice Match (FSM) isoforms were removed during the filtering stage.

  2. Full rescue: After the automatic rescue of the FSM transcripts, the rest of structural categories undergo rescue. For that, they are considered to be candidates to be rescued, and the targets are both the reference isoforms that they have assigned, plus the respective long-read transcripts.

The rescued transcriptome will include the rescued isoforms, ensuring that no redundancy is introduced. Meaning that if a reference transcript is going to be reintroduced, SQANTI3 rescue will always check first if it is already present in the transcriptome.

Note: The full rescue pipeline will slightly change based on the filter strategy applied, since the reference transcriptome has to be processed following the same strategy as the long-reads transcriptome

In order to run SQANTI3 rescue in full mode, you need to have run SQANTI3 QC on the reference annotation against itself, using the same orthogonal data as in your dataset. This step has already been done for you, but if you want to test yourself, go ahead and try to run it πŸ˜‰.

Code to run SQANTI3 QC on the reference ```bash sqanti_qc.py \ --isoforms data/chr22_hg38.gtf \ --refGTF data/chr22_hg38.gtf \ --refFasta data/chr22_hg38.gtf \ --short_reads data/short_reads.fofn \ --CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg39.bed \ --polyA_motif_list data/polyA_motifs/mouse_and human.polyA_motif.txt \ --dir reference_sqanti3 --output reference ```

In this tutorial we will run the full rescue using the results from filtering the long-read transcriptome that had the orthogonal data integrated. Since we are working with a small dataset, we will run the full rescue, and compare the results from the automatic rescue, which is always run first, with those of the full rescue.

The command is as follows:

sqanti3_rescue.py rules \
    --filter_class results/complete_filter/course_RulesFilter_result_classification.txt \
    --refGTF data/reference/gencode.v38.basic_chr22.gtf \
    --refFasta data/reference/GRCh38.p13_chr22.fasta \
    --refClassif data/reference/gencode.v38.basic_chr22_classification.txt \
    --mode full \
    --json_filter data/filter_rules.json \
    --rescue_isoform results/complete_sqanti3/course_corrected.fasta \
    --rescue_gtf results/complete_filter/course.filtered.gtf \
    --dir results/complete_rescue --output course

⚠️ In --rescue_gtf you have to specify the filtered gtf file, not the corrected one, otherwise the artifact isoforms will be kept after rescue. However, in rescue_isoforms you do need to use the non-filtered fasta file with all the isoforms. That way, the mapping step can be done correctly.

This pipeline is a bit more complex. It follows four main steps to rescue the isoforms:

  1. Candidate and target selection: Candidate isoforms are the ISM, NIC and NNC isoforms that were considered as artifacts in the filtering step. The rescue targets are all the reference or long read-defined transcripts that are associated with the candidate isoforms.

  2. Maping of the candidates to the targets: Minimap2 is used to map the candidates to the targets and find the best matches between each target and the same gene candidates.

  3. Filtering of the reference isoforms: Using the same strategy as with the long-read transcriptome, the reference isoforms are run through the SQANTI3 filter module. This step is done to validate the reference transcripts using the orthogonal data.

  4. Rescue of the isoforms: In the final step, the rescue-by-mapping takes place. The mapping hits (candidate-target mapping pairs) are run through a series of criteria:

    • If the rescue target did not pass the filter, the candidate-target will be removed.
    • Long read transcripts will be discarded if there is another mapping hit for the same candidate isoform to a target that is from the reference transcriptome.
    • If the candidate isoform is already present in the transcriptome as a FSM or has already been rescued in the automatic part, it will be discarded to avoid redundancy.

The output in this case will be the same as before, but most likely, more transcripts will be rescued. To finish with the SQANTI3 tutorial, lets go and complete the last worksheet rescue_worksheet.md. The questions are designed to help you understand the output of SQANTI3 rescue and why some isoforms were rescued and others not.

alt_text

5. SQANTI3 wrapper

As of release v5.4, a new wrapper script and configuration file for SQANTI3 have been added. This wrapper is designed to simplify the process of running SQANTI3 and its associated modules. The wrapper script allows users to run all of SQANTI3 modules (qc, filter and rescue), specifying the input files, output directories, and all the parameters in a single configuration file, making it easier to manage and run multiple analyses.

There are 5 possible ways to run the wrapper:

  1. initcreates the configuration file
  2. all runs all the selected modules of SQANTI
  3. qc runs the SQANTI3 QC module
  4. filter runs the SQANTI3 filter module
  5. rescue runs the SQANTI3 rescue module

In order to facilitate the creation of the configuration file, to include all of the needed parameters, the wrapper has a init module that creates the configuration with SQANTI3 default values:

sqanti3 init -c <config_file>

If the user desires to set some predefined parameters at this step, it can be done by including them via the option -a, followed of as many combinations of param=value as needed. This option can also be used to overwrite the parameters in the config files between runs.

Ex: Imagine you want to rerun SQANTI3 QC changing only one parameter (the isoforms input) but keeping the rest of files the same. This can be done by running the following command:

sqanti3 qc -c <config_file> -a isoforms=<path_to_isoforms> dir=run_2

In the configuration file, each module options are preceeded by the enabled tag. Setting it to true or false will activate or deactivate each module respectively when running the wrapper in all mode.

Finally, there are two new options:

  1. --dry-run: This option will run the wrapper in dry-run mode, meaning that it will not execute any of the modules, but will print the commands that would be executed
  2. --log_level: Determines the amount of log that SQANTI3 will produce. By default it is set to INFO, showing the main information. If it is too much, it can be reduced by changing it to ERROR or WARNING as desired.

BONUS

Lets see who wins the SQANTI3 challenge!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published