Main scripts for the paper:
Direct long-read RNA sequencing identifies a subset of questionable exitrons likely arising from reverse transcription artifacts
Isoforms are identified from long-read data with the ONT pipeline, which is based on StringTie and other tools.
cDNA and direct RNA reads are processed independently, using the default parameters of the pipeline.
The script candidate_search.R is run as follows:
Rscript candidate_search.R <input_file>
The input file contains the following lines:
- Library query
- Library target
- path to the gffcompare tracking file in the form: query_target.tracking
- path to the gffcompare tracking file in the form: target_query.tracking
- path to the query GFF annotation file
- path to the target GFF annotation file
- path to the query BAM file
- path to the target BAM file
- output path including prefix to use on the output files
An example of this file is contained here.
query = Library with potential artifacts (usually cDNA)
target = Library to compare against (usually dRNA)
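
As a rough illustration of the input file described above, the following shell sketch writes a nine-line file. It assumes one value per line, in the order listed, with no keys or comments; that format is an assumption and should be checked against the example file in the repository. All library names, directories, and file names below are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: every name and path below is a hypothetical placeholder.
# Assumption: one value per line, in the order given in the list above.
input_file="candidate_input.txt"

cat > "$input_file" <<'EOF'
cDNA
dRNA
gffcompare/cDNA_dRNA.tracking
gffcompare/dRNA_cDNA.tracking
annotation/cDNA.gff
annotation/dRNA.gff
alignments/cDNA.bam
alignments/dRNA.bam
results/cDNA_vs_dRNA
EOF

# Sanity check: the list above has nine entries, so the file should have nine lines.
n_lines=$(wc -l < "$input_file" | tr -d ' ')
echo "Wrote $n_lines lines to $input_file"
```

With real paths in place, the file would then be passed to the script as Rscript candidate_search.R candidate_input.txt.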
The script will output the candidates under different filters, as described in the figure above.
The script repeat_search.R processes the output file for filter F3 to search for direct repeats in the candidates.