Input files
TOGA2 accepts reference and query genome sequences in .2bit format. FASTA sequences can be converted to .2bit with the UCSC faToTwoBit utility:
    faToTwoBit in.fasta out.2bit
faToTwoBit, along with other UCSC utilities, is downloaded to the bin/ folder during TOGA2 installation. A compiled binary can also be downloaded from the UCSC download portal or installed via Conda.
TOGA was developed and tested with pairwise genome alignment data produced by the Hiller Lab LASTZ pipeline. The pipeline produces high-scoring, highly contiguous alignment chains ideal for TOGA annotation, and the list of software requirements (Python, Nextflow, UCSC utilities) overlaps significantly with that of TOGA, facilitating their joint installation and use.
Note
To be expanded
Below is a non-exhaustive list of potential quirks to check your chain files for. For a general description of the chain file format, consult the UCSC documentation.
- Chain entries must be separated by a blank line (see the example in the UCSC documentation). chaintools relies on this convention for fast chain file parsing, and TOGA2 does not compensate for its violation. If your chain file lacks blank-line separators, insert them before every chain header except the first with the following or a similar command (GNU sed):
    sed -i '2,$ s/^chain/\nchain/' in.chains
  While this requirement might be relaxed in future chaintools releases, we encourage users to stick to the UCSC format to avoid potential problems.
- In the same vein, chaintools struggles with parsing metadata lines (lines starting with '#'). This restriction might be relaxed in future chaintools releases as well; until then, consider removing those lines from your chain file:
    sed -i '/^#/d' in.chains
- The minimal chain score considered by TOGA2 is controlled by the parameters -mcs/--min_chain_score and -minscore/--min_orthologous_chain_score. Chains with score x < min_chain_score are filtered out at the initial TOGA step, and those with score x < min_orthologous_chain_score are not used for annotation of loci other than processed pseudogenes/retrogenes. By default, both parameters are set to 15000, which filters out most of the spurious low-quality chains in vertebrate genome alignments. If your chains are inherently fragmented and low-scoring, however, the default filters might discard important orthology data. To check how chain alignment scores are distributed, extract them from the chain file:
    awk '($1 == "chain"){print $2}' in.chain > scores.txt
  There is currently no established recipe for estimating the optimal cutoff, but lowering the threshold below 1000 is generally not recommended.
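The three checks above can be combined into a quick pre-flight report. The following is a minimal sketch, not part of TOGA2: the toy chain file, the variable names, and the `min_score` value (set here to the documented default of 15000) are illustrative assumptions, and the script only reports problems without modifying the file.

```shell
#!/bin/sh
set -eu

# Hypothetical toy input; point "chains" at your own file instead.
workdir=$(mktemp -d)
chains="$workdir/in.chain"
cat > "$chains" <<'EOF'
#metadata line
chain 20000 chr1 1000 + 0 100 chrA 1000 + 0 100 1
100

chain 500 chr1 1000 + 200 300 chrB 1000 + 0 100 2
100
chain 16000 chr2 1000 + 0 50 chrC 1000 + 0 50 3
50

EOF

# 1. Metadata lines (lines starting with '#').
n_meta=$(grep -c '^#' "$chains" || true)

# 2. Chain headers (other than the first) not preceded by a blank line.
n_unsep=$(awk '/^chain /{ if (seen && prev != "") n++; seen = 1 }
               { prev = $0 }
               END { print n + 0 }' "$chains")

# 3. Chains scoring below the cutoff (assumed here: the default, 15000).
min_score=15000
n_total=$(awk '$1 == "chain" { n++ } END { print n + 0 }' "$chains")
n_low=$(awk -v min="$min_score" \
            '$1 == "chain" && $2 < min { n++ } END { print n + 0 }' "$chains")

# 4. Rough score summary; for an even count this reports the lower median.
summary=$(awk '$1 == "chain" { print $2 }' "$chains" | sort -n |
  awk '{ s[NR] = $1 }
       END { if (NR) print "min=" s[1], "median=" s[int((NR + 1) / 2)], "max=" s[NR] }')

echo "metadata lines:            $n_meta"
echo "headers lacking separator: $n_unsep"
echo "chains below $min_score:      $n_low of $n_total"
echo "score summary:             $summary"
```

Non-zero counts in the first two lines suggest applying the sed commands above; a large share of chains below the cutoff suggests inspecting the score distribution before deciding whether to lower the thresholds.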
Note
To be expanded
- TOGA2 uses certain special characters as delimiters in the output data, which limits the set of symbols valid in input transcripts’ names. The accepted symbols are Latin alphabet letters (upper- and lowercase), digits, dots ('.'), hyphens ('-'), and sharps ('#'). Transcripts whose names contain any other symbols are discarded at the initial step of the TOGA pipeline.
- Despite being the primary field delimiter in query transcripts’ names, the sharp ('#') symbol is accepted in reference transcript names. Annotation files provided with the TOGA2 release contain reference gene names separated from transcript identifiers by a sharp (${transcript_id}#${gene_id}) to facilitate analysis and interpretation of the results, and we strongly recommend that users format transcript names in a similar manner. Note that additional data contained in transcript names are not used by TOGA2 in any way, and gene names added to transcript names do not serve as a substitute for the input isoform file.
- The prepare-input mode automatically tests the input annotation for the points listed above. Running the preparation pipeline on your input data before running TOGA2 annotation is highly recommended.
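For a quick manual look before running the preparation pipeline, the naming rule above can be approximated with a one-line awk filter. This is a sketch under assumptions, not TOGA2's own check: it assumes a tab-separated BED file with the transcript name in column 4, and the file name and toy records are hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical toy annotation; only the first four BED columns matter here.
workdir=$(mktemp -d)
bed="$workdir/annotation.bed"
printf 'chr1\t0\t100\tENST0001.2#GENE1\n' >  "$bed"  # accepted symbols only
printf 'chr1\t200\t300\ttx_2\n'           >> "$bed"  # underscore is not in the accepted list
printf 'chr2\t0\t50\ttx-3\n'              >> "$bed"  # accepted symbols only
printf 'chr2\t60\t90\ttx|4\n'             >> "$bed"  # pipe is not in the accepted list

# Print names containing anything other than letters, digits, '.', '-', '#'.
bad_names=$(awk -F'\t' '$4 !~ /^[A-Za-z0-9.#-]+$/ { print $4 }' "$bed")

if [ -n "$bad_names" ]; then
  echo "Transcript names with symbols outside the accepted list:"
  echo "$bad_names"
fi
```

Names reported by this filter are the ones to rename before annotation if the transcripts should be kept.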
Note
To be expanded