
Input files

Yury V. Malovichko edited this page Nov 6, 2025 · 7 revisions

Mandatory input

2bit genome files

TOGA2 accepts reference and query genome sequences in .2bit format. FASTA sequences can be converted to .2bit with the UCSC faToTwoBit utility:

faToTwoBit in.fasta out.2bit

faToTwoBit, along with other UCSC utilities, is downloaded to the bin/ folder during TOGA2 installation. A compiled binary can also be downloaded from the UCSC download portal or installed via Conda.

Alignment chain files

Aligning genomes with recommended pipeline

TOGA was developed and tested with pairwise genome alignment data produced by the Hiller Lab LASTZ pipeline. The pipeline produces high-scoring, highly contiguous alignment chains ideal for TOGA annotation, and the list of software requirements (Python, Nextflow, UCSC utilities) overlaps significantly with that of TOGA, facilitating their joint installation and use.

Aligning genomes with custom pipeline

Note

To be expanded

Extracting alignment chains from HAL multiple genome alignment format

Note

To be expanded

Chain file sanity checklist

Below is a non-exhaustive list of potential quirks to check your chain files for. For a general description of the chain file format, consult the UCSC documentation.

  1. Chain entries must be separated by a blank line (see the example in the UCSC documentation). chaintools relies on this convention for fast chain file parsing, and TOGA2 does not compensate for its violation. If your chain file lacks blank line separators, insert them with the following or a similar command:

    # GNU sed: insert a blank line before every chain header except the first one
    sed -i '2,$ s/^chain /\n&/' in.chains
    

    While this requirement might be relaxed in future chaintools releases, we encourage users to stick to the UCSC format to avoid potential problems.

  2. Similarly, chaintools has trouble parsing metadata lines (those starting with '#'). This restriction might also be relaxed in future chaintools releases; until then, consider removing those lines from your chain file:

    sed -i '/^#/d' in.chains
    
  3. The minimal chain score considered by TOGA2 is controlled by the parameters -mcs/--min_chain_score and -minscore/--min_orthologous_chain_score: chains with a score below min_chain_score are filtered out at the initial TOGA step, and those with a score below min_orthologous_chain_score are not used for annotation of loci other than processed pseudogenes/retrogenes. By default, both parameters are set to 15000, which filters out most of the spurious low-quality chains in vertebrate genome alignments. If your chains are inherently fragmented and low-scoring, however, the default filters might cause important orthology data to be lost. To check how chain alignment scores are distributed, extract them from the chain file:

    awk '($1 == "chain"){print $2}' in.chains > scores.txt
    

    There is currently no established recipe for estimating the optimal cutoff, but lowering the threshold below 1000 is generally not recommended.
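The three checks above can be combined into a small set of shell helpers. This is only a sketch: the function names are illustrative and not part of TOGA2 or chaintools, a "blank line" is taken to mean a completely empty line, and the cutoffs passed to the last helper are up to you.

```shell
# Illustrative helpers for the chain file checklist; names are hypothetical.

# Count chain header lines (other than the first one) that are not
# preceded by an empty line. A result of 0 means separators are in place.
count_missing_separators() {
    awk '$1 == "chain" { if (seen && prev != "") n++; seen = 1 }
         { prev = $0 }
         END { print n + 0 }' "$1"
}

# Count metadata lines, i.e. lines starting with "#".
count_metadata_lines() {
    awk '/^#/ { n++ } END { print n + 0 }' "$1"
}

# For each cutoff given after the file name, print the cutoff and the
# number of chains whose score (second header field) is at or above it.
chains_above_cutoffs() {
    file=$1
    shift
    for cutoff in "$@"; do
        awk -v c="$cutoff" '$1 == "chain" && $2 >= c { n++ }
                            END { printf "%s\t%d\n", c, n }' "$file"
    done
}
```

After sourcing the helpers, a call such as `chains_above_cutoffs in.chains 1000 5000 15000` shows how many chains would survive each candidate threshold, which may help when deciding whether the default score filters suit your alignment.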

Reference annotation Bed file

Note

To be expanded

Bed file sanity checklist

  1. TOGA2 uses certain special characters as delimiters in the output data, which limits the set of symbols valid in input transcript names. The accepted symbols are Latin letters (upper- and lowercase), digits, dots ('.'), hyphens ('-'), and hash signs ('#'). Transcripts whose names contain any other symbol are discarded at the initial step of the TOGA2 pipeline.
    • Although the hash sign ('#') is the primary field delimiter in query transcript names, it is accepted in reference transcript names. Annotation files provided with the TOGA2 release contain reference gene names separated from transcript identifiers by a hash sign (${transcript_id}#${gene_id}) to facilitate analysis and interpretation of the results, and we highly recommend that users format transcript names in a similar manner. Note that TOGA2 does not use the additional data contained in transcript names in any way, and gene names added to transcript names do not substitute for the input isoform file.
  2. The prepare-input mode automatically checks the input annotation for the issues listed above. Running the preparation pipeline on your input data before running TOGA2 annotation is highly recommended.
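For a quick look before running prepare-input, transcript names with unaccepted symbols can be listed with a one-line awk filter. This is a rough pre-screen and not a replacement for prepare-input; it assumes a tab-delimited Bed file with the transcript name in the fourth column, and the function name is illustrative.

```shell
# Print transcript names (Bed column 4) containing any symbol other than
# Latin letters, digits, dots, hyphens, and hash signs.
# LC_ALL=C keeps the A-Z and a-z ranges limited to ASCII letters.
list_invalid_transcript_names() {
    LC_ALL=C awk -F'\t' '$4 !~ /^[A-Za-z0-9.#-]+$/ { print $4 }' "$1"
}
```

Calling `list_invalid_transcript_names annotation.bed` (with a hypothetical file name) prints one offending name per line; empty output means every name passes this particular check.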

Converting GTF/GFF files into Bed12 format

Note

To be expanded

Additional input

Isoform table

Note

To be expanded

U12 & non-canonical U2 introns file

Note

To be expanded

CESAR2 profiles

Note

To be expanded

Transcript URL table

Note

To be expanded

TOGA2 prepare-input mode
