Top level files

Gene annotation

query_annotation.bed

The ultimate TOGA2 annotation output, containing all projections annotated by TOGA2 in BED12 format. This file contains:

a final set of annotated orthologous projections, including those classified as Lost or Missing;
paralogous projections for transcripts which ended up without predicted orthologs in the query;
if the --annotate_processed_pseudogenes flag was set, processed pseudogene projections with a loss status of Intact or Fully Intact.

chr4	120267916	120287252	NM_001291281.3#FOXO6#5	0	-	120267916	120287252	0,0,100	3	475,593,414,	0,673,18922,

BED12 fields are:

chrom: name of the chromosome (contig, scaffold, etc.) the projection was annotated in (chr4);
chromStart: start of the annotated item in the query (120267916);
chromEnd: end of the annotated item in the query (120287252);
name: projection name in TOGA2 notation (input_transcript#chain_number(s)) (NM_001291281.3#FOXO6#5);
- Fragmented projections bear chain numbers for all annotated fragments as a comma-separated list, e.g. ENST00000369354.7#PDE4DIP#528,345
- Intact processed pseudogenes/retrogenes are designated with the “#retro” suffix, e.g. NM_001354845.2#CCNB1#26913#retro
score: set to 0 in the output BED files;
strand: strand of the projection in the query (-);
thickStart: start of the annotated coding sequence in the projection (120267916);
- in query_annotation.bed, equals chromStart
thickEnd: end of the annotated coding sequence in the projection (120287252);
- in query_annotation.bed, equals chromEnd
itemRgb: colour in RGB format corresponding to the projection loss status (0,0,100);
blockCount: number of exons in this projection (3);
blockSizes: exon sizes, listed in ascending genomic order, i.e. from 5’ to 3’ along the plus strand (475,593,414,);
blockStarts: relative start positions of exons, listed in the same order as blockSizes (0,673,18922,). Position 0 corresponds to chromStart.
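The field layout above can be turned into absolute exon coordinates with a few lines of Python. This is a minimal sketch, not part of TOGA2: the names `Bed12`, `parse_bed12`, and `split_projection_name` are hypothetical helpers, and the name parser assumes that the chain list is always the last `#`-separated item (before an optional `retro` suffix), as in the examples on this page.

```python
from typing import List, NamedTuple, Tuple


class Bed12(NamedTuple):
    chrom: str
    start: int          # chromStart, 0-based
    end: int            # chromEnd, exclusive
    name: str
    strand: str
    thick_start: int
    thick_end: int
    exons: List[Tuple[int, int]]  # absolute (start, end) per block


def parse_bed12(line: str) -> Bed12:
    """Parse one BED12 line and resolve block coordinates to absolute ones."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    start = int(f[1])
    # blockSizes/blockStarts end with a trailing comma, so drop empty items.
    sizes = [int(x) for x in f[10].split(",") if x]
    rel_starts = [int(x) for x in f[11].split(",") if x]
    if not len(sizes) == len(rel_starts) == int(f[9]):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    exons = [(start + rel, start + rel + size) for rel, size in zip(rel_starts, sizes)]
    return Bed12(f[0], start, int(f[2]), f[3], f[5], int(f[6]), int(f[7]), exons)


def split_projection_name(name: str) -> Tuple[str, List[int], bool]:
    """Split a projection name into (input transcript, chain numbers, is_retro)."""
    parts = name.split("#")
    is_retro = parts[-1] == "retro"
    if is_retro:
        parts = parts[:-1]
    chains = [int(c) for c in parts[-1].split(",")]
    return "#".join(parts[:-1]), chains, is_retro


# Example line from this page.
LINE = "\t".join([
    "chr4", "120267916", "120287252", "NM_001291281.3#FOXO6#5", "0", "-",
    "120267916", "120287252", "0,0,100", "3", "475,593,414,", "0,673,18922,",
])
rec = parse_bed12(LINE)
print(rec.exons)
print(split_projection_name(rec.name))
```

The last block ends exactly at chromEnd, which is a quick sanity check worth running on any BED12 record.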

For all projections that reached the CESAR alignment step and all exons aligned by CESAR, see meta/query_annotation.with_discarded_exons.bed.

query_annotation.with_utrs.bed

Counterpart of query_annotation.bed with added untranslated regions (UTR) annotation, in BED12 format. Does not appear if TOGA2 --no_utr_annotation flag was set.

chr4	120267078	120287743	NM_001291281.3#FOXO6#5	0	-	120267916	120287252	0,0,100	3	1313,593,905,	0,1511,19760,

Field specification is the same as in query_annotation.bed. Important differences are:

If chromStart != thickStart, the projection has annotated UTR sequence upstream (on the 5’ side). Likewise, if chromEnd != thickEnd, the projection has annotated UTR sequence on the downstream (3’) side;
blockCount contains the total number of exons, including both protein-coding and untranslated ones. Exons containing both coding and untranslated sequence are counted once;
blockSizes and blockStarts contain sizes and relative start positions for both coding and untranslated exons. Exons containing both coding and untranslated sequence are listed once.
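To see how much untranslated exonic sequence a projection carries on each side, the blocks can be intersected with the regions outside thickStart..thickEnd. The sketch below is illustrative only; `utr_exonic_bases` is a hypothetical helper. It reports the flanks in genomic orientation (left of thickStart, right of thickEnd) and does not try to decide which flank is the transcript's 5’ or 3’ UTR.

```python
from typing import Tuple


def utr_exonic_bases(bed12_line: str) -> Tuple[int, int]:
    """Return exonic UTR bases (left of thickStart, right of thickEnd)."""
    f = bed12_line.rstrip("\n").split("\t")
    chrom_start, thick_start, thick_end = int(f[1]), int(f[6]), int(f[7])
    sizes = [int(x) for x in f[10].split(",") if x]
    rel_starts = [int(x) for x in f[11].split(",") if x]
    left = right = 0
    for rel, size in zip(rel_starts, sizes):
        block_start = chrom_start + rel
        block_end = block_start + size
        # Part of the block lying before the coding region.
        left += max(0, min(block_end, thick_start) - block_start)
        # Part of the block lying after the coding region.
        right += max(0, block_end - max(block_start, thick_end))
    return left, right


# Example line from query_annotation.with_utrs.bed shown on this page.
UTR_LINE = "\t".join([
    "chr4", "120267078", "120287743", "NM_001291281.3#FOXO6#5", "0", "-",
    "120267916", "120287252", "0,0,100", "3", "1313,593,905,", "0,1511,19760,",
])
print(utr_exonic_bases(UTR_LINE))  # (838, 491)
```

For this record the numbers agree with the coding-only entry: the first block shrinks from 1313 to 475 bases (838 UTR bases) and the last from 905 to 414 (491 UTR bases).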

query_genes.tsv

Tab-separated list of all orthologous projections that are assigned to a query gene. The list is always given as pairs of:

query_gene: name of query gene/locus
projection: projection identifier for a transcript annotated by TOGA2 in this locus

query_gene	projection
ENSG00000206579	ENST00000327381.7#XKR4#143
ENSG00000104237	ENST00000220676.2#RP1#143
ENSG00000104237	ENST00000636932.1#RP1#143
reg_2	ENST00000398004.4#SLC35E3#35324

Note that this file represents gene-projection pairs in ‘long data’ format: if multiple projections correspond to a single query gene, each of them gets a separate line in the file. Query genes are named after their orthologs in the reference. Loci that do not have an established ortholog (having only Lost/Missing projections assigned to them) are named in the reg_${id_number} format. See Query gene naming in TOGA2 for more on the logic behind query gene names.
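Because the file is in long format, collecting all projections of a gene takes a single pass. A minimal sketch, assuming the file starts with the header line shown in the example; `projections_by_gene` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def projections_by_gene(handle: TextIO) -> Dict[str, List[str]]:
    """Group projection identifiers by query gene name."""
    genes: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        genes[row["query_gene"]].append(row["projection"])
    return dict(genes)


# Example rows from this page.
QUERY_GENES_TSV = "\n".join([
    "query_gene\tprojection",
    "ENSG00000206579\tENST00000327381.7#XKR4#143",
    "ENSG00000104237\tENST00000220676.2#RP1#143",
    "ENSG00000104237\tENST00000636932.1#RP1#143",
    "reg_2\tENST00000398004.4#SLC35E3#35324",
])
genes = projections_by_gene(io.StringIO(QUERY_GENES_TSV))
# Loci without an established ortholog carry a reg_ prefix.
unnamed_loci = [g for g in genes if g.startswith("reg_")]
print(genes["ENSG00000104237"], unnamed_loci)
```

With a real file, replace the `io.StringIO` wrapper with `open("query_genes.tsv", newline="")`.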

query_genes.bed

BED6 file containing coordinates for genes listed in query_genes.tsv:

chr1	3216021	3671348	ENSG00000206579	0	-
chr1	4120003	4352825	ENSG00000104237	0	-
chr1	3531794	3532739	reg_2	0	+

processed_pseudogenes.bed

Processed pseudogene annotation in BED9/BED12 format. The number of columns depends on the TOGA2 run settings:

if --annotate_processed_pseudogenes is set, the file contains annotated pseudogene projections in BED12 format; RGB colour in column 9 represents the loss status of the projection:

chr14	103370464	103371423	ENST00000222286.9#GAPDHS#294601	0	-	103370464	103371423	255,160,120	1	959,	0,

otherwise, overlapping query spans are merged into processed pseudogene regions and presented in BED9 format; RGB colour in column 9 is set to pink:

chr14	103370464	103371423	ENST00000222286.9#GAPDHS#294601	0	-	103370464	103371423	250,50,200
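Downstream scripts may need to handle both layouts. One simple approach, sketched here under the assumption that every line in a given file has the same number of columns, is to branch on the column count; `pseudogene_record_kind` is a hypothetical helper.

```python
def pseudogene_record_kind(line: str) -> str:
    """Tell annotated projections (BED12) from merged regions (BED9)."""
    n_columns = len(line.rstrip("\n").split("\t"))
    if n_columns == 12:
        return "projection"      # --annotate_processed_pseudogenes was set
    if n_columns == 9:
        return "merged_region"   # overlapping query spans merged into regions
    raise ValueError(f"unexpected number of columns: {n_columns}")


# Example lines from this page.
BED12_LINE = "\t".join([
    "chr14", "103370464", "103371423", "ENST00000222286.9#GAPDHS#294601", "0", "-",
    "103370464", "103371423", "255,160,120", "1", "959,", "0,",
])
BED9_LINE = "\t".join([
    "chr14", "103370464", "103371423", "ENST00000222286.9#GAPDHS#294601", "0", "-",
    "103370464", "103371423", "250,50,200",
])
print(pseudogene_record_kind(BED12_LINE), pseudogene_record_kind(BED9_LINE))
```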

Orthology inference

orthology_scores.tsv

Table of orthology probabilities predicted by the XGBoost classifier for all projections. Direct successor of TOGA1’s temp/orthology_scores.tsv file.

transcript	chain	pred
ENST00000249284.3#TAS2R16	510467	0.003924538381397724

Contains the following three columns:

transcript: reference transcript name (ENST00000249284.3#TAS2R16);
chain: chain identifier (510467);
pred: orthology probability according to the classifier (0.003924538381397724). Special values are reserved for programmatically defined projection classes, where the classifier results cannot be applied:
- -1.0 - for spanning chains
- -2.0 - for processed pseudogenes
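When reading this table, the reserved values should be separated from genuine classifier probabilities before any numeric filtering. The sketch below shows one way to do this; `classify_pred` is a hypothetical helper, and the two rows marked HYPOTHETICAL are invented solely to exercise the special values.

```python
import csv
import io

# Reserved values described above; everything else is a probability in [0, 1].
SPECIAL_PREDS = {-1.0: "spanning_chain", -2.0: "processed_pseudogene"}


def classify_pred(pred: float) -> str:
    """Label a value from the pred column."""
    if pred in SPECIAL_PREDS:
        return SPECIAL_PREDS[pred]
    if 0.0 <= pred <= 1.0:
        return "classifier_probability"
    raise ValueError(f"unexpected pred value: {pred}")


SCORES_TSV = "\n".join([
    "transcript\tchain\tpred",
    "ENST00000249284.3#TAS2R16\t510467\t0.003924538381397724",  # from this page
    "HYPOTHETICAL_TX#GENE_A\t1\t-1.0",                          # invented row
    "HYPOTHETICAL_TX#GENE_A\t2\t-2.0",                          # invented row
])
labels = [
    classify_pred(float(row["pred"]))
    for row in csv.DictReader(io.StringIO(SCORES_TSV), delimiter="\t")
]
print(labels)
```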

orthology_classification.tsv

A five-column tab-separated file describing orthology relationship of genes/transcripts between reference and query. Direct successor of TOGA1’s orthology_classification.tsv.

t_gene	t_transcript	q_gene	q_transcript	orthology_class
ENSG00000000003	ENST00000373020.9#TSPAN6	ENSG00000000003	ENST00000373020.9#TSPAN6#11	one2one

Columns are:

t_gene: reference gene identifier (ENSG00000000003); if --isoforms file is not provided, reference transcript name is used instead;
t_transcript: reference transcript name (ENST00000373020.9#TSPAN6);
q_gene: query gene name (ENSG00000000003);
q_transcript: query transcript (projection) name (ENST00000373020.9#TSPAN6#11);
orthology_class: orthology relationship class (one2one)

Orthology relationship classes correspond to TOGA1 categories:

one2zero: reference gene has no annotated orthologs in the query, or all of its projections were classified as lost/missing;
one2one: reference gene has precisely one ortholog in the query;
one2many: reference gene has more than one ortholog in the query, indicating gene duplication in the query or gene loss in the reference;
many2one: multiple reference genes are orthologous to a single query locus; indicates gene loss in the query or gene duplication in the reference;
many2many: confounded orthology relationship, with multiple reference genes being classified as orthologous to multiple shared query loci.
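The class names follow directly from how many reference genes and how many query loci take part in a relationship. The function below restates the definitions above as code. It is an illustration of the naming scheme only, not TOGA2's actual orthology resolution procedure, and `orthology_class` with its two count arguments is a hypothetical interface.

```python
def orthology_class(n_reference_genes: int, n_query_genes: int) -> str:
    """Name the relationship between a group of reference genes and query loci.

    n_query_genes counts only query loci with present orthologs; loci whose
    projections are all lost/missing are expected to be excluded by the caller.
    """
    if n_reference_genes < 1 or n_query_genes < 0:
        raise ValueError("need at least one reference gene and a non-negative query count")
    if n_query_genes == 0:
        # Reported per reference gene, hence always 'one2zero'.
        return "one2zero"
    ref_side = "one" if n_reference_genes == 1 else "many"
    query_side = "one" if n_query_genes == 1 else "many"
    return f"{ref_side}2{query_side}"


for counts in [(1, 0), (1, 1), (1, 3), (2, 1), (2, 2)]:
    print(counts, orthology_class(*counts))
```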

Gene conservation

inactivating_mutations.tsv

Tab-separated table containing data on all identified mutations. This file is similar to inact_mut_check.tsv from TOGA1; however, its format was restructured, and features referring to projections are now listed in meta/transcript_meta.tsv.

For the sake of consistency with TOGA1, this file contains data on all tracked mutations, including those not treated as inactivating (start/stop codon losses, U12/non-canonical U2 splice sites, compensated frameshifts, and intron gains). Consult Inactivating mutations for more information on mutation types and their impact on loss status classification.

projection	exon	triplet	ref_codon	chrom	start	end	type	description	is_masked	masking_reason	mut_id
ENST00000315273.4#ASAP2#4707	2	43_67	43_67	chr12	23313321	23313334	Missing exon	-	NOT_MASKED	-	MIS_1

The table contains 12 columns:

projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000315273.4#ASAP2#4707);
exon: exon number; numeration is 1-based, corresponds to reference exon structure, and accounts for coding exons only (2);
triplet: number of the affected codon (triplet) in the codon alignment produced by TOGA2 (43_67);
- Triplet numeration is 1-based;
- If a mutation spans multiple triplets, the first and last triplet numbers are given, separated by an underscore;
ref_codon: number of the affected codon in the reference sequence (43_67);
- Reference codon numeration is 1-based (the first codon is codon 1);
- If a mutation spans multiple codons, the first and last codon numbers are given, separated by an underscore;
chrom: query contig/scaffold/chromosome name (chr12);
start: start position of the mutation or the affected codon in the query (23313321);
end: end position of the mutation or the affected codon in the query (23313334);
type: mutation type (Missing exon);
description: extended mutation description, if available (-);
is_masked: specifies whether the mutation is treated as inactivating (NOT_MASKED) or not (MASKED). Only NOT_MASKED mutations are considered when determining the gene classification (intact, lost, etc.);
masking_reason: if a mutation is MASKED, this field explains the reason for masking (-);
mut_id: internal mutation identifier; numeration is projection-specific (MIS_1).

For the list of mutation types and masking reasons, as well as their impact on loss classification, consult the Inactivating mutations page.
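A typical first step when working with this table is to keep only the mutations that count towards the loss classification and to expand the underscore-separated ranges. The sketch below assumes the file starts with the header line shown in the example; `parse_span` and `unmasked_mutations` are hypothetical helper names.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Tuple


def parse_span(value: str) -> Tuple[int, int]:
    """Turn '43_67' into (43, 67) and a single number '12' into (12, 12)."""
    first, _, last = value.partition("_")
    return int(first), int(last or first)


def unmasked_mutations(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield rows for mutations that are treated as inactivating."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["is_masked"] == "NOT_MASKED":
            yield row


# Header and example row from this page.
MUTATIONS_TSV = "\n".join([
    "\t".join(["projection", "exon", "triplet", "ref_codon", "chrom", "start",
               "end", "type", "description", "is_masked", "masking_reason", "mut_id"]),
    "\t".join(["ENST00000315273.4#ASAP2#4707", "2", "43_67", "43_67", "chr12",
               "23313321", "23313334", "Missing exon", "-", "NOT_MASKED", "-", "MIS_1"]),
])
rows = list(unmasked_mutations(io.StringIO(MUTATIONS_TSV)))
print(rows[0]["mut_id"], parse_span(rows[0]["triplet"]))
```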

loss_summary.tsv

Three-column tab-separated table listing the classification of projections, transcripts and genes in the query. Successor of TOGA1’s loss_summ_data.tsv.

level	entry	status
PROJECTION	ENST00000641156.1#OR56A4#1995	FI
TRANSCRIPT	ENST00000641156.1#OR56A4	FI
GENE	ENSG00000183389	FI

Columns in the file are:

level: PROJECTION for query projections, TRANSCRIPT for reference transcripts, and GENE for reference genes;
entry: entity’s ID;
status: entity’s loss status

Loss statuses for projections in the query are imported from meta/transcript_meta.tsv. At the transcript level, the classification is inferred by considering all orthologous projections of the respective transcript in the query and the ranking below. Similarly, if --isoforms file is provided, the classification of the gene is inferred by considering all its transcripts (isoforms).

Note

If --isoforms file is not provided, GENE-level entries do not appear in the file.

For the gene loss classification procedure and available loss classes, consult the Loss status classification page.
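Counting statuses per level gives a quick overview of a run. A minimal sketch, assuming the header line shown in the example is present; `tally_statuses` is a hypothetical helper. Checking for the GENE level also reveals whether an isoform file was supplied.

```python
import csv
import io
from collections import Counter, defaultdict
from typing import Dict, TextIO


def tally_statuses(handle: TextIO) -> Dict[str, Counter]:
    """Count loss statuses separately for each level."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["level"]][row["status"]] += 1
    return dict(counts)


# Example rows from this page.
LOSS_SUMMARY_TSV = "\n".join([
    "level\tentry\tstatus",
    "PROJECTION\tENST00000641156.1#OR56A4#1995\tFI",
    "TRANSCRIPT\tENST00000641156.1#OR56A4\tFI",
    "GENE\tENSG00000183389\tFI",
])
status_counts = tally_statuses(io.StringIO(LOSS_SUMMARY_TSV))
has_gene_level = "GENE" in status_counts  # False when --isoforms was not provided
print(status_counts, has_gene_level)
```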

Alignments

codon_aln.fa(.gz)

Multi-FASTA file listing the pairwise codon alignments between reference and query nucleotide sequences for all projections. Compressed into gzip format by default. Individual codons are separated by space.

>ENST00000393432.9#HNRNPH1#13270| 13270 | CODON | REFERENCE
ATG ATG TTG GGC ACG GAA GGT GGA GAG GGA TTC GTG GTG AAG GTC CGG GGC TTG CCC TGG TCT TGC TCG GCC GAT GAA GTG CAG AGG TTT TTT TCT GAC TGC AAA ATT CAA AAT GGG GCT CAA GGT ATT CGT TTC ATC TAC ACC AGA GAA GGC AGA CCA AGT GGC GAG GCT TTT GTT GAA CTT GAA TCA GAA GAT GAA GTC AAA TTG GCC CTG AAA AAA GAC AGA GAA ACT ATG GGA CAC AGA TAT GTT GAA GTA TTC AAG TCA AAC AAC GTT GAA ATG GAT TGG GTG TTG AAG CAT ACT GGT CCA AAT AGT CCT GAC ACG GCC AAT GAT GGC TTT GTA CGG CTT AGA GGA CTT CCC TTT GGA TGT AGC AAG GAA GAA ATT GTT CAG TTC TTC TCA GGG TTG GAA ATC GTG CCA AAT GGG ATA ACA TTG CCG GTG GAC TTC CAG GGG AGG AGT ACG GGG GAG GCC TTC GTG CAG TTT GCT TCA CAG GAA ATA GCT GAA AAG GCT CTA AAG AAA CAC AAG GAA AGA ATA GGG CAC AGG TAT ATT GAA ATC TTT AAG AGC AGT AGA GCT GAA GTT AGA ACT CAT TAT GAT CCA CCA CGA AAG CTT ATG GCC ATG CAG CGG CCA GGT CCT TAT GAC AGA CCT GGG GCT GGT AGA GGG TAT AAC AGC ATT GGC AGA GGA GCT GGC TTT GAG AGG ATG AGG CGT GGT GCT TAT GGT GGA GGC TAT GGA GGC TAT GAT GAT TAC AAT GGC TAT AAT GAT GGC TAT GGA TTT GGG TCA GAT AGA TTT GGA AGA GAC CTC AAT TAC TGT TTT TCA GGA ATG TCT GAT CAC AGA TAC GGG GAT GGT GGC TCT ACT TTC CAG AGC ACA ACA GGA CAC TGT GTA CAC ATG CGG GGA TTA CCT TAC AGA GCT ACT GAG AAT GAC ATT TAT AAT TTT TTT TCA CCG CTC AAC CCT GTG AGA GTA CAC ATT GAA ATT GGT CCT GAT GGC AGA GTA ACT GGT GAA GCA GAT GTC GAG TTC GCA ACT CAT GAA GAT GCT GTG GCA GCT ATG TCA AAA GAC AAA GCA AAT ATG CAA CAC AGA TAT GTA GAA CTC TTC TTG AAT TCT ACA GCA GGA GCA AGC GGT GGT GCT TAC GAA CAC AGA TAT GTA GAA CTC TTC TTG AAT TCT ACA GCA GGA GCA AGC GGT GGT GCT TAT GGT AGC CAA ATG ATG GGA GGC ATG GGC TTG TCA AAC CAG TCC AGC TAC GGG GGC CCA GCC AGC CAG CAG CTG AGT GGG GGT TAC GGA GGC GGC TAC GGT GGC CAG AGC AGC ATG AGT GGA TAC GAC CAA GTT TTA CAG GAA AAC TCC AGT GAT TTT CAA TCA AAC ATT GCA XXX
>ENST00000393432.9#HNRNPH1#13270| 13270 | CODON | QUERY
ATG ATG CTG GGC ACA GAA GGC AGG GAG GGT TTC GTG GTG AAG GTC AGG GGC CTA CCC TGG TCC TGC TCT GCC GAT GAA GTG ATG CGC TTC TTT TCT GAT TGC AAA ATC CAA AAT GGC ACA TCA GGT ATC CGT TTC ATC TAT ACC AGA GAA GGC AGA CCA AGT GGT GAA GCA TTT GTT GAA CTT GAA TCA GAA GAT GAA GTG AAA TTG GCT TTG AAG AAG GAC AGA GAA ACC ATG GGA CAC AGA TAT GTT GAA GTA TTC AAG TCC AAT AGT GTT GAA ATG GAT TGG GTA TTG AAG CAT ACA GGT CCG AAT AGT CCC GAT ACT GCC AAT GAT GGC TTC GTC CGT CTT CGA GGA CTC CCG TTT GGC TGT AGC AAG GAG GAG ATT GTT CAG TTT TTT TCA GGG CTG GAA ATT GTG CCA AAT GGG ATG ACA CTG CCG GTG GAC TTT CAG GGG CGG AGC ACA GGG GAG GCC TTT GTG CAG TTT GCT TCA CAG GAG ATA GCT GAA AAG GCC TTA AAG AAA CAC AAG GAA AGA ATA GGG CAC AGG TAC ATT GAA ATC TTT AAG AGT AGC CGA GCT GAA GTC CGA ACC CAC TAT GAC CCC CCT CGA AAG CTC ATG GCT ATG CAA CGA CCA GGT CCC TAT GAT AGG CCA GGG GCC GGC AGA GGG TAT AAT AGT ATT GGA AGA GGG ACT GGG TTT GAA AGG ATG AGG CGG GGT GCC TAT GGT GGA GGG TAT GGA GGC TAT GAT GAT TAT GGT GGC TAT AAT GAT GGC TAT GGC TTT GGG TCT GAT AGA TTT GGA AGA GAT CTC AAT TAC TGT TTT TCA GGA ATG TCT GAT CAT AGA TAC GGA GAT GGT GGG TCC AGT TTC CAA AGC ACC ACA GGG CAC TGT GTA CAC ATG AGG GGA TTA CCT TAC AGA GCT ACT GAA AAT GAC ATT TAC AAT TTT TTC TCA CCT CTT AAC CCC ATG AGA GTA CAC ATT GAA ATT GGA CCT GAT GGC AGA GTT ACT GGT GAG GCA GAT GTT GAA TTT GCT ACT CAT GAA GAT GCC GTG GCA GCT ATG GCA AAA GAT AAG GCT AAT ATG CAA CAC AGA TAT GTG GAG CTC TTC TTA AAT TCT ACT GCA GGA ACA AGT GGT GGG GCT TAT GAT CAC AGC TAT GTA GAA CTC TTT TTG AAT TCT ACA GCA GGG GCA AGT GGT GGT GCT TAT GGT AGC CAA ATG ATG GGA GGG ATG GGC TTA TCC AAC CAG TCT AGT TAT GGG GGT CCT GCT AGC CAG CAG CTG AGT GGT GGT TAC GGG GGT GGT TAT GGT GGT CAG AGC AGT ATG AGT GGA TAT GAC CAA GTT CTG CAG GAA AAT TCC AGT GAC TAT CAG TCA AAC CTT GCG XXX

Fasta headers contain the following fields, separated with a pipe-with-whitespaces (‘ | ’) delimiter:

projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000393432.9#HNRNPH1#13270)
chain_id: number of the chain the projection was annotated through (13270);
CODON: keyword indicating that the sequence comes from the codon alignment file;
source: indicates whether the sequence corresponds to reference transcript (REFERENCE) or query projection (QUERY)
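Since codons are space-separated, reference and query records can be compared codon by codon. The sketch below is illustrative: `parse_codon_header` and `identical_codons` are hypothetical helpers, and the sequences are the first six codons of the example above, truncated for brevity. Header fields are split on the bare pipe and then stripped, which also handles headers where whitespace around the delimiter is uneven.

```python
from typing import Dict, Tuple


def parse_codon_header(header: str) -> Dict[str, str]:
    """Split a codon alignment header into its four fields."""
    fields = [p.strip() for p in header.lstrip(">").split("|")]
    if len(fields) != 4 or fields[2] != "CODON":
        raise ValueError(f"not a codon alignment header: {header!r}")
    projection, chain_id, _, source = fields
    return {"projection": projection, "chain_id": chain_id, "source": source}


def identical_codons(reference: str, query: str) -> Tuple[int, int]:
    """Return (number of identical codon columns, total codon columns)."""
    ref_codons, query_codons = reference.split(), query.split()
    if len(ref_codons) != len(query_codons):
        raise ValueError("aligned sequences differ in codon count")
    same = sum(r == q for r, q in zip(ref_codons, query_codons))
    return same, len(ref_codons)


REF_HEADER = ">ENST00000393432.9#HNRNPH1#13270| 13270 | CODON | REFERENCE"
REF_CODONS = "ATG ATG TTG GGC ACG GAA"    # truncated from the example above
QUERY_CODONS = "ATG ATG CTG GGC ACA GAA"  # truncated from the example above

header_fields = parse_codon_header(REF_HEADER)
print(header_fields, identical_codons(REF_CODONS, QUERY_CODONS))
```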

exon_aln.fa(.gz)

Multi-FASTA file listing per-exon nucleotide sequence alignments between reference and query exons. Compressed into gzip format by default. Reference-query exon pairs are presented projection-wise in ascending exon number order. Exon numeration starts with 1 and follows exon order in the reference transcript. This means that exons merged in the query due to precise intron deletion are still presented as separate entries; likewise, exons split in the query due to intron gain are presented as single entities, with the query intron sequence given in lowercase. Note that only coding sequence exons are considered, even if untranslated region annotation was not disabled.

>ENST00000270112.7#HUNK#58 | 9 | 58 | reference_exon
GCCTCTCTGGACACCTGGACACGAGATCTTGAATTCCATGCCGTGCAG
>ENST00000270112.7#HUNK#58 | 9 | 58 | scaffold_3:1435950-1435998 | 81.25 | 74.42 | scaffold_3:1435950-1435998 | INC | ORTHOLOG | query_exon
GCCTCCCTGGACGCCTGGACGCGGGACCTGGACTTCCCTGCCGTGCGG

Fasta headers contain the following fields, separated with a pipe-with-whitespaces (‘ | ’) delimiter:

For reference exons:
- projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000270112.7#HUNK#58)
- exon: exon number (9)
- chain_id: number of the chain the projection was annotated through (58)
- reference_exon: keyword to indicate that the following sequence is the reference exon
For query exons:
- projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000270112.7#HUNK#58)
- exon: exon number (9)
- chain_id: number of the chain the projection was annotated through (58)
- coordinates: query coordinates given as scaffold:start-end (scaffold_3:1435950-1435998)
- %id: %nucleotide identity between the reference and query exon (81.25)
- %blosum: %BLOSUM score between the reference and query exon (74.42)
- expected_coordinates: expected query coordinates, given as scaffold:start-end, based on the chain alignment (scaffold_3:1435950-1435998). Note: these can differ from the final annotated exon coordinates, as TOGA2 may shift splice sites or use additional flanking space for exon alignment. This field is mostly relevant for debugging purposes.
- expected_locus: correspondence to the expected locus (INC). INC indicates that the final exon coordinates intersect the expected coordinates by at least one base; EXCL indicates otherwise.
- orthology_status: orthology status (ORTHOLOG). Possible keywords are ORTHOLOG for orthologous projections, PARALOG for paralogous projections, and PROCESSED_PSEUDOGENE for processed pseudogenes/retrogenes.
- query_exon: keyword to indicate that the following sequence is the query exon
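Reference and query headers have different numbers of fields, so a parser can dispatch on the trailing keyword. The sketch below is a hypothetical helper (`parse_exon_header`), not a TOGA2 API. As a sanity check it also recomputes nucleotide identity for the gap-free example pair by plain position-wise matching; how TOGA2 treats gaps when computing %id is not covered here.

```python
from typing import Any, Dict


def parse_exon_header(header: str) -> Dict[str, Any]:
    """Parse a reference_exon or query_exon header from the exon alignment file."""
    parts = [p.strip() for p in header.lstrip(">").split("|")]
    kind = parts[-1]
    record: Dict[str, Any] = {
        "projection": parts[0], "exon": int(parts[1]),
        "chain_id": parts[2], "kind": kind,
    }
    if kind == "reference_exon" and len(parts) == 4:
        return record
    if kind == "query_exon" and len(parts) == 10:
        record.update(
            coordinates=parts[3], pct_id=float(parts[4]), pct_blosum=float(parts[5]),
            expected_coordinates=parts[6], expected_locus=parts[7],
            orthology_status=parts[8],
        )
        return record
    raise ValueError(f"unrecognised exon header: {header!r}")


# Example records from this page.
QUERY_HEADER = (">ENST00000270112.7#HUNK#58 | 9 | 58 | scaffold_3:1435950-1435998 | "
                "81.25 | 74.42 | scaffold_3:1435950-1435998 | INC | ORTHOLOG | query_exon")
REF_EXON = "GCCTCTCTGGACACCTGGACACGAGATCTTGAATTCCATGCCGTGCAG"
QUERY_EXON = "GCCTCCCTGGACGCCTGGACGCGGGACCTGGACTTCCCTGCCGTGCGG"

query_record = parse_exon_header(QUERY_HEADER)
matches = sum(r == q for r, q in zip(REF_EXON, QUERY_EXON))
recomputed_pct_id = round(100 * matches / len(REF_EXON), 2)
print(query_record["expected_locus"], recomputed_pct_id)
```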

protein_aln.fa(.gz)

Aggregated pairwise amino acid sequence alignments for annotated projections in FASTA format. Compressed into gzip format by default.

>NM_001252010.2#LUZP2#436 | PROT | REFERENCE
MKFSPAHYLLPLLPALVLSTR-QDYEELEKQLKEVFKERSTILRQLTKTSRELDGIKVNLQSLKNDEQSAKTDVQKLLELGQKQREEMKSLQEALQNQLKETSEKAEKHQATINFLKTEVERK-SKMIRDLQNE---AQQLTDLEQKLAVAKNELEKAA-LD-R-ESQMKAMKETV-QLCLTSVFRDQPPPPLSLITSNPTRMLLPPRNIASKLPDAAAKSK---PQQSASGNNESSQV-EST---KEGNPSTTACDSQD--EGR-PCSMKHKESPPSNATAETEPIPQ-KLQMPPCSECEVKKAPEKPLTSFEGMAAREEKIL*
>NM_001252010.2#LUZP2#436 | PROT | QUERY
MTCCPXLLILPLLQALVLSTSC----------K-CFPEKS--LKKLSNT---------NLK---------KQDNEKLV-LISRKC---------LKNE--EVKEK--KTQSGMILMATGLLRKVGRAV-DLTVEKKK-------EEELV------QKAAFLDNRG-----ATREMISQ---------ENNPPLNLIIEA---------GLIPKLVDFL---KEPREQQSS--QTKASQLTEQTLLRKE-NPGT-------NLERRILCXQMHFESCKAY----------SRSQ-----------APEHP------AEAKEDA--*

Fasta headers contain the following fields, separated with a pipe-with-whitespaces (‘ | ’) delimiter:

projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (NM_001252010.2#LUZP2#436);
PROT: keyword indicating that the sequence comes from the protein alignment file;
source: indicates whether the sequence corresponds to reference transcript (REFERENCE) or query projection (QUERY)

Annotated sequences

nucleotide.fa(.gz)

A multi-FASTA file containing annotated query transcripts (coding sequences only). Compressed into gzip format by default.

>ENST00000315273.4#ASAP2#4707
ATGCCAGAACAGATCTCCGTGTCGGAATTCATAGCCGAGACCCTTGAGGACTACAAGGCGCCCACGGCCTATAGCTTCACCACGCGCACGGCCCAGTGCCGGGACACCATGTCGGCCATCGAGGAGGCCTTGGAGAAAATACTTAGTCTTACAACTCTCACGGGCGACGGCTTCAAGTTCCAATTTTTTGATGCCATTGTAAGTATGGGTGATCTACACAATAAATTGATTGACAAGAATTATAATGACTATAAAGAGACTTGCCAAGATTGAAGAATTCGAATCAATCCACACTTATCTCCTTGTACTAAGGTCAAATCTAAGTGGATCAAGGAACTTCATATAAAACCAGAGACACTGAAACTTATAGAGGAGAAAGTGGGGAAAAGCCTTGAAGATATGGGCACAGGGGAAAAATTCCTGAACAGAACAGCAATGGCTTGTGCTGTCTATGAAGTGCAGTGTGGGTATCCTGTACTTATGGTTATTATTCACTTTTTAGCACCACTCCAGCCATCTTTTAACAGAGCTACACTTGGGAAAACCATCATATTAAAAGACATTGTCAAAGTTCAAGAAGAAATGAAAGGTAATTTTCACGAGGGCCAAGATGCTCAGATTAAAAAATATTTTATGTTTGAGTCAATGCAGGGTTCTGGGCAGCCCCAGAGCAGCTTAGGACAGGTCAGGAAATTGAGGCTGAGAGGTAAGAAGAACCCTATCTCACTGGTTAGCTGTGTGTTGTCCCACTCAGAAGAGGTGGCTGAGCCATGTGAGACAGCTTACCACTCACTGGAGGCTTTCTGTAGGGCTCCTACCCTGCCCACAGAGGGCCTGGAGGCCCTTAAGGACACCAGACAGAACATTTCAGACCAGATGATTTATTCTATGATCCCAGCTGACTTTGTGAGGATGAAGAAGATGGACAACTGGCCAGCAAGTAACAGTGGGAGAATCAGTCCCAAGAGTTTCTCAAGACACACTGAGATAGGTGCTCAGTGCTCTCACCTTCTGTCTGTCTGTCTGTCTACCAGGCTGGCCTTGATCTCAGAAATCTGCCTGCCTCTGCCTCCCAAAGGCATTTCTTTTAAGTATATCGACGTTATTTTTTATTTATCTATTTTTAACTTGTCTGCAAAACATAAGGAAGCTGTTCAAGCCAGGGAATCAAGCCAGCAAAACAAAGATACATTAAGCAGGGAGAGGAGACTCAACCTCAGAGTTCTGGTGTGACAGCCATGAGAAGCAGCAGTACATGTTATCCAGCTCAATCAGAAAAAATCCTTTATGTGAAGGAAATGATATCCACATGGAGGAGGGGTTTCTTCCCCACCCCACCTCATCCAATCAGAATTCACACATGCTCCAGGCACAAGCTTAGGTCATTCACAGATAGGCTGACAGGTCTGGATCACGAGGAACTGTCCAGCCTGGCAGGCAGCCCACACATGCAGCCTCACTATGTGAAAAGTCACACTCTAGAAGGTTCTAGGTCCCTGTGTAGAAAGTGCCTGCCAATGTCTGCAGCACTGGCTTTTGTCAGAGACCATCCCTTGAGAAAGCAAGCCCTTCACAATCTCCCTAGGCAGGACTCTAACCCATTCTCCAAGACGATGTCCCTCCCTGCTGCCCCCATGCACCAAGATCTTAGTGCTGTCAAATCTAAGTGGATCAAGGAACTTCACATAAAACCAGAGACACTGAAACTTATAGAGGAGAAAGTGATTTGGCACCCTCACACAGACATACATACAAGAAAAACACCAATGCCTGTGAAATAA

Headers contain projection names in TOGA2 format (input_transcript#chain_number(s)).

exon_seqs.2bit

Compressed and indexed file in UCSC 2bit format that contains all query transcript sequences.

Fasta headers are query projection identifiers. In the sequence line, all coding nucleotides are uppercase, with lowercase n symbols being used to delimit individual exon sequences. If two or more n symbols follow each other, the respective exon(s) were not aligned or were classified as missing/deleted by TOGA2.

Note that this is not aligned sequence; hence, there are no deletion symbols in this file.

>ENST00000313735.11#NOS2#2666
ATGTATGCTGCATGCTTTTTCTCCCTTTTATTGGAGGTTAACAGTTCTTT
CTCCTTCCAGTTnCAGGCTCAGTACTTCACGGGGTGTGGAAATGnACATC
ATGGCACAATCTTACCGTGAATACCATGTTCATTAAAAAGAAAGCAAAAT
...

Warning

This file is heavily optimised for downstream analysis. Do not use it for exon nucleotide sequence extraction unless you have reasons to do so and you know what you’re doing.

Information about the TOGA2 run

rejected_items.tsv

An extensive table of all transcripts/projections rejected at any point of the TOGA2 pipeline. Columns are:

level: entity level; TRANSCRIPT for reference transcripts/isoforms, PROJECTION for individual projections in the query;
item: transcript/projection identifier;
segment: technical field implemented for debugging purposes; will likely be removed in future releases;
rejection_reason: a verbose explanation of the reasons for rejecting the entity;
rej_id: a concise rejection reason identifier;
loss_status: loss status inferred for the entity at the moment of rejection.

For a breakdown of rejection reasons, see Item rejection.

summary.txt

A short summary of orthology score prediction, gene loss analysis, query gene inference, and orthology classification steps. The same summary is printed to TOGA2 run log at the end of the pipeline. Consult it if you need a short reference on TOGA2 results.

####################################################################################################
#### TOGA2 run summary
####################################################################################################
Reference genome: test_input/hg38.micro_sample.2bit
Query genome: test_input/q2bit_micro_sample.2bit
Genome alignment chain file: test_input/align_micro_sample.chain
Reference annotation file: test_input/annot_micro_sample.bed
Output directory: micro_test_results
NOTE: Reference isoform file was not provided. All gene level statistics are presented assuming that each reference transcript represents a separate gene
####################################################################################################
Orthology prediction statistics:
	Reference transcripts level::
		#reference transcripts subjected to classification: 3
		#reference transcripts with >=1 predicted ortholog: 3 (100.0%)
		#reference transcripts with 1 predicted ortholog: 3 (100.0%)
		#reference transcripts with no predicted orthologs: 0 (0.0%)
		#reference transcripts with no classifiable projections: 0 (0.0%)
		#predicted processed pseudogene/retrogene projections 0:
	Reference genes level:
		#reference genes with >=1 predicted ortholog: 3 (100.0%)
		#reference genes with 1 predicted ortholog: 3 (100.0%)
		#reference genes with no predicted orthologs: 0 (0.0%)
####################################################################################################
Gene loss summary statistics:
	loss classes considered for assessing gene presence: FI,I,PI,UL
	Query projections level:
		#Fully Intact (FI) - 1 (33.333%)
		#Intact (I) - 1 (33.333%)
		#Partially Intact (PI) - 1 (33.333%)
		#Uncertain Losses (UL) - 0 (0.0%)
		#Lost (L) - 0 (0.0%)
		#Missing (M) - 0 (0.0%)
		#projections considered present: 3 (100.0%)
		#projections considered lost/missing: 0 (0.0%)
	Reference transcripts level:
		#Fully Intact (FI) - 1 (33.333%)
		#Intact (I) - 1 (33.333%)
		#Partially Intact (PI) - 1 (33.333%)
		#Uncertain Losses (UL) - 0 (0.0%)
		#Lost (L) - 0 (0.0%)
		#Missing (M) - 0 (0.0%)
		#trancripts considered present: 3 (100.0%)
		#transcripts considered lost/missing: 0 (0.0%)
	Reference genes level:
		#Fully Intact (FI) - 0 (33.333%)
		#Intact (I) - 0 (33.333%)
		#Partially Intact (PI) - 0 (33.333%)
		#Uncertain Losses (UL) - 0 (0.0%)
		#Lost (L) - 0 (0.0%)
		#Missing (M) - 0 (0.0%)
		#reference genes considered present: 3 (100.0%)
		#reference genes considered lost/missing: 0 (0.0%)
####################################################################################################
Orthology resolution:
	#reference genes: 3
	#query genes: 3
		#with defined orthology: 3 (100.0%)
		#lost, missing, or lacking defined orthology: 0 (0.0%) 
	Reference gene orthology class composition:
		one2one: 3 (100.0%)
		one2many: 0 (0.0%)
		many2one: 0 (0.0%)
		many2many: 0 (0.0%)
		one2zero: 0 (0.0%)
