Top level files
Data on transcripts and genes annotated in the query assembly
- query_annotation.bed
- query_annotation.with UTRs.bed
- query_genes.tsv
- query_genes.bed
- processed_pseudogenes.bed
Files describing homology relationships between reference and query items,
Data on functional conservation of annotated homologs, gene loss, and inactivating mutations in the query
Alignments between reference transcripts and their annotated homologs in the query
Full sequences for annotated transcripts
Overview of TOGA2 results, including items discarded from the final output
The ultimate TOGA2 annotation output, containing all projections annotated by TOGA2 in BED12 format. This file contains:
- a final set of annotated orthologous projections, including those classified as Lost or Missing;
- paralogous projections for transcripts which ended up without predicted orthologs in the query;
- (if the --annotate_processed_pseudogenes flag was set) processed pseudogene projections with loss status of Intact or Fully Intact.
chr4 120267916 120287252 NM_001291281.3#FOXO6#5 0 - 120267916 120287252 0,0,100 3 475,593,414, 0,673,18922,
BED12 fields are:
- chrom: name of the chromosome (contig, scaffold, etc.) the projection was annotated in (chr4);
- chromStart: start of the annotated item in the query (120267916);
- chromEnd: end of the annotated item in the query (120287252);
- name: projection name in TOGA2 notation (input_transcript#chain_number(s)) (NM_001291281.3#FOXO6#5);
  - Fragmented projections bear chain numbers for all annotated fragments as a comma-separated list, e.g. ENST00000369354.7#PDE4DIP#528,345
  - Intact processed pseudogenes/retrogenes are designated with the “#retro” postfix, e.g. NM_001354845.2#CCNB1#26913#retro
- score: set to 0 in the output BED files;
- thickStart: start of the annotated coding sequence in the projection (120267916);
  - in query_annotation.bed, equals chromStart
- thickEnd: end of the annotated coding sequence in the projection (120287252);
  - in query_annotation.bed, equals chromEnd
- itemRgb: colour in RGB format corresponding to the projection loss status (0,0,100);
- blockCount: number of exons in this projection (3);
- blockSizes: exon sizes, listed from 5’ to 3’ (475,593,414);
- blockStarts: relative start positions of exons, listed from 5’ to 3’ (0,673,18922,). Position 0 corresponds to chromStart.
For all projections that reached the CESAR alignment step and all exons aligned by CESAR, see meta/query_annotation.with_discarded_exons.bed.
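The field layout above can be unpacked with a few lines of Python. The following is a minimal sketch, not part of TOGA2: the `Projection` type and `parse_bed12` helper are hypothetical names, and the code assumes the projection name always ends with the chain number(s), optionally followed by the “#retro” postfix.

```python
from typing import List, NamedTuple, Tuple


class Projection(NamedTuple):
    transcript: str               # reference transcript part of the name
    chains: List[int]             # one chain ID, or several for fragmented projections
    is_retro: bool                # True if the name carries the "#retro" postfix
    exons: List[Tuple[int, int]]  # absolute (start, end) of every block


def parse_bed12(line: str) -> Projection:
    """Split a BED12 line and convert relative blocks to absolute coordinates."""
    f = line.split()
    chrom_start = int(f[1])
    name = f[3]
    is_retro = name.endswith("#retro")
    if is_retro:
        name = name[: -len("#retro")]
    # The chain field is the last '#'-separated element of the name
    transcript, chain_field = name.rsplit("#", 1)
    chains = [int(c) for c in chain_field.split(",")]
    # blockSizes/blockStarts may carry a trailing comma
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    exons = [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]
    return Projection(transcript, chains, is_retro, exons)


line = (
    "chr4\t120267916\t120287252\tNM_001291281.3#FOXO6#5\t0\t-\t"
    "120267916\t120287252\t0,0,100\t3\t475,593,414,\t0,673,18922,"
)
proj = parse_bed12(line)
print(proj.transcript, proj.chains)
print(proj.exons)
```

For the example line, the last block ends at 120287252, which equals chromEnd, as expected for a well-formed BED12 record.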
Counterpart of query_annotation.bed with added untranslated regions (UTR) annotation, in BED12 format. Does not appear if TOGA2 --no_utr_annotation flag was set.
chr4 120267078 120287743 NM_001291281.3#FOXO6#5 0 - 120267916 120287252 0,0,100 3 1313,593,905, 0,1511,19760,
Field specification is the same as in query_annotation.bed. Important differences are:
- If chromStart != thickStart, the projection has annotated UTR sequence upstream (on the 5’ side). Likewise, if chromEnd != thickEnd, the projection has an annotated UTR on the downstream (3’) side;
- blockCount contains the total number of exons, including both protein-coding and untranslated ones. Exons containing both coding and untranslated sequence are counted once;
- blockSizes and blockStarts contain sizes and relative start positions for both coding and untranslated exons. Exons containing both coding and untranslated sequence are counted once.
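As an illustration of the differences listed above, the sketch below counts exonic bases that fall outside the thickStart–thickEnd interval. The `utr_bases` helper is a hypothetical name; it reports the two sides in genomic coordinate order (start side, end side) and makes no assumption about strand.

```python
from typing import Tuple


def utr_bases(line: str) -> Tuple[int, int]:
    """Return exonic bases before thickStart and after thickEnd."""
    f = line.split()
    chrom_start = int(f[1])
    thick_start, thick_end = int(f[6]), int(f[7])
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    start_side = end_side = 0
    for s, n in zip(starts, sizes):
        start, end = chrom_start + s, chrom_start + s + n
        # Part of the block lying before the coding region
        start_side += max(0, min(end, thick_start) - start)
        # Part of the block lying after the coding region
        end_side += max(0, end - max(start, thick_end))
    return start_side, end_side


utr_line = (
    "chr4\t120267078\t120287743\tNM_001291281.3#FOXO6#5\t0\t-\t"
    "120267916\t120287252\t0,0,100\t3\t1313,593,905,\t0,1511,19760,"
)
print(utr_bases(utr_line))
```

Subtracting these values from the first and last block sizes (1313 and 905) recovers the coding block sizes 475 and 414 of the corresponding query_annotation.bed entry.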
Tab-separated list of all orthologous projections that are assigned to a query gene. The list is always given as pairs of:
- query_gene: name of query gene/locus;
- projection: projection identifier for a transcript annotated by TOGA2 in this locus.
query_gene projection
ENSG00000206579 ENST00000327381.7#XKR4#143
ENSG00000104237 ENST00000220676.2#RP1#143
ENSG00000104237 ENST00000636932.1#RP1#143
reg_2 ENST00000398004.4#SLC35E3#35324
Note that this file represents gene-projection pairs in ‘long data’ format: if multiple projections correspond to a single query gene, each of them gets a separate line in the file.
Query genes are named after their orthologs in the reference. Loci that do not have an established ortholog (having only Lost/Missing projections assigned to them) are named in the reg_${id_number} format. See Query gene naming in TOGA2 for more logic behind query gene names.
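Because of the long format, collecting all projections of a gene takes one pass over the file. The sketch below assumes the file starts with the header line shown in the example; `projections_by_gene` is a hypothetical helper, and the in-memory sample simply repeats the rows above.

```python
import csv
import io
from collections import defaultdict

sample_genes = (
    "query_gene\tprojection\n"
    "ENSG00000206579\tENST00000327381.7#XKR4#143\n"
    "ENSG00000104237\tENST00000220676.2#RP1#143\n"
    "ENSG00000104237\tENST00000636932.1#RP1#143\n"
    "reg_2\tENST00000398004.4#SLC35E3#35324\n"
)


def projections_by_gene(handle):
    """Group projection identifiers under their query gene."""
    genes = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        genes[row["query_gene"]].append(row["projection"])
    return dict(genes)


genes = projections_by_gene(io.StringIO(sample_genes))
# Loci without an established ortholog follow the reg_${id_number} pattern
unnamed_loci = [g for g in genes if g.startswith("reg_")]
print(genes["ENSG00000104237"])
print(unnamed_loci)
```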
BED6 file containing coordinates for genes listed in query_genes.tsv:
chr1 3216021 3671348 ENSG00000206579 0 -
chr1 4120003 4352825 ENSG00000104237 0 -
chr1 3531794 3532739 reg_2 0 +
Processed pseudogene annotation in BED9/BED12 format. The number of columns depends on TOGA2 starting settings:
- if --annotate_processed_pseudogenes is set, the file contains annotated pseudogene projections in BED12 format; RGB colour in column 9 represents the loss status of the projection:
chr14 103370464 103371423 ENST00000222286.9#GAPDHS#294601 0 - 103370464 103371423 255,160,120 1 959, 0,
- otherwise, overlapping query spans are merged into processed pseudogene regions and presented in BED9 format; RGB colour in column 9 is set to pink:
chr14 103370464 103371423 ENST00000222286.9#GAPDHS#294601 0 - 103370464 103371423 250,50,200
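Since the column count depends on the run settings, a downstream script may want to check which layout it received before parsing. A minimal sketch, assuming tab-separated fields; `pseudogene_bed_layout` is a hypothetical helper name.

```python
def pseudogene_bed_layout(line: str) -> str:
    """Report whether a processed_pseudogenes.bed line is BED9 or BED12."""
    n_fields = len(line.rstrip("\n").split("\t"))
    if n_fields == 12:
        return "BED12"  # individual pseudogene projections
    if n_fields == 9:
        return "BED9"   # merged processed pseudogene regions
    raise ValueError(f"unexpected number of columns: {n_fields}")


bed9_line = (
    "chr14\t103370464\t103371423\tENST00000222286.9#GAPDHS#294601\t0\t-\t"
    "103370464\t103371423\t250,50,200"
)
bed12_line = (
    "chr14\t103370464\t103371423\tENST00000222286.9#GAPDHS#294601\t0\t-\t"
    "103370464\t103371423\t255,160,120\t1\t959,\t0,"
)
print(pseudogene_bed_layout(bed9_line), pseudogene_bed_layout(bed12_line))
```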
Table of orthology probabilities predicted by the XGBoost classifier for all projections. Direct successor of TOGA1’s temp/orthology_scores.tsv file.
transcript chain pred
ENST00000249284.3#TAS2R16 510467 0.003924538381397724
Contains the following three columns:
- transcript: reference transcript name (ENST00000249284.3#TAS2R16);
- chain: chain identifier (510467);
- pred: orthology probability according to the classifier (0.003924538381397724). Special values are reserved for programmatically defined projection classes, where the classifier results cannot be applied:
  - -1.0 - for spanning chains
  - -2.0 - for processed pseudogenes
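When reading this table, the reserved values should be separated from genuine probabilities before any thresholding or plotting. The sketch below shows one way to do this; `parse_score_row` is a hypothetical helper, and the second row is an invented example used only to show the -2.0 case.

```python
SPECIAL_VALUES = {-1.0: "spanning chain", -2.0: "processed pseudogene"}


def parse_score_row(line: str):
    """Return (transcript, chain, pred, label) for one data row."""
    transcript, chain, pred = line.rstrip("\n").split("\t")
    value = float(pred)
    # Reserved values mark classes the classifier was not applied to
    label = SPECIAL_VALUES.get(value, "classifier probability")
    return transcript, int(chain), value, label


row = parse_score_row("ENST00000249284.3#TAS2R16\t510467\t0.003924538381397724")
# Hypothetical row, for illustration only
retro_row = parse_score_row("ENST00000249284.3#TAS2R16\t12345\t-2.0")
print(row[3], "|", retro_row[3])
```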
A five-column tab-separated file describing orthology relationship of genes/transcripts between reference and query. Direct successor of TOGA1’s orthology_classification.tsv.
t_gene t_transcript q_gene q_transcript orthology_class
ENSG00000000003 ENST00000373020.9#TSPAN6 ENSG00000000003 ENST00000373020.9#TSPAN6#11 one2one
Columns are:
- t_gene: reference gene symbol (ENSG00000000003); if --isoforms file is not provided, reference transcript name is used instead;
- t_transcript: reference transcript name (ENST00000373020.9#TSPAN6);
- q_gene: query gene name (ENSG00000000003);
- q_transcript: query transcript name (ENST00000373020.9#TSPAN6#11);
- orthology_class: orthology relationship class (one2one).
Orthology relationship classes correspond to TOGA1 categories:
- one2zero: reference gene has no annotated orthologs in the query, or all of its projections were classified as lost/missing;
- one2one: reference gene has precisely one ortholog in the query;
- one2many: reference gene has more than one ortholog in the query, indicating gene duplication in the query or gene loss in the reference;
- many2one: multiple reference genes are orthologous to a single query locus; indicates gene loss in the query or gene duplication in the reference;
- many2many: confounded orthology relationship, with multiple reference genes being classified as orthologous to multiple shared query loci.
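The class names follow directly from how many reference and query genes take part in a relationship. The function below is only a sketch of that naming scheme, not TOGA2's orthology resolution procedure; `orthology_class` and its arguments are hypothetical.

```python
def orthology_class(n_ref_genes: int, n_query_genes: int) -> str:
    """Name the relationship from the number of genes on each side."""
    if n_ref_genes < 1:
        raise ValueError("at least one reference gene is required")
    if n_query_genes == 0:
        # No annotated ortholog, or all projections lost/missing
        return "one2zero"
    ref_side = "one" if n_ref_genes == 1 else "many"
    query_side = "one" if n_query_genes == 1 else "many"
    return f"{ref_side}2{query_side}"


for counts in [(1, 0), (1, 1), (1, 3), (2, 1), (2, 2)]:
    print(counts, orthology_class(*counts))
```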
Tab-separated table containing data on all identified mutations. This file is similar to inact_mut_check.tsv from TOGA1; however, its format was restructured, and features referring to projections are now listed in meta/transcript_meta.tsv.
For the sake of consistency with TOGA1, this file contains data on all tracked mutations, including those not treated as inactivating (start/stop codon losses, U12/non-canonical U2 splice sites, compensated frameshifts, and intron gains). Consult Inactivating mutations for more information on mutation types and their impact on loss status classification.
projection exon triplet ref_codon chrom start end type description is_masked masking_reason mut_id
ENST00000315273.4#ASAP2#4707 2 43_67 43_67 chr12 23313321 23313334 Missing exon - NOT_MASKED - MIS_1
The table contains 12 columns:
- projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000315273.4#ASAP2#4707);
- exon: exon number; numeration is 1-based, corresponds to reference exon structure, and accounts for coding exons only (2);
- triplet: number of the affected codon (triplet) in the codon alignment produced by TOGA2 (43_67);
  - Triplet numeration is 1-based;
  - If a mutation spans multiple triplets, the first and last triplet numbers are given, separated by an underscore;
- ref_codon: number of the affected codon in the reference sequence (43_67);
  - Reference codon numeration is 1-based;
  - If a mutation spans multiple codons, the first and last codon numbers are given, separated by an underscore;
- chrom: query contig/scaffold/chromosome name (chr12);
- start: start position of the mutation or the affected codon in the query (23313321);
- end: end position of the mutation or the affected codon in the query (23313334);
- type: mutation type (Missing exon);
- description: extended mutation description, if available (-);
- is_masked: specifies whether the mutation is treated as inactivating (NOT_MASKED) or not (MASKED). Only NOT_MASKED mutations are considered when determining the loss classification (intact, lost, etc.);
- masking_reason: if a mutation is MASKED, this field explains the reason for masking (-);
- mut_id: internal mutation identifier; numeration is projection-specific (MIS_1).
For the list of mutation types and masking reasons, as well as their impact on loss classification, consult the Inactivating mutations page.
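The sketch below reads the table, keeps only NOT_MASKED rows, and expands the underscore-separated codon ranges. It assumes the header line shown in the example is present and that triplet and ref_codon values are numeric; `parse_range` and `unmasked_mutations` are hypothetical helpers. Splitting must be done on tabs, because values such as “Missing exon” contain spaces.

```python
import csv
import io


def parse_range(value: str):
    """Turn '43_67' into (43, 67) and a single number '12' into (12, 12)."""
    first, _, last = value.partition("_")
    return int(first), int(last or first)


def unmasked_mutations(handle):
    """Yield rows for mutations that are treated as inactivating."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["is_masked"] == "NOT_MASKED":
            yield row


sample_mutations = (
    "projection\texon\ttriplet\tref_codon\tchrom\tstart\tend\ttype\t"
    "description\tis_masked\tmasking_reason\tmut_id\n"
    "ENST00000315273.4#ASAP2#4707\t2\t43_67\t43_67\tchr12\t23313321\t23313334\t"
    "Missing exon\t-\tNOT_MASKED\t-\tMIS_1\n"
)
mutations = list(unmasked_mutations(io.StringIO(sample_mutations)))
print(mutations[0]["type"], parse_range(mutations[0]["triplet"]))
```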
Three-column tab-separated table listing the classification of projections, transcripts and genes in the query. Successor of TOGA1’s loss_summ_data.tsv.
level entry status
PROJECTION ENST00000641156.1#OR56A4#1995 FI
TRANSCRIPT ENST00000641156.1#OR56A4 FI
GENE ENSG00000183389 FI
Columns in the file are:
- level: PROJECTION for query projections, TRANSCRIPT for reference transcripts, and GENE for reference genes;
- entry: entity’s ID;
- status: entity’s loss status.
Loss statuses for projections in the query are imported from meta/transcript_meta.tsv. At the transcript level, the classification is inferred by considering all orthologous projections of the respective transcript in the query and the ranking below. Similarly, if --isoforms file is provided, the classification of the gene is inferred by considering all its transcripts (isoforms).
Note
If --isoforms file is not provided, GENE-level entries do not appear in the file
For gene loss classification procedure and available loss classes, consult the Loss status classification page
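A typical use of this table is to pull out one level and check which entries count as present. The sketch below assumes the header line from the example; the `PRESENT` set is taken from the sample run summary (FI, I, PI, UL) and may not match every run, and `statuses_at_level` is a hypothetical helper.

```python
import csv
import io

# Classes treated as "present" in the sample run summary; an assumption here
PRESENT = frozenset({"FI", "I", "PI", "UL"})


def statuses_at_level(handle, level: str) -> dict:
    """Map entry ID to loss status for one level (PROJECTION/TRANSCRIPT/GENE)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["entry"]: row["status"] for row in reader if row["level"] == level}


sample_loss = (
    "level\tentry\tstatus\n"
    "PROJECTION\tENST00000641156.1#OR56A4#1995\tFI\n"
    "TRANSCRIPT\tENST00000641156.1#OR56A4\tFI\n"
    "GENE\tENSG00000183389\tFI\n"
)
gene_status = statuses_at_level(io.StringIO(sample_loss), "GENE")
present_genes = sorted(g for g, s in gene_status.items() if s in PRESENT)
print(gene_status, present_genes)
```

If the --isoforms file was not provided, the GENE-level dictionary is simply empty.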
Multi-FASTA file listing the pairwise codon alignments between reference and query nucleotide sequences for all projections. Compressed into gzip format by default. Individual codons are separated by space.
>ENST00000393432.9#HNRNPH1#13270| 13270 | CODON | REFERENCE
ATG ATG TTG GGC ACG GAA GGT GGA GAG GGA TTC GTG GTG AAG GTC CGG GGC TTG CCC TGG TCT TGC TCG GCC GAT GAA GTG CAG AGG TTT TTT TCT GAC TGC AAA ATT CAA AAT GGG GCT CAA GGT ATT CGT TTC ATC TAC ACC AGA GAA GGC AGA CCA AGT GGC GAG GCT TTT GTT GAA CTT GAA TCA GAA GAT GAA GTC AAA TTG GCC CTG AAA AAA GAC AGA GAA ACT ATG GGA CAC AGA TAT GTT GAA GTA TTC AAG TCA AAC AAC GTT GAA ATG GAT TGG GTG TTG AAG CAT ACT GGT CCA AAT AGT CCT GAC ACG GCC AAT GAT GGC TTT GTA CGG CTT AGA GGA CTT CCC TTT GGA TGT AGC AAG GAA GAA ATT GTT CAG TTC TTC TCA GGG TTG GAA ATC GTG CCA AAT GGG ATA ACA TTG CCG GTG GAC TTC CAG GGG AGG AGT ACG GGG GAG GCC TTC GTG CAG TTT GCT TCA CAG GAA ATA GCT GAA AAG GCT CTA AAG AAA CAC AAG GAA AGA ATA GGG CAC AGG TAT ATT GAA ATC TTT AAG AGC AGT AGA GCT GAA GTT AGA ACT CAT TAT GAT CCA CCA CGA AAG CTT ATG GCC ATG CAG CGG CCA GGT CCT TAT GAC AGA CCT GGG GCT GGT AGA GGG TAT AAC AGC ATT GGC AGA GGA GCT GGC TTT GAG AGG ATG AGG CGT GGT GCT TAT GGT GGA GGC TAT GGA GGC TAT GAT GAT TAC AAT GGC TAT AAT GAT GGC TAT GGA TTT GGG TCA GAT AGA TTT GGA AGA GAC CTC AAT TAC TGT TTT TCA GGA ATG TCT GAT CAC AGA TAC GGG GAT GGT GGC TCT ACT TTC CAG AGC ACA ACA GGA CAC TGT GTA CAC ATG CGG GGA TTA CCT TAC AGA GCT ACT GAG AAT GAC ATT TAT AAT TTT TTT TCA CCG CTC AAC CCT GTG AGA GTA CAC ATT GAA ATT GGT CCT GAT GGC AGA GTA ACT GGT GAA GCA GAT GTC GAG TTC GCA ACT CAT GAA GAT GCT GTG GCA GCT ATG TCA AAA GAC AAA GCA AAT ATG CAA CAC AGA TAT GTA GAA CTC TTC TTG AAT TCT ACA GCA GGA GCA AGC GGT GGT GCT TAC GAA CAC AGA TAT GTA GAA CTC TTC TTG AAT TCT ACA GCA GGA GCA AGC GGT GGT GCT TAT GGT AGC CAA ATG ATG GGA GGC ATG GGC TTG TCA AAC CAG TCC AGC TAC GGG GGC CCA GCC AGC CAG CAG CTG AGT GGG GGT TAC GGA GGC GGC TAC GGT GGC CAG AGC AGC ATG AGT GGA TAC GAC CAA GTT TTA CAG GAA AAC TCC AGT GAT TTT CAA TCA AAC ATT GCA XXX
>ENST00000393432.9#HNRNPH1#13270| 13270 | CODON | QUERY
ATG ATG CTG GGC ACA GAA GGC AGG GAG GGT TTC GTG GTG AAG GTC AGG GGC CTA CCC TGG TCC TGC TCT GCC GAT GAA GTG ATG CGC TTC TTT TCT GAT TGC AAA ATC CAA AAT GGC ACA TCA GGT ATC CGT TTC ATC TAT ACC AGA GAA GGC AGA CCA AGT GGT GAA GCA TTT GTT GAA CTT GAA TCA GAA GAT GAA GTG AAA TTG GCT TTG AAG AAG GAC AGA GAA ACC ATG GGA CAC AGA TAT GTT GAA GTA TTC AAG TCC AAT AGT GTT GAA ATG GAT TGG GTA TTG AAG CAT ACA GGT CCG AAT AGT CCC GAT ACT GCC AAT GAT GGC TTC GTC CGT CTT CGA GGA CTC CCG TTT GGC TGT AGC AAG GAG GAG ATT GTT CAG TTT TTT TCA GGG CTG GAA ATT GTG CCA AAT GGG ATG ACA CTG CCG GTG GAC TTT CAG GGG CGG AGC ACA GGG GAG GCC TTT GTG CAG TTT GCT TCA CAG GAG ATA GCT GAA AAG GCC TTA AAG AAA CAC AAG GAA AGA ATA GGG CAC AGG TAC ATT GAA ATC TTT AAG AGT AGC CGA GCT GAA GTC CGA ACC CAC TAT GAC CCC CCT CGA AAG CTC ATG GCT ATG CAA CGA CCA GGT CCC TAT GAT AGG CCA GGG GCC GGC AGA GGG TAT AAT AGT ATT GGA AGA GGG ACT GGG TTT GAA AGG ATG AGG CGG GGT GCC TAT GGT GGA GGG TAT GGA GGC TAT GAT GAT TAT GGT GGC TAT AAT GAT GGC TAT GGC TTT GGG TCT GAT AGA TTT GGA AGA GAT CTC AAT TAC TGT TTT TCA GGA ATG TCT GAT CAT AGA TAC GGA GAT GGT GGG TCC AGT TTC CAA AGC ACC ACA GGG CAC TGT GTA CAC ATG AGG GGA TTA CCT TAC AGA GCT ACT GAA AAT GAC ATT TAC AAT TTT TTC TCA CCT CTT AAC CCC ATG AGA GTA CAC ATT GAA ATT GGA CCT GAT GGC AGA GTT ACT GGT GAG GCA GAT GTT GAA TTT GCT ACT CAT GAA GAT GCC GTG GCA GCT ATG GCA AAA GAT AAG GCT AAT ATG CAA CAC AGA TAT GTG GAG CTC TTC TTA AAT TCT ACT GCA GGA ACA AGT GGT GGG GCT TAT GAT CAC AGC TAT GTA GAA CTC TTT TTG AAT TCT ACA GCA GGG GCA AGT GGT GGT GCT TAT GGT AGC CAA ATG ATG GGA GGG ATG GGC TTA TCC AAC CAG TCT AGT TAT GGG GGT CCT GCT AGC CAG CAG CTG AGT GGT GGT TAC GGG GGT GGT TAT GGT GGT CAG AGC AGT ATG AGT GGA TAT GAC CAA GTT CTG CAG GAA AAT TCC AGT GAC TAT CAG TCA AAC CTT GCG XXX
FASTA headers contain the following fields, separated with a pipe-with-whitespaces (‘ | ’) delimiter:
- projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000393432.9#HNRNPH1#13270);
- chain_id: number of the chain the projection was annotated through (13270);
- CODON: keyword indicating that the sequence comes from the codon alignment file;
- source: indicates whether the sequence corresponds to reference transcript (REFERENCE) or query projection (QUERY).
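The header and the space-separated codons can be processed as sketched below. The helpers are hypothetical, and the sequences are only the first six codons of the example above. The header is split on the bare pipe and each field is stripped, because the example header has no space before its first pipe.

```python
def parse_codon_header(header: str) -> dict:
    """Split a codon alignment header into its four fields."""
    fields = [x.strip() for x in header.lstrip(">").split("|")]
    projection, chain_id, keyword, source = fields
    return {"projection": projection, "chain_id": chain_id,
            "keyword": keyword, "source": source}


def identical_codons(ref_seq: str, query_seq: str):
    """Count codon columns where reference and query are the same."""
    pairs = list(zip(ref_seq.split(), query_seq.split()))
    return sum(r == q for r, q in pairs), len(pairs)


codon_header = ">ENST00000393432.9#HNRNPH1#13270| 13270 | CODON | REFERENCE"
ref_codons = "ATG ATG TTG GGC ACG GAA"      # truncated example sequence
query_codons = "ATG ATG CTG GGC ACA GAA"    # truncated example sequence
print(parse_codon_header(codon_header))
print(identical_codons(ref_codons, query_codons))
```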
Multi-FASTA file listing the nucleotide sequence alignments per exon between reference and query exons. Compressed into gzip format by default. Reference-query exon pairs are presented projection-wise in the ascending exon number order. Exon numeration starts with 1 and follows exon order in the reference transcript. This means, exons merged in the query due to precise intron deletion are still presented as separate entries; likewise, exons split in the query due to intron gain are presented as single entities, with query intron sequence given in lowercase. Note that only coding sequence exons are considered, even if untranslated region annotation was not disabled.
>ENST00000270112.7#HUNK#58 | 9 | 58 | reference_exon
GCCTCTCTGGACACCTGGACACGAGATCTTGAATTCCATGCCGTGCAG
>ENST00000270112.7#HUNK#58 | 9 | 58 | scaffold_3:1435950-1435998 | 81.25 | 74.42 | scaffold_3:1435950-1435998 | INC | ORTHOLOG | query_exon
GCCTCCCTGGACGCCTGGACGCGGGACCTGGACTTCCCTGCCGTGCGG
FASTA headers contain the following fields, separated with a pipe-with-whitespaces (‘ | ’) delimiter:
- For reference exons:
  - projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000270112.7#HUNK#58);
  - exon: exon number (9);
  - chain_id: number of the chain the projection was annotated through (58);
  - reference_exon: keyword indicating that the following sequence is the reference exon.
- For query exons:
  - projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (ENST00000270112.7#HUNK#58);
  - exon: exon number (9);
  - chain_id: number of the chain the projection was annotated through (58);
  - coordinates: query coordinates given as scaffold:start-end (scaffold_3:1435950-1435998);
  - %id: %nucleotide identity between the reference and query exon (81.25);
  - %blosum: %BLOSUM score between the reference and query exon (74.42);
  - expected_coordinates: expected query coordinates, given as scaffold:start-end, based on the chain alignment (scaffold_3:1435950-1435998). Note: these can differ from the final annotated exon coordinates, as TOGA2 may shift splice sites or use additional flanking space for exon alignment. This field is mostly relevant for debugging purposes;
  - expected_locus: correspondence to the expected locus (INC). INC indicates that the final exon coordinates overlap the expected coordinates by at least one base; EXCL indicates otherwise;
  - orthology_status: orthology status (ORTHOLOG). Possible keywords are ORTHOLOG for orthologous projections, PARALOG for paralogous projections, and PROCESSED_PSEUDOGENE for processed pseudogenes/retrogenes;
  - query_exon: keyword indicating that the following sequence is the query exon.
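Reference and query headers have different field counts, so a parser can pick the layout from the trailing keyword. A minimal sketch; the dictionary keys `pct_id`, `pct_blosum`, and `kind` are names chosen here for illustration, and `parse_exon_header` is a hypothetical helper.

```python
REFERENCE_FIELDS = ("projection", "exon", "chain_id", "kind")
QUERY_FIELDS = (
    "projection", "exon", "chain_id", "coordinates", "pct_id", "pct_blosum",
    "expected_coordinates", "expected_locus", "orthology_status", "kind",
)


def parse_exon_header(header: str) -> dict:
    """Parse a reference or query exon header into a field dictionary."""
    values = [v.strip() for v in header.lstrip(">").split("|")]
    # The last field is the keyword telling the two layouts apart
    names = QUERY_FIELDS if values[-1] == "query_exon" else REFERENCE_FIELDS
    if len(values) != len(names):
        raise ValueError(f"unexpected number of header fields: {len(values)}")
    return dict(zip(names, values))


ref_exon = parse_exon_header(">ENST00000270112.7#HUNK#58 | 9 | 58 | reference_exon")
query_exon = parse_exon_header(
    ">ENST00000270112.7#HUNK#58 | 9 | 58 | scaffold_3:1435950-1435998 | 81.25 | "
    "74.42 | scaffold_3:1435950-1435998 | INC | ORTHOLOG | query_exon"
)
print(ref_exon["kind"], query_exon["pct_id"], query_exon["expected_locus"])
```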
Aggregated pairwise amino acid sequence alignments for annotated projections in FASTA format. Compressed into gzip format by default.
>NM_001252010.2#LUZP2#436 | PROT | REFERENCE
MKFSPAHYLLPLLPALVLSTR-QDYEELEKQLKEVFKERSTILRQLTKTSRELDGIKVNLQSLKNDEQSAKTDVQKLLELGQKQREEMKSLQEALQNQLKETSEKAEKHQATINFLKTEVERK-SKMIRDLQNE---AQQLTDLEQKLAVAKNELEKAA-LD-R-ESQMKAMKETV-QLCLTSVFRDQPPPPLSLITSNPTRMLLPPRNIASKLPDAAAKSK---PQQSASGNNESSQV-EST---KEGNPSTTACDSQD--EGR-PCSMKHKESPPSNATAETEPIPQ-KLQMPPCSECEVKKAPEKPLTSFEGMAAREEKIL*
>NM_001252010.2#LUZP2#436 | PROT | QUERY
MTCCPXLLILPLLQALVLSTSC----------K-CFPEKS--LKKLSNT---------NLK---------KQDNEKLV-LISRKC---------LKNE--EVKEK--KTQSGMILMATGLLRKVGRAV-DLTVEKKK-------EEELV------QKAAFLDNRG-----ATREMISQ---------ENNPPLNLIIEA---------GLIPKLVDFL---KEPREQQSS--QTKASQLTEQTLLRKE-NPGT-------NLERRILCXQMHFESCKAY----------SRSQ-----------APEHP------AEAKEDA--*
FASTA headers contain the following fields, separated with a pipe-with-whitespaces (‘ | ’) delimiter:
- projection: projection name in TOGA2 notation (input_transcript#chain_number(s)) (NM_001252010.2#LUZP2#436);
- PROT: keyword indicating that the sequence comes from the protein alignment file;
- source: indicates whether the sequence corresponds to reference transcript (REFERENCE) or query projection (QUERY).
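Since reference and query sequences are aligned column by column, a simple identity figure can be derived from a REFERENCE/QUERY pair. The sketch below counts matches over columns without gaps. This is one possible definition of identity, not necessarily the one TOGA2 reports, and `protein_identity` is a hypothetical helper. The sequences are the first 27 alignment columns of the example above.

```python
def protein_identity(ref_aln: str, query_aln: str):
    """Return (matches, compared columns), skipping columns with a gap."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    columns = [(r, q) for r, q in zip(ref_aln, query_aln) if r != "-" and q != "-"]
    matches = sum(r == q for r, q in columns)
    return matches, len(columns)


ref_prot = "MKFSPAHYLLPLLPALVLSTR-QDYEE"    # truncated example sequence
query_prot = "MTCCPXLLILPLLQALVLSTSC-----"  # truncated example sequence
matches, compared = protein_identity(ref_prot, query_prot)
print(f"{matches}/{compared} identical residues")
```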
A multi-FASTA file containing annotated query transcripts (coding sequences only). Compressed into gzip format by default.
>ENST00000315273.4#ASAP2#4707
ATGCCAGAACAGATCTCCGTGTCGGAATTCATAGCCGAGACCCTTGAGGACTACAAGGCGCCCACGGCCTATAGCTTCACCACGCGCACGGCCCAGTGCCGGGACACCATGTCGGCCATCGAGGAGGCCTTGGAGAAAATACTTAGTCTTACAACTCTCACGGGCGACGGCTTCAAGTTCCAATTTTTTGATGCCATTGTAAGTATGGGTGATCTACACAATAAATTGATTGACAAGAATTATAATGACTATAAAGAGACTTGCCAAGATTGAAGAATTCGAATCAATCCACACTTATCTCCTTGTACTAAGGTCAAATCTAAGTGGATCAAGGAACTTCATATAAAACCAGAGACACTGAAACTTATAGAGGAGAAAGTGGGGAAAAGCCTTGAAGATATGGGCACAGGGGAAAAATTCCTGAACAGAACAGCAATGGCTTGTGCTGTCTATGAAGTGCAGTGTGGGTATCCTGTACTTATGGTTATTATTCACTTTTTAGCACCACTCCAGCCATCTTTTAACAGAGCTACACTTGGGAAAACCATCATATTAAAAGACATTGTCAAAGTTCAAGAAGAAATGAAAGGTAATTTTCACGAGGGCCAAGATGCTCAGATTAAAAAATATTTTATGTTTGAGTCAATGCAGGGTTCTGGGCAGCCCCAGAGCAGCTTAGGACAGGTCAGGAAATTGAGGCTGAGAGGTAAGAAGAACCCTATCTCACTGGTTAGCTGTGTGTTGTCCCACTCAGAAGAGGTGGCTGAGCCATGTGAGACAGCTTACCACTCACTGGAGGCTTTCTGTAGGGCTCCTACCCTGCCCACAGAGGGCCTGGAGGCCCTTAAGGACACCAGACAGAACATTTCAGACCAGATGATTTATTCTATGATCCCAGCTGACTTTGTGAGGATGAAGAAGATGGACAACTGGCCAGCAAGTAACAGTGGGAGAATCAGTCCCAAGAGTTTCTCAAGACACACTGAGATAGGTGCTCAGTGCTCTCACCTTCTGTCTGTCTGTCTGTCTACCAGGCTGGCCTTGATCTCAGAAATCTGCCTGCCTCTGCCTCCCAAAGGCATTTCTTTTAAGTATATCGACGTTATTTTTTATTTATCTATTTTTAACTTGTCTGCAAAACATAAGGAAGCTGTTCAAGCCAGGGAATCAAGCCAGCAAAACAAAGATACATTAAGCAGGGAGAGGAGACTCAACCTCAGAGTTCTGGTGTGACAGCCATGAGAAGCAGCAGTACATGTTATCCAGCTCAATCAGAAAAAATCCTTTATGTGAAGGAAATGATATCCACATGGAGGAGGGGTTTCTTCCCCACCCCACCTCATCCAATCAGAATTCACACATGCTCCAGGCACAAGCTTAGGTCATTCACAGATAGGCTGACAGGTCTGGATCACGAGGAACTGTCCAGCCTGGCAGGCAGCCCACACATGCAGCCTCACTATGTGAAAAGTCACACTCTAGAAGGTTCTAGGTCCCTGTGTAGAAAGTGCCTGCCAATGTCTGCAGCACTGGCTTTTGTCAGAGACCATCCCTTGAGAAAGCAAGCCCTTCACAATCTCCCTAGGCAGGACTCTAACCCATTCTCCAAGACGATGTCCCTCCCTGCTGCCCCCATGCACCAAGATCTTAGTGCTGTCAAATCTAAGTGGATCAAGGAACTTCACATAAAACCAGAGACACTGAAACTTATAGAGGAGAAAGTGATTTGGCACCCTCACACAGACATACATACAAGAAAAACACCAATGCCTGTGAAATAA
Headers contain projection names in TOGA2 format (input_transcript#chain_number(s)).
Compressed and indexed file in UCSC 2bit format that contains all query transcript sequences.
Fasta headers are query projection identifiers. In the sequence line, all coding nucleotides are uppercase, with lowercase n symbols being used to delimit individual exon sequences. If two or more n symbols follow each other, the respective exon(s) were not aligned or were classified as missing/deleted by TOGA2.
Note that this is not aligned sequence; hence, there are no deletion symbols in this file.
>ENST00000313735.11#NOS2#2666
ATGTATGCTGCATGCTTTTTCTCCCTTTTATTGGAGGTTAACAGTTCTTT
CTCCTTCCAGTTnCAGGCTCAGTACTTCACGGGGTGTGGAAATGnACATC
ATGGCACAATCTTACCGTGAATACCATGTTCATTAAAAAGAAAGCAAAAT
...
Warning
This file is heavily optimised for downstream analysis. Do not use it for exon nucleotide sequence extraction unless you have reasons to do so and you know what you’re doing.
An extensive table of all transcripts/projections rejected at any point of TOGA2 pipeline. Columns are:
- level: entity level; TRANSCRIPT for reference transcripts/isoforms, PROJECTION for individual projections in the query;
- item: transcript/projection identifier;
- segment: technical field implemented for debugging purposes; will likely be removed in future releases;
- rejection_reason: a verbose explanation of the reasons for rejecting the entity;
- rej_id: a concise rejection reason identifier;
- loss_status: loss status inferred for the entity at the moment of rejection.
For a breakdown of rejection reasons, see Item rejection.
A short summary of orthology score prediction, gene loss analysis, query gene inference, and orthology classification steps. The same summary is printed to TOGA2 run log at the end of the pipeline. Consult it if you need a short reference on TOGA2 results.
####################################################################################################
#### TOGA2 run summary
####################################################################################################
Reference genome: test_input/hg38.micro_sample.2bit
Query genome: test_input/q2bit_micro_sample.2bit
Genome alignment chain file: test_input/align_micro_sample.chain
Reference annotation file: test_input/annot_micro_sample.bed
Output directory: micro_test_results
NOTE: Reference isoform file was not provided. All gene level statistics are presented assuming that each reference transcript represents a separate gene
####################################################################################################
Orthology prediction statistics:
Reference transcripts level::
#reference transcripts subjected to classification: 3
#reference transcripts with >=1 predicted ortholog: 3 (100.0%)
#reference transcripts with 1 predicted ortholog: 3 (100.0%)
#reference transcripts with no predicted orthologs: 0 (0.0%)
#reference transcripts with no classifiable projections: 0 (0.0%)
#predicted processed pseudogene/retrogene projections 0:
Reference genes level:
#reference genes with >=1 predicted ortholog: 3 (100.0%)
#reference genes with 1 predicted ortholog: 3 (100.0%)
#reference genes with no predicted orthologs: 0 (0.0%)
####################################################################################################
Gene loss summary statistics:
loss classes considered for assessing gene presence: FI,I,PI,UL
Query projections level:
#Fully Intact (FI) - 1 (33.333%)
#Intact (I) - 1 (33.333%)
#Partially Intact (PI) - 1 (33.333%)
#Uncertain Losses (UL) - 0 (0.0%)
#Lost (L) - 0 (0.0%)
#Missing (M) - 0 (0.0%)
#projections considered present: 3 (100.0%)
#projections considered lost/missing: 0 (0.0%)
Reference transcripts level:
#Fully Intact (FI) - 1 (33.333%)
#Intact (I) - 1 (33.333%)
#Partially Intact (PI) - 1 (33.333%)
#Uncertain Losses (UL) - 0 (0.0%)
#Lost (L) - 0 (0.0%)
#Missing (M) - 0 (0.0%)
#trancripts considered present: 3 (100.0%)
#transcripts considered lost/missing: 0 (0.0%)
Reference genes level:
#Fully Intact (FI) - 0 (33.333%)
#Intact (I) - 0 (33.333%)
#Partially Intact (PI) - 0 (33.333%)
#Uncertain Losses (UL) - 0 (0.0%)
#Lost (L) - 0 (0.0%)
#Missing (M) - 0 (0.0%)
#reference genes considered present: 3 (100.0%)
#reference genes considered lost/missing: 0 (0.0%)
####################################################################################################
Orthology resolution:
#reference genes: 3
#query genes: 3
#with defined orthology: 3 (100.0%)
#lost, missing, or lacking defined orthology: 0 (0.0%)
Reference gene orthology class composition:
one2one: 3 (100.0%)
one2many: 0 (0.0%)
many2one: 0 (0.0%)
many2many: 0 (0.0%)
one2zero: 0 (0.0%)