This repository has been archived by the owner on May 5, 2021. It is now read-only.

Gene Information

Gautier Koscielny edited this page Jun 29, 2018 · 6 revisions

Index of files

Gene information

Filename: gene_info.csv

All the gene features you might need to build your prediction model are stored in this file. For each gene it contains the symbol, identifiers (gene locus and protein), type (e.g. protein-coding or not), cellular location and protein class, as well as the gene's Gene Ontology annotations.

| Column name | Description |
| --- | --- |
| symbol | gene symbol |
| hgnc_id | HGNC official identifier |
| entrez_id | Entrez gene identifier |
| ensembl_gene_id | ENSEMBL gene identifier |
| uniprot_id | UniProt protein identifier (note that there can be several proteins for one gene) |
| locus_type | type of this genomic locus |
| locus_group | group classification for this genomic locus |
| go_id | Gene Ontology (GO) term identifier |
| go_label | Gene Ontology (GO) term name |
| evidence_type | evidence type according to GO (please refer to ) |
| reported_count | how many times this type of evidence has been reported (useful for replicability) |
| protein_class | ChEMBL druggable genome classification of the protein |
| target_class | target class |
| topology_type | topology information |
| target_location | cellular location |
| ExAC_LoF | Resilient to Loss of Function according to ExAC |
| pc_mouse_gene_identity | mouse ortholog |
| GTEX_median_all_tissues | median expression across all GTEx tissues |
| description | gene description |
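
As a starting point for feature building, the file can be read with Python's standard library alone. The sketch below is only an illustration: it assumes that gene_info.csv has a header row with the column names listed above and that a gene can appear on several rows (one per GO annotation), neither of which is confirmed on this page. The helper names `collect_go_terms` and `load_go_terms` are hypothetical.

```python
import csv
from collections import defaultdict


def collect_go_terms(rows):
    """Group GO term identifiers by gene symbol.

    `rows` is any iterable of dicts keyed by the column names above,
    for example a csv.DictReader over gene_info.csv.
    """
    go_terms = defaultdict(set)
    for row in rows:
        go_id = (row.get("go_id") or "").strip()
        if go_id:  # skip rows that carry no GO annotation
            go_terms[row["symbol"]].add(go_id)
    return dict(go_terms)


def load_go_terms(path="gene_info.csv"):
    # Assumption: the file is UTF-8 encoded and starts with a header row.
    with open(path, newline="", encoding="utf-8") as handle:
        return collect_go_terms(csv.DictReader(handle))
```

Calling `load_go_terms()` in the directory that holds the file would return a dictionary mapping each gene symbol to the set of its GO term identifiers, which can then be turned into model features.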

GTEx gene expression and disease relevance

Filename: gene_disease_gtex_tissue_expression.csv.gz

This compressed file relates two pieces of information that are important for building the prediction model: GTEx gene expression per tissue and disease relevance.

Hence, you can combine the tissue and expression in your model to assess whether successful drug targets are also expressed at the protein level.

| Column name | Description |
| --- | --- |
| entrez_id | Entrez gene identifier |
| ensembl_gene_id | ENSEMBL gene identifier |
| symbol | gene symbol |
| disease_id | disease identifier |
| disease_label | disease name |
| tissue_label | tissue name as described in GTEx |
| source | GTEx version 6 |
| max_fold_change | gene expression fold change (if mRNA expression in the indicated tissue for this gene is at least 5-fold above the median tissue and within 5-fold of the highest expression tissue) |
| expression_score | normalised gene expression score for max_fold_change |
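
The two thresholds in the max_fold_change description can be made concrete with a small sketch. It rests on one possible reading of the rule, which this page does not spell out: the fold change is the expression in a tissue divided by the median across all tissues, and a tissue is kept when that fold change is at least 5 and its expression is no less than one fifth of the highest value. The function name `specific_tissues` and the tissue values are made up for illustration.

```python
from statistics import median


def specific_tissues(expression_by_tissue, min_fold=5.0, max_gap=5.0):
    """Return {tissue: fold_change} for tissues passing both thresholds.

    Assumed reading of the rule: fold change = expression / median across
    tissues; a tissue passes if fold change >= min_fold and its expression
    is at least (highest expression / max_gap).
    """
    values = list(expression_by_tissue.values())
    if not values:
        return {}
    med = median(values)
    top = max(values)
    if med <= 0:
        return {}  # fold change is undefined here; this sketch gives up
    selected = {}
    for tissue, value in expression_by_tissue.items():
        fold = value / med
        if fold >= min_fold and value * max_gap >= top:
            selected[tissue] = fold
    return selected


# Made-up expression values, not taken from GTEx.
toy = {"tissue_a": 1.0, "tissue_b": 2.0, "tissue_c": 3.0,
       "tissue_d": 100.0, "tissue_e": 16.0}
print(sorted(specific_tissues(toy)))  # prints ['tissue_d']
```

In this toy case tissue_e is more than 5-fold above the median (16 / 3) but falls outside 5-fold of the highest value (100), so only tissue_d is kept.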

In the example below, the gene MUC7 is specifically expressed in the Salivary Gland.

gunzip -c gene_disease_gtex_tissue_expression.csv.gz | head -5
0,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,tissue_label,source,max_fold_change,expression_score
0,4589,ENSG00000171195,MUC7,EFO_0007383,Mumps virus infectious disease,Minor Salivary Gland,GTExv6,57385.21,0.99
1,4589,ENSG00000171195,MUC7,EFO_1000384,Mixed Tumor of the Salivary Gland,Minor Salivary Gland,GTExv6,57385.21,0.99
2,4589,ENSG00000171195,MUC7,EFO_0003826,salivary gland neoplasm,Minor Salivary Gland,GTExv6,57385.21,0.99
3,4589,ENSG00000171195,MUC7,EFO_1000344,Major Salivary Gland Carcinoma,Minor Salivary Gland,GTExv6,57385.21,0.99

We can double-check that on the Open Targets portal:

[Figure: MUC7 RNA expression]
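
To go beyond the first few lines printed by the gunzip command above, the compressed file can be read directly from Python without unpacking it. This is a minimal sketch using only the standard library. It assumes the header row shown in the example output and reads columns by name, so the leading index column is simply ignored. The names `tissues_by_gene`, `load_tissues_by_gene` and the `min_score` parameter are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def tissues_by_gene(rows, min_score=0.0):
    """Map gene symbol -> set of (tissue_label, disease_label) pairs.

    Rows whose expression_score is below `min_score` are dropped.
    """
    result = defaultdict(set)
    for row in rows:
        if float(row["expression_score"]) >= min_score:
            result[row["symbol"]].add((row["tissue_label"], row["disease_label"]))
    return dict(result)


def load_tissues_by_gene(path="gene_disease_gtex_tissue_expression.csv.gz",
                         min_score=0.0):
    # Mode "rt" opens the gzip stream as text so csv can read it directly.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return tissues_by_gene(csv.DictReader(handle), min_score)
```

With the rows shown above, the entry for MUC7 would pair Minor Salivary Gland with each of the listed diseases. The resulting dictionary can be joined to the gene features from gene_info.csv on the gene symbol or on ensembl_gene_id.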