Gene Information
- Gene information: gene_info.csv
- GTEx gene expression and disease relevance: gene_disease_gtex_tissue_expression.csv.gz
Filename: gene_info.csv
All the gene features you might need to build your prediction model are stored in this file. It contains the symbol, identifiers (gene locus and protein), type (e.g. protein-coding or not), cellular location, and protein class of each gene, as well as its Gene Ontology annotations.
Column name | Description |
---|---|
symbol | gene symbol |
hgnc_id | HGNC official identifier |
entrez_id | Entrez gene identifier |
ensembl_gene_id | ENSEMBL gene identifier |
uniprot_id | UniProt protein identifier (note that there can be several proteins for one gene) |
locus_type | type of this genomic locus |
locus_group | group classification for this genomic locus |
go_id | Gene Ontology (GO) term identifier |
go_label | Gene Ontology (GO) term name |
evidence_type | evidence type according to GO (please refer to the GO evidence code documentation) |
reported_count | how many times this type of evidence has been reported (useful for replicability) |
protein_class | ChEMBL druggable genome classification of the protein |
target_class | target class |
topology_type | topology information |
target_location | Cellular location |
ExAC_LoF | Resilient to Loss of Function according to ExAC |
pc_mouse_gene_identity | mouse ortholog |
GTEX_median_all_tissues | median expression across all GTEx tissues |
description | gene description |
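Because the table mixes gene-level columns with GO columns, a gene may appear on several rows, one per GO annotation. The following is a minimal sketch under that assumption (the long layout is not confirmed by the description above), using only the Python standard library. The function name `load_gene_features` and the choice of columns to keep are hypothetical.

```python
import csv
from collections import defaultdict


def load_gene_features(lines):
    """Collapse gene_info rows into one record per gene symbol.

    `lines` is any iterable of CSV text lines, e.g. an open file.
    Assumption: one row per (gene, GO term) pair, so gene-level
    columns repeat and only the GO columns vary between rows.
    """
    genes = {}
    go_terms = defaultdict(set)
    for row in csv.DictReader(lines):
        symbol = row["symbol"]
        # Keep the gene-level values from the first row seen for this gene.
        genes.setdefault(symbol, {
            "ensembl_gene_id": row["ensembl_gene_id"],
            "locus_group": row["locus_group"],
            "protein_class": row["protein_class"],
        })
        # Rows without a GO annotation contribute no term.
        if row["go_id"]:
            go_terms[symbol].add(row["go_id"])
    for symbol, record in genes.items():
        record["go_ids"] = sorted(go_terms[symbol])
    return genes
```

To use it on the real file, open it with `open("gene_info.csv", newline="")` and pass the handle to `load_gene_features`. The resulting dictionary has one entry per gene, with the GO identifiers gathered into a sorted list that can be turned into model features.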
Filename: gene_disease_gtex_tissue_expression.csv.gz
This compressed file relates two pieces of information that are important for building the prediction model:
- A) The relevant tissue for a disease from a systematic mining of the scientific literature (see this scientific report by Vinod Kumar and colleagues at GSK)
- B) The genes specifically expressed in the disease-affected tissue
Hence, you can combine tissue relevance and expression in your model to assess whether successful drug targets are also expressed at the protein level.
Column name | Description |
---|---|
entrez_id | Entrez gene identifier |
ensembl_gene_id | ENSEMBL gene identifier |
symbol | gene symbol |
disease_id | disease identifier |
disease_label | disease name |
tissue_label | tissue name as described in GTEx |
source | GTEx version 6 |
max_fold_change | gene expression fold change, reported when mRNA expression of this gene in the indicated tissue is at least 5-fold above the median tissue and within 5-fold of the highest-expressing tissue |
expression_score | normalised gene expression score for max_fold_change |
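The two thresholds in the `max_fold_change` description can be illustrated with a small sketch. It assumes that the fold change is the ratio of a tissue's expression level to the median level across tissues, which the table does not state explicitly. The function name `specific_tissues` and the tissue names in the usage note are hypothetical.

```python
from statistics import median


def specific_tissues(expression, fold=5.0):
    """Return {tissue: fold change over the median tissue} for tissues
    that pass both thresholds described for max_fold_change.

    `expression` maps tissue name -> mRNA expression level for one gene.
    Assumption: fold change = level / median level across all tissues.
    """
    if not expression:
        return {}
    levels = list(expression.values())
    med = median(levels)
    top = max(levels)
    if med <= 0:
        return {}  # fold change is undefined without a positive median
    result = {}
    for tissue, level in expression.items():
        above_median = level >= fold * med  # at least 5-fold above the median
        near_top = level * fold >= top      # within 5-fold of the highest tissue
        if above_median and near_top:
            result[tissue] = level / med
    return result
```

For example, with levels `{"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 100.0, "t5": 30.0}` the median is 3.0, so only `t4` and `t5` pass both tests; `t5` gets a fold change of 10.0.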
In the example below, the gene MUC7 is specifically expressed in the minor salivary gland.
```
gunzip -c gene_disease_gtex_tissue_expression.csv.gz | head -5
0,entrez_id,ensembl_gene_id,symbol,disease_id,disease_label,tissue_label,source,max_fold_change,expression_score
0,4589,ENSG00000171195,MUC7,EFO_0007383,Mumps virus infectious disease,Minor Salivary Gland,GTExv6,57385.21,0.99
1,4589,ENSG00000171195,MUC7,EFO_1000384,Mixed Tumor of the Salivary Gland,Minor Salivary Gland,GTExv6,57385.21,0.99
2,4589,ENSG00000171195,MUC7,EFO_0003826,salivary gland neoplasm,Minor Salivary Gland,GTExv6,57385.21,0.99
3,4589,ENSG00000171195,MUC7,EFO_1000344,Major Salivary Gland Carcinoma,Minor Salivary Gland,GTExv6,57385.21,0.99
```
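The same file can be read directly from Python without decompressing it first. The sketch below, using only the standard library, groups disease labels by gene and tissue so that the repeated rows seen above collapse into one entry. The function names `diseases_by_gene_tissue` and `load_gzipped` and the `min_score` filter are hypothetical additions for illustration.

```python
import csv
import gzip
from collections import defaultdict


def diseases_by_gene_tissue(lines, min_score=0.0):
    """Group disease labels by (gene symbol, tissue label).

    `lines` is an iterable of CSV text lines including the header.
    Rows whose expression_score is below `min_score` are skipped.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(lines):
        if float(row["expression_score"]) < min_score:
            continue
        key = (row["symbol"], row["tissue_label"])
        grouped[key].append(row["disease_label"])
    return dict(grouped)


def load_gzipped(path, min_score=0.0):
    """Read the gzip-compressed CSV at `path` in text mode and group it."""
    with gzip.open(path, "rt", newline="") as handle:
        return diseases_by_gene_tissue(handle, min_score)
```

Calling `load_gzipped("gene_disease_gtex_tissue_expression.csv.gz")` would then give, for the key `("MUC7", "Minor Salivary Gland")`, the list of diseases associated with that tissue.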
We can double-check that on the Open Targets portal: