Framework for Interpretable Neural Networks for Genetics
GenNet is a command line tool for creating neural networks for (mainly) genetics. GenNet lets you decide what should be connected to what, so any information that groups knowledge can be used to define connections in the network. For example, gene annotations can be used to group genetic variants into genes, as seen in the first layer of the image. This creates meaningful and interpretable connections. During training, the network learns which connections are important for the predicted phenotype and assigns these connections a higher weight. For more information about the framework and its interpretation, read the paper: GenNet framework: interpretable neural networks for phenotype prediction
The GenNet framework is based on TensorFlow and uses a custom layer.
Follow the instructions below to get started.
Tip
Check the A to Z Colab tutorial for an overview on how to use GenNet with your own data!
GenNet can run on CPU or GPU (a GPU can be quite a bit faster for deeper networks). If you want to use a GPU, please make sure you have the correct version of CUDA installed. GenNet has been tested for:
- Python 3.5, CUDA 9.1, Tensorflow 1.12.0
- Python 3.5, CUDA 10.0, Tensorflow 1.13.1
- Python 3.5, CUDA 10.0, Tensorflow 2.0.0-beta1
- Python 3.6-3.7, CUDA 10.1, Tensorflow 2.2.0 (currently default and recommended)
- Python 3.*, Tensorflow 2.2 to 2.5 (CPU)
Open a terminal. Navigate to the place where you want to store the project. Clone the repository:
git clone https://github.com/arnovanhilten/GenNet
Navigate to the home folder and create a virtual environment
cd ~
python3 -m venv env_GenNet
Installing the packages from requirements_GenNet.txt automatically installs the latest TensorFlow version for which GenNet has been tested. If you have an older version of CUDA, install the appropriate tensorflow-gpu version with
pip install tensorflow-gpu==1.13.1
(change 1.13.1 to your version).
Activate the environment
source ~/env_GenNet/bin/activate
Install the packages
pip3 install --upgrade pip
pip3 install -r requirements_GenNet.txt
GenNet is ready to use!
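To see which of the tested combinations listed earlier matches your environment, a quick check along the following lines can help. This is a minimal sketch, not part of GenNet: the function name `describe_environment` is made up for illustration, and the GPU query relies on `tf.config.list_physical_devices`, which only exists in newer TensorFlow 2.x releases, so the sketch falls back to an empty list when it is missing.

```python
import sys


def describe_environment():
    """Report the Python version, the TensorFlow version and the visible GPUs."""
    info = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "tensorflow": None,  # stays None when TensorFlow is not installed
        "gpus": [],
    }
    try:
        import tensorflow as tf
    except ImportError:
        return info

    info["tensorflow"] = tf.__version__
    # Older TensorFlow versions do not provide this function; skip the GPU
    # query in that case instead of failing.
    config = getattr(tf, "config", None)
    list_devices = getattr(config, "list_physical_devices", None)
    if list_devices is not None:
        info["gpus"] = [device.name for device in list_devices("GPU")]
    return info


print(describe_environment())
```

An empty `gpus` list means TensorFlow will fall back to the CPU, or that the installed version is too old for this particular query.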
Navigate to the GenNet folder and use the following command to run the example:
python GenNet.py train -path ./examples/example_classification/ -ID 1
Check the wiki for more info!
NOTE: In Python, indices start from zero.
As seen in the overview, the command line takes 3 inputs:
- genotype.h5 - a genotype matrix; each row is a sample/subject/patient, each column is a feature (i.e. a genetic variant). The genotype file can be generated automatically from PLINK and VCF files using `python GenNet.py convert`. Use `python GenNet.py convert --help` for more options, or check the HASE wiki page on convert.
- subject.csv - a .csv file with the following columns:
  - patient_id: an ID for each patient
  - labels: phenotype (zeros and ones for classification, continuous values for regression)
  - genotype_row: the row in which the subject can be found in the genotype matrix (genotype.h5 file)
  - set: the set to which the subject belongs (1 = training set, 2 = validation set, 3 = test set, others = ignored)
- topology - This file describes the whole network: each row should be a "path" through the network, from input to output node. The file thus defines every connection in the network, giving you the freedom to design the network the way you want. In the GenNet framework we used biological knowledge such as gene annotations to define meaningful connections, and we included some helper functions to generate a topology file using Annovar. See the topology help for more information:
python GenNet.py topology --help
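As an illustration of the subject.csv layout described above, the following minimal sketch writes such a file with Python's standard csv module. Only the four column names and the meaning of the set codes come from the description; the patient IDs, the labels, the 60/20/20 split and the helper names are made-up assumptions, and the sketch assumes that subjects appear in the same order as the rows of the genotype matrix.

```python
import csv
import io
import random

COLUMNS = ["patient_id", "labels", "genotype_row", "set"]


def make_subject_rows(patient_ids, labels, seed=0):
    """Build one row per subject; the 60/20/20 split is an illustrative choice."""
    n = len(patient_ids)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    n_train = (6 * n) // 10
    n_val = (2 * n) // 10

    set_of = {}
    for rank, idx in enumerate(order):
        if rank < n_train:
            set_of[idx] = 1  # training set
        elif rank < n_train + n_val:
            set_of[idx] = 2  # validation set
        else:
            set_of[idx] = 3  # test set

    # Assumption: subject i is stored in row i of the genotype matrix.
    return [
        {"patient_id": pid, "labels": label, "genotype_row": idx, "set": set_of[idx]}
        for idx, (pid, label) in enumerate(zip(patient_ids, labels))
    ]


def write_subject_csv(rows, handle):
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


patient_ids = ["P{:03d}".format(i) for i in range(10)]  # hypothetical IDs
labels = [i % 2 for i in range(10)]                     # hypothetical binary phenotype
rows = make_subject_rows(patient_ids, labels)

buffer = io.StringIO()  # replace with open("subject.csv", "w", newline="") to write a file
write_subject_csv(rows, buffer)
print(buffer.getvalue())
```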
Topology example:
layer0_node | layer0_name | layer1_node | layer1_name | layer2_node | layer2_name |
---|---|---|---|---|---|
0 | rs916977 | 0 | HERC2 | 0 | Ubiquitin mediated proteolysis |
1 | rs766173 | 1 | BRCA2 | 1 | Breast cancer |
5 | rs1799944 | 1 | BRCA2 | 1 | Breast cancer |
6 | rs4987047 | 1 | BRCA2 | 1 | Breast cancer |
1276 | SNP1276 | 612 | UHMK1 | 2 | Tyrosine metabolism |
NOTE: It is important to name the column headers as shown in the table.
The first genetic variant in the genotype file (index zero!), named rs916977, is connected to the HERC2 node in the first layer. The HERC2 gene is node number zero. This node is connected to the 'Ubiquitin mediated proteolysis' pathway, which is the first node in the following layer. The next node is the end node, which should not be included.
The second genetic variant, 'rs766173', is connected to BRCA2 (node number 1 in the first layer), followed by the breast cancer pathway (node number 1 in layer2), followed by the end node.
The sixth(!) genetic variant, 'rs1799944', is also connected to BRCA2 (which was node number 1 in the first layer), followed by the breast cancer pathway (again node number 1 in layer2), followed by the end node.
All rows together describe all the connections in the network. Each layer should be described by a column layer#_node and a column layer#_name with # denoting the layer number.
Tip: Check the topology files in the examples folder.
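To make the path-per-row idea concrete, the sketch below reads the example table and lists the unique connections between each pair of consecutive layers. It is an illustration only and not GenNet's own code: it assumes the topology is stored as comma-separated text, and the function name `layer_connections` is hypothetical.

```python
import csv
import io

# The topology example from the table above, written as comma-separated text.
TOPOLOGY_CSV = """layer0_node,layer0_name,layer1_node,layer1_name,layer2_node,layer2_name
0,rs916977,0,HERC2,0,Ubiquitin mediated proteolysis
1,rs766173,1,BRCA2,1,Breast cancer
5,rs1799944,1,BRCA2,1,Breast cancer
6,rs4987047,1,BRCA2,1,Breast cancer
1276,SNP1276,612,UHMK1,2,Tyrosine metabolism
"""


def layer_connections(rows, n_layers):
    """For each pair of consecutive layers, collect the unique (from_node, to_node) pairs."""
    connections = []
    for k in range(n_layers - 1):
        source, target = "layer%d_node" % k, "layer%d_node" % (k + 1)
        pairs = {(int(row[source]), int(row[target])) for row in rows}
        connections.append(sorted(pairs))
    return connections


rows = list(csv.DictReader(io.StringIO(TOPOLOGY_CSV)))
n_layers = sum(1 for column in rows[0] if column.endswith("_node"))
connections = layer_connections(rows, n_layers)

for k, pairs in enumerate(connections):
    print("layer%d -> layer%d: %s" % (k, k + 1, pairs))
```

Three variants share the BRCA2 node, so the three rows that mention BRCA2 and the breast cancer pathway collapse into a single connection between layer1 and layer2.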
Open the command line and navigate to the GenNet folder. Start training by:
python GenNet.py train -path {/path/to/your/folder} -ID {experiment number}
For example:
python GenNet.py train -path ./examples/example_classification/ -ID 1
or
python GenNet.py train -path ./examples/example_regression/ -ID 2 -problem_type regression
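When several experiments need to be started in a row, the same commands can be assembled from Python. The sketch below only uses the flags shown in the examples above (-path, -ID and -problem_type); the helper name `build_train_command` is hypothetical, and the call that would actually launch training is left commented out.

```python
import subprocess
import sys


def build_train_command(path, experiment_id, problem_type=None):
    """Assemble the argument list for one 'GenNet.py train' run."""
    command = [sys.executable, "GenNet.py", "train", "-path", path, "-ID", str(experiment_id)]
    if problem_type is not None:
        command += ["-problem_type", problem_type]
    return command


experiments = [
    ("./examples/example_classification/", 1, None),
    ("./examples/example_regression/", 2, "regression"),
]

commands = [build_train_command(*experiment) for experiment in experiments]
for command in commands:
    print(" ".join(command))
    # Uncomment to start training; run this from inside the GenNet folder.
    # subprocess.run(command, check=True)
```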
GenNet offers the following commands: convert, topology, train, plot and interpret. For the options, check the wiki or use:
python GenNet.py convert --help
python GenNet.py train --help
python GenNet.py plot --help
python GenNet.py topology --help
python GenNet.py interpret --help
After training, your network is saved together with its results. The results include a text file with the performance, a .CSV file with all the connections and their weights, a .h5 file with the weights that performed best on the validation set, and a plot of the training and validation loss. Using these files we can create visualizations to better understand the network.
The .CSV file with the weights can be used to create your own plots, but `python GenNet.py plot` also has standard plots available. First, the relative importance is calculated by multiplying all the weights between the output and each input. This can then be used to see the importance of each gene, or in a Sunburst plot to get an overview of the whole network!
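The multiplication of weights along a path can be sketched as follows. This is a toy illustration and not GenNet's implementation: the weights are invented, the in-memory layout of the paths is hypothetical (it does not mirror the .CSV file GenNet writes), and normalising by the sum of absolute products and summing per gene are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce
from operator import mul

# Hypothetical paths: (variant, gene, weights along the path from input to output).
paths = [
    ("rs916977", "HERC2", [0.5, 2.0, 1.0]),
    ("rs766173", "BRCA2", [1.0, 0.5, 2.0]),
    ("rs1799944", "BRCA2", [0.25, 0.5, 2.0]),
    ("rs4987047", "BRCA2", [-0.75, 0.5, 2.0]),
]


def relative_importance(paths):
    """Multiply the weights on each path, then scale so the absolute values sum to one."""
    raw = {variant: reduce(mul, weights, 1.0) for variant, _, weights in paths}
    total = sum(abs(value) for value in raw.values())
    return {variant: abs(value) / total for variant, value in raw.items()}


def gene_importance(paths, importance):
    """Sum the relative importance of all variants that feed into the same gene."""
    per_gene = defaultdict(float)
    for variant, gene, _ in paths:
        per_gene[gene] += importance[variant]
    return dict(per_gene)


importance = relative_importance(paths)
genes = gene_importance(paths, importance)
for gene, score in sorted(genes.items(), key=lambda item: -item[1]):
    print("%s: %.3f" % (gene, score))
```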
GenNet offers a number of interpretation methods to find important features and interacting features.
- get_weight_scores: uses the weights to calculate the importance of each feature and node
- DeepExplain: uses the gradient (see DeepExplain) to calculate the importance
- RLIPP: uses logistic regression with signals to and from the node to calculate a measure of non-linearity for all nodes
- NID: finds interacting features based on the features with the strongest weights
- DFIM: perturbs each input (or N inputs in the order of importance), and tracks which other features change importance to find interacting features
- PathExplain: uses the Expected Hessian to find interacting features
For more information use: python GenNet.py interpret --help
The original Jupyter notebooks can be found in the jupyter notebook folder. Navigate to the jupyter notebook folder and start with `jupyter notebook`. The notebooks are not updated but can be a useful source to understand the main code and/or to create .npz masks (to define connections between layers). For more freedom in designing your own networks, you can define your network here and create masks using the notebooks.
A to Z tutorial in Google Colabs, try GenNet with a single click!
GenNet is also available on Superbio.ai!
Try the online demo of the basic principles!
(Deprecated) Jupyter notebooks
For questions or comments, open an issue or send an email to: [email protected]