This repository is dedicted to the investigation and analysis of spanning trees between DrugMechDB and RTX-KG2 (version 2.8.4c & 2.9.0c) knowledge graphs.
DrugMechDB is a drug mechanism database representing pathways through a manually curated knowledge graph. RTX-KG2 is a Biolink Model-compliant standardized knowledge graph.
-
Investigate Spanning Tree Variance
- Conduct and extract spanning tree search within 3-hop neighborhood in RTX-K2 using the aligned and normalized nodes (only canonical IDs).
- Analyze differences between KG2 and DrugMechDB trees.
-
Statistical Analysis
- Assess the heuristic approaches in RTX-KG2 and DrugMechDB and quantify the extent of heuristics in RTX-KG2.
The script contains the data location and instructions necessary to perform the analysis.
-
k-hop_spanningtree_script.ipynb
is Jupyter Notebook script for spannin tree analysis. -
The
data
folder contains the CSV file,sorted_graph_DrugMechDB.csv
, that was used as input(canonical IDs). -
The
image
folder contains the spanning tree result in graph visuals format in PDF file.
The concept of the approach is based on Steiner Tree. The goal is to determine a tree structure that minimizes edge count while maximizing node coverage, treating nodes as a collection (bag of nodes) without considering specific connections.
There are 9695 graphs in DrugMechDB (DMDB); refer to their indication_paths.yaml
file. To demonstrate our method we chose the graphs with longest paths, KG2.9.0c
(691 grpahs) because it yielded more than KG2.8.0c
.
For our purpose, we used Graph-tool's all_paths
function to find paths.
When examning 1-hop neighbhood path in KG2.9.0c
, the resulting trees were disjointed meaning there were disconnected components (isolated nodes, disjointed graphs). To compensate this challenge, we generated vertices (node) pair combinations. For example a graph with 20 nodes will result in 190 vertex pair combinations:
Then, we expanded our search up to 3-hop neighborhood identifying the links (intermediary) that connect the disjointed components. If the vertex pair has a path within 3-hop, it will return the pair; essentially you are drawing a direct edge between those nodes.
To conduct or reproduce the spanning tree analysis you will need KG2 version 2.9.0c. Contact RTX-KG2 team to request access to the knowledge graphs. Please refer to this respository to obtain node normalized and aligned. To run the Jupyter Notebook script, follow these steps:
- Clone the repository to your local machine
git clone https://github.com/your-username/KG2_SpanningTree_Analysis.git
-
Navigate to the project directory.
-
Import the necessary input files. The location of the data for each analysis is provided in the script
-
Open and run the script using the command below:
jupyter notebook <name_of_the_script>.ipynb
The final output file that contains the spanning tree should in YAML format.
Note: The input data should be canonical IDs. Additionally, the nodes must be normaized and aligned.
We recommend you have the following specifications to successfuly run the script and analysis.
- Operatting system(s): Linux Ubuntu
- Programming Language:
- Bash
- Python 3.8.12 or higher
- Other requirements:
- Graph-tool (version 2.68)
- NetworkX (version 3.0)
- Anaconda (version 24.1.2)
- Multiple CPU cores
- Gonzalez-Cavazos, A.C., Tanska, A., Mayers, M. et al. DrugMechDB: A Curated Database of Drug Mechanisms. Sci Data 10, 632 (2023). https://doi.org/10.1038/s41597-023-02534-z
- DrugMechDB: GitHub Repository
- Wood, E.C., Glen, A.K., Kvarfordt, L.G. et al. RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine. BMC Bioinformatics 23, 400 (2022). https://doi.org/10.1186/s12859-022-04932-3
- P. Peixoto T. The graph-tool python library [Internet]. figshare; 2014. Available from: https://figshare.com/articles/dataset/graph_tool/1164194/14