This code extracts a dataset of compound-target pairs from the open-source bioactivity database ChEMBL [Zdrazil2023].
The compound-target pairs are known to interact because they meet at least one of the following criteria:
- they have at least one corresponding measured activity value in ChEMBL, or
- they are part of a set of manually curated known interactions in ChEMBL.
Furthermore, the dataset contains a number of compound and target annotations to enable future analyses.
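The two inclusion criteria amount to a set union over compound-target pairs. The following is a minimal sketch on a toy in-memory SQLite database. The table and column names (`measured_activities`, `curated_interactions`, `compound_id`, `target_id`, `activity_value`) are hypothetical and chosen for illustration; they do not reproduce the ChEMBL schema or the queries this code actually runs.

```python
import sqlite3

# Toy in-memory database with hypothetical tables and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measured_activities (compound_id INTEGER, target_id INTEGER, activity_value REAL);
    CREATE TABLE curated_interactions (compound_id INTEGER, target_id INTEGER);
    INSERT INTO measured_activities VALUES (1, 100, 6.5), (1, 200, NULL), (2, 100, 7.2);
    INSERT INTO curated_interactions VALUES (2, 100), (3, 300);
""")

# Criterion 1: pairs with at least one measured activity value.
measured = set(conn.execute("""
    SELECT DISTINCT compound_id, target_id
    FROM measured_activities
    WHERE activity_value IS NOT NULL
"""))

# Criterion 2: pairs from the manually curated known interactions.
curated = set(conn.execute(
    "SELECT DISTINCT compound_id, target_id FROM curated_interactions"
))

# A pair is kept if it satisfies either criterion.
pairs = sorted(measured | curated)
for compound_id, target_id in pairs:
    pair = (compound_id, target_id)
    print(compound_id, target_id, pair in measured, pair in curated)
```

In this toy data, the pair (1, 200) is dropped because its only activity row has no value, while (3, 300) is kept on the strength of the curated interaction alone.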
Previously, a similar dataset was curated manually and used to investigate target-based differences in drug-like properties and ligand efficiencies [Leeson2021]. This code can generate an extended version of that dataset for every ChEMBL version from ChEMBL 26 onwards.
[Zdrazil2023]: Zdrazil et al., "The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods", Nucleic Acids Research, gkad1004, 2023, https://doi.org/10.1093/nar/gkad1004
[Leeson2021]: Leeson et al., "Target-Based Evaluation of “Drug-Like” Properties and Ligand Efficiencies", Journal of Medicinal Chemistry, 64(11), 7210-7230, 2021, https://doi.org/10.1021/acs.jmedchem.1c00416
The dataset for different ChEMBL versions from ChEMBL 26 onwards is available here.
Install the required dependencies with
pip install .
Note: With pandas version 2.2, running the code produces warnings related to the RDKit PandasTools. The final dataset is not affected.
The default version of the dataset (the full dataset as a CSV file based on the newest ChEMBL version) can be generated by calling
python main.py -o <output_path>
To see an overview of the arguments that modify the output, call
python main.py --help
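Once the CSV file has been generated, it can be inspected with standard tools. The following is a minimal sketch using only the Python standard library. The file name `compound_target_pairs.csv` and the columns `compound_id` and `target_id` are hypothetical stand-ins, and the rows are made up so that the example runs on its own; check the header of the generated file for the real column names.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

# Stand-in for the generated file; column names and rows are hypothetical.
sample = "compound_id,target_id\nC1,T1\nC2,T1\nC3,T2\n"

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "compound_target_pairs.csv"  # hypothetical file name
    csv_path.write_text(sample, encoding="utf-8")

    # Stream the rows and count how many compounds are paired with each target.
    with csv_path.open(newline="", encoding="utf-8") as handle:
        pairs_per_target = Counter(row["target_id"] for row in csv.DictReader(handle))

print(pairs_per_target.most_common())
```

For a real run, replace the temporary file with the path passed to `-o`.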
The full documentation is available here.
The corresponding preprint is available here.