Goal and Description

This project classifies the bioactivity of drug compounds from their Lipinski molecular descriptors. The target protein is the ABL tyrosine kinase. Mutations in the ABL kinase are associated with chronic myelogenous leukemia (CML). This is a binary classification problem in which the features (X) are the Lipinski molecular descriptors and the target vector (y) is the bioactivity of the compound, which is either active or inactive.

Data collection

Data were obtained from the ChEMBL database, which contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications. [Data as of March 25, 2020; ChEMBL version 26].

Data Preprocessing

Feature Engineering

The features (i.e., Lipinski molecular descriptors) were generated from the SMILES strings obtained from the ChEMBL database. Lipinski's rule states that, in general, an orally active drug has no more than one violation of the following criteria:

No more than 5 hydrogen bond donors (the total number of nitrogen–hydrogen and oxygen–hydrogen bonds)
No more than 10 hydrogen bond acceptors (all nitrogen or oxygen atoms)
A molecular mass less than 500 daltons
An octanol-water partition coefficient (log P) that does not exceed 5
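
The four criteria above can be sketched as a small check. This is a minimal illustration, not the notebook's code: the `LipinskiDescriptors` container, its field names, and the example values are assumptions made for this sketch, and the descriptor values are taken as already computed.

```python
from dataclasses import dataclass


@dataclass
class LipinskiDescriptors:
    # Hypothetical container for one compound's precomputed descriptors.
    mw: float             # molecular mass in daltons
    logp: float           # octanol-water partition coefficient (log P)
    num_h_donors: int     # hydrogen bond donors
    num_h_acceptors: int  # hydrogen bond acceptors


def count_lipinski_violations(d: LipinskiDescriptors) -> int:
    """Count how many of the four listed criteria are violated."""
    violations = 0
    if d.num_h_donors > 5:       # no more than 5 donors
        violations += 1
    if d.num_h_acceptors > 10:   # no more than 10 acceptors
        violations += 1
    if d.mw >= 500:              # mass must be less than 500 daltons
        violations += 1
    if d.logp > 5:               # log P must not exceed 5
        violations += 1
    return violations


def passes_rule_of_five(d: LipinskiDescriptors) -> bool:
    """Lipinski's rule tolerates at most one violation."""
    return count_lipinski_violations(d) <= 1


# Illustrative values only, not taken from the dataset.
example = LipinskiDescriptors(mw=620.0, logp=5.8, num_h_donors=3, num_h_acceptors=8)
print(count_lipinski_violations(example), passes_rule_of_five(example))
```

The example compound breaks the mass and log P criteria, so it has two violations and fails the rule.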

Labeling compounds/drugs as either being active, inactive or intermediate

The bioactivity data are IC50 values in nM. Compounds with values below 1,000 nM are considered active, and those with values above 10,000 nM are considered inactive. Compounds with values between 1,000 and 10,000 nM are referred to as intermediate.
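
The thresholds above translate into a simple labeling function. The sketch below is illustrative: the function names are hypothetical, and dropping the intermediate class to obtain a binary problem is an assumption, since the README does not state how intermediates are handled.

```python
def label_bioactivity(ic50_nm: float) -> str:
    """Map an IC50 value in nM to a bioactivity class."""
    if ic50_nm < 1000:
        return "active"
    if ic50_nm > 10000:
        return "inactive"
    # Values from 1,000 to 10,000 nM inclusive fall in between.
    return "intermediate"


def binary_labels(ic50_values):
    """Keep only active/inactive compounds (assumed handling of intermediates)."""
    labeled = [(value, label_bioactivity(value)) for value in ic50_values]
    return [(value, label) for value, label in labeled if label != "intermediate"]


# Illustrative IC50 values in nM, not taken from the dataset.
print([label_bioactivity(v) for v in (12.5, 1000, 4200, 10000, 25000)])
```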

Exploratory Data Analysis (EDA)

Machine-Learning Models

I used 8 different machine learning classifiers for drug effectiveness classification:

K-Nearest Neighbors (kNN)
Logistic regression
Decision Tree
Random Forest
Gradient Boosting
Support Vector Machine (SVM)
Neural Network (Multilayer Perceptron, MLP)
XGBoost
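
The models are compared using 10-fold cross validation (see Conclusion). As a rough sketch of what that involves, the standard-library code below builds fold indices for any of the classifiers listed above. Stratifying by class, the function name, and the seed are assumptions for illustration; the notebooks may use a library splitter instead.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs with the class mix roughly preserved."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar share of each class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Illustrative labels only: 30 actives and 20 inactives.
labels = ["active"] * 30 + ["inactive"] * 20
for train_idx, test_idx in stratified_kfold_indices(labels):
    pass  # fit a classifier on train_idx and score it on test_idx
```

Each compound appears in exactly one test fold, so averaging the ten scores gives one performance estimate per classifier.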

Results

Statistical analysis | Mann-Whitney U Test

All 4 Lipinski descriptors exhibited a statistically significant difference between the actives and inactives.
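
For readers unfamiliar with the test, the sketch below shows how the Mann-Whitney U statistic can be computed for one descriptor split into actives and inactives. It is a simplified illustration, not the analysis code: the p-value uses the normal approximation without tie or continuity correction (an assumption made for brevity), and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U1, U2, two-sided p) using a plain normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:n1])              # rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) / 2       # pairs where x exceeds y (ties count half)
    u2 = n1 * n2 - u1
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (min(u1, u2) - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, u2, p


# Illustrative values only, not taken from the dataset.
u1, u2, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u1, u2, round(p, 3))
```

With samples this small the normal approximation is crude; it is more reasonable for the larger groups found in a bioactivity dataset.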

Conclusion

All models seem to provide decent performance based on 10-fold cross validation of the dataset. Gradient Boosting seems to provide the best performance.
The Neural Network achieves the highest score in predicting both classes.
Feature selection suggests that NumHDonors and MW are the most crucial factors for the successful prediction of the bioactivity of the drug.


Vikasdubey0551/ML-drug-effectiveness-abl-kinase
