![Fancy logo](https://github.com/EBjerrum/scikit-mol/raw/30c74b3648c0087bdb1b659bc67ba757d7498e9a/ressources/logo/ScikitMol_Logo_LightBG_300px.png?raw=true)
The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.
As an example, with the needed scikit-learn and scikit-mol imports and with RDKit mol objects in the `mol_list_train` and `mol_list_test` lists:
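```python
from rdkit import Chem
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from scikit_mol.fingerprints import MorganFingerprintTransformer

pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
pipe.fit(mol_list_train, y_train)
pipe.score(mol_list_test, y_test)
pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
# array([4.93858815])
```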
The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
The first draft of the project was created at the RDKit UGM 2022 hackathon on 14 October 2022.
Users can install the latest tagged release from PyPI with pip:

```sh
pip install scikit-mol
```

or from conda-forge:

```sh
conda install -c conda-forge scikit-mol
```

The conda-forge package should be updated shortly after a new tagged release on PyPI.

For the bleeding edge, install directly from the GitHub repository:

```sh
pip install git+https://github.com/EBjerrum/scikit-mol.git
```
A collection of notebooks in the notebooks directory demonstrates different aspects and use cases:
- Basic Usage and fingerprint transformers
- Descriptor transformer
- Pipelining with Scikit-Learn classes
- Molecular standardization
- Sanitizing SMILES input
- Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer
- Using parallel execution to speed up descriptor and fingerprint calculations
- Using skopt for hyperparameter tuning
- Testing different fingerprints as part of the hyperparameter optimization
- Using pandas output for easy feature importance analysis and for combining pre-existing values with new computations
- Working with pipelines and estimators in safe inference mode to handle prediction on batches with invalid SMILES or molecules
We have also published a software note on ChemRxiv: https://doi.org/10.26434/chemrxiv-2023-fzqwd
Scikit-Mol has been featured in blog posts and used in research. Some examples are listed below:
- Useful ML package for cheminformatics iwatobipen.wordpress.com
- Boosted trees Data_in_life_blog
- Konnektor: A Framework for Using Graph Theory to Plan Networks for Free Energy Calculations
- Moldrug algorithm for an automated ligand binding site exploration by 3D aware molecular enumerations
- RandomNets Improve Neural Network Regression Performance via Implicit Ensembling
- WAE-DTI: Ensemble-based architecture for drug–target interaction prediction using descriptors and embeddings
- Data Driven Estimation of Molecular Log-Likelihood using Fingerprint Key Counting
- AUTONOMOUS DRUG DISCOVERY
Help wanted! Are you a PhD student who wants a "side-quest" to procrastinate on your thesis writing, or are you simply interested in computational chemistry, cheminformatics, QSAR modelling, Python programming or open-source software? Do you want to learn more about machine learning with Scikit-Learn? Or do you use scikit-mol in your current work and would like to give a little back to the project and see it improved as well? With a little bit of help, this project can be improved much faster! Reach out to me (Esben) for a discussion about how we can proceed.
Currently we are working on fixing some deprecation warnings. It's not the most exciting work, but a little maintenance is important. Later on we need to go over the scikit-learn compatibility and update to some of the newer features of their estimator classes. We're also brewing some feature enhancements and tests, such as new fingerprints and a more versatile standardizer.
There is more information about how to contribute to the project in CONTRIBUTING.
There are probably still bugs; please check the issues on GitHub and report any you find there.
Scikit-Mol has been developed as a community effort with contributions from people from many different companies, consortia, foundations and academic institutions.
Cheminformania Consulting, Aptuit, BASF, Bayer AG, Boehringer Ingelheim, Chodera Lab (MSKCC), EPAM Systems, ETH Zürich, Evotec, Johannes Gutenberg University, Martin Luther University, Odyssey Therapeutics, Open Molecular Software Foundation, Openfree.energy, Polish Academy of Sciences, Productivista, Simulations-Plus Inc., University of Vienna
- Esben Jannik Bjerrum @ebjerrum
- Carmen Esposito @cespos
- Son Ha
- Oh-hyeon Choung
- Andreas Poehlmann, @ap--
- Ya Chen, @anya-chen
- Anton Siomchen @asiomchen
- Rafał Bachorz @rafalbachorz
- Adrien Chaton @adrienchaton
- @VincentAlexanderScholz
- @RiesBen
- @enricogandini
- @mikemhenry
- @c-feldmann