Chemical Effect Prediction Using Chemical Fingerprints.

This repository aims at predicting chemical effects in the T.E.S.T. datasets (U.S. EPA).

Problem

Each dataset contains training and testing data, where each sample is a CAS number and a chemical concentration.

Methods

We use three models:

Similarity model. The prediction is the average label of the k closest (most similar) chemicals in the training set. Hyperparameters: k.
FDA model. The training data is grouped into k clusters. During prediction, a model is fitted to the relevant cluster and then used to make the prediction. Hyperparameters: k, clustering model, prediction model, model parameters.
Ensemble model (auto-sklearn). This library searches for the best ensemble model for the given problem within a time budget. Hyperparameters: optimization time (total and per model).
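
As an illustration of the similarity model, here is a minimal sketch of a k-nearest-neighbour average. It assumes fingerprints are represented as sets of on-bit indices and that similarity is measured with the Tanimoto (Jaccard) coefficient. Both are assumptions made for this example, and the function names `tanimoto` and `predict_knn` are hypothetical. The code in models.py may differ.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def predict_knn(query, train_fps, train_labels, k=3):
    """Average the labels of the k training chemicals most similar to query."""
    # Rank training samples by similarity to the query, most similar first.
    ranked = sorted(
        zip(train_fps, train_labels),
        key=lambda pair: tanimoto(query, pair[0]),
        reverse=True,
    )
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy data (made up for illustration, not taken from the T.E.S.T. datasets).
train_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
train_labels = [2.0, 4.0, 10.0]
prediction = predict_knn({1, 2, 3, 4}, train_fps, train_labels, k=2)
```

With k=2, the two fingerprints that share bits with the query are selected and their labels averaged. A larger k smooths the prediction at the cost of including less similar chemicals.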

We report R2 scores. See the table at the bottom for an excerpt.
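
For reference, the R2 score (coefficient of determination) is usually defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true labels from their mean. The sketch below computes it in plain Python. The function name `r2` is hypothetical, and the repository may well rely on a library implementation instead.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    # Note: ss_tot is zero when all true labels are identical,
    # in which case the score is undefined (division by zero here).
    return 1.0 - ss_res / ss_tot


perfect = r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # every prediction exact
baseline = r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predict the mean
```

A perfect model scores 1, a model that always predicts the mean of the true labels scores 0, and a model worse than that baseline scores below 0.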

Installation

virtualenv env -p python3 
source env/bin/activate
pip3 install -r req.txt

Usage

usage: fingerprint_learning.py [-h] [-d DATASETS [DATASETS ...]] [-c CONFIG]
                               [--cv] [--fp] [--sd] [--p]

Chemical Effect Prediction Using Chemical Fingerprints.

optional arguments:
  -h, --help            show this help message and exit
  -d DATASETS [DATASETS ...], --datasets DATASETS [DATASETS ...]
                        Datasets (if empty: all datasets)
  -c CONFIG, --config CONFIG
                        Config file. See LC50_config.txt for example
  --cv                  Run Cross Validation
  --fp                  Fetch fingerprints from PubChem, and save to txt for
                        faster execution later.
  --sd                  Scale labels.
  --p                   Predict mode, will output file dataset_prediction.txt.
                        Provide test file with fingerprints/CIDs and labels
                        (these are ignored).

See the data folder for dataset options (specify only the name, e.g. LC50, not LC50_train.csv). See the config folder for example configurations. The cross-validation process creates a new optimal configuration. The regression labels can be scaled to between 0 and 1, which can be favourable for certain models. During prediction, the program gathers training and test data in the same way as before, but the test labels are ignored. If the training and testing files have the columns (CID,label), use the '--fp' flag to fetch fingerprints from the PubChem API.
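
The label scaling mentioned above can be pictured as min-max scaling. The sketch below is one way to do it, assuming the minimum and maximum are taken from the training labels only. The exact method behind the '--sd' flag is not documented here, so treat this as an assumption. The helper names `fit_minmax`, `scale` and `unscale` are hypothetical.

```python
def fit_minmax(labels):
    """Return the (min, max) of the training labels."""
    return min(labels), max(labels)


def scale(labels, lo, hi):
    """Map labels linearly so that lo -> 0 and hi -> 1."""
    span = hi - lo
    return [(y - lo) / span for y in labels]


def unscale(scaled, lo, hi):
    """Invert scale() to get predictions back in the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Toy labels (made up for illustration).
train_labels = [2.0, 4.0, 6.0]
lo, hi = fit_minmax(train_labels)
scaled = scale(train_labels, lo, hi)
restored = unscale(scaled, lo, hi)
```

Predictions made on scaled labels need to be passed through the inverse mapping before they are compared with concentrations in the original units.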

Example

To run cross validation on the LC50 datasets (fathead minnow and Daphnia magna), run:

python3 fingerprint_learning.py -d LC50 LC50DM --cv

