Giullar/BacteriaPrediction

Bacteria Prediction

Predicting bacteria species based on repeated lossy measurements of DNA snippets.

Solution to Kaggle's Tabular Playground Series (Feb 2022) competition.

https://www.kaggle.com/c/tabular-playground-series-feb-2022/overview

The Task

The task is to classify 10 different bacteria species using data from a genomic analysis technique that involves some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts. In other words, the DNA segment ATATGGCCTT becomes A2T4G2C2. We want to accurately predict the bacteria species from this lossy information.
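The lossy encoding described above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from the notebook; the function name `base_histogram` is hypothetical.

```python
from collections import Counter


def base_histogram(snippet: str) -> str:
    """Collapse a DNA snippet into its base-count label, e.g. 'A2T4G2C2'.

    The order of the bases in the snippet is discarded, which is what
    makes the measurement lossy.
    """
    counts = Counter(snippet.upper())
    unexpected = set(counts) - set("ATGC")
    if unexpected:
        raise ValueError(f"unexpected bases: {sorted(unexpected)}")
    # Counter returns 0 for bases that never occur in the snippet.
    return "".join(f"{base}{counts[base]}" for base in "ATGC")


print(base_histogram("ATATGGCCTT"))  # A2T4G2C2
```

Any two snippets with the same base counts map to the same label, so the original sequence cannot be recovered from it.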

The Data

We will predict bacteria species based on repeated lossy measurements of DNA snippets. Each row of data contains a spectrum of histograms generated by repeated measurements of a sample. A row holds the output for all 286 possible histograms (e.g., A0T0G0C10 to A10T0G0C0), from which a bias spectrum (that of totally random ATGC) has been subtracted. The data (both train and test) also contain simulated measurement errors (of varying rates) for many of the samples, which makes the problem more challenging. The dataset is provided by Kaggle; the training set contains 200,000 rows, each with 286 features.
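As a sanity check on the figure of 286, the sketch below enumerates every way to split 10 bases among A, T, G and C. It also computes a bias value for each histogram under the assumption that "totally random ATGC" means each base is drawn independently with probability 1/4, which gives a multinomial distribution. That reading of the bias spectrum is an assumption, and the names `all_histograms` and `bias` are hypothetical.

```python
from math import factorial

K = 10  # snippet length (10-mers)


def all_histograms(k: int = K):
    """Yield every (a, t, g, c) tuple of non-negative counts summing to k."""
    for a in range(k + 1):
        for t in range(k + 1 - a):
            for g in range(k + 1 - a - t):
                yield (a, t, g, k - a - t - g)


def bias(counts) -> float:
    """Multinomial probability of the counts for uniformly random bases.

    Assumption: the bias spectrum is the expected frequency of each
    histogram when every base is equally likely.
    """
    k = sum(counts)
    ways = factorial(k)
    for n in counts:
        ways //= factorial(n)  # exact integer division at every step
    return ways / 4 ** k


histograms = list(all_histograms())
print(len(histograms))  # 286
print(histograms[0], histograms[-1])  # (0, 0, 0, 10) (10, 0, 0, 0)
```

Under this assumption the bias values sum to 1 over the 286 histograms, and subtracting them from the measured frequencies leaves features centred on zero for a perfectly random sample.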

Solution Approach

An initial analysis of the dataset shows that we are working with very high-dimensional data, which can make the learning process more difficult. We start by mitigating this problem with a dimensionality reduction technique, PCA (Principal Component Analysis). We then use the transformed data to train different machine learning models in order to find the one best suited to the task.

Results

After testing PCA with different numbers of components, we found that the problem could be solved efficiently after reducing the 286 dimensions of the data to fewer than 10. Cross-validation was used to estimate both the best number of PCA components and the hyper-parameters of the tested ML models. The best model turned out to be a Random Forest, which achieved an accuracy of 0.88 on Kaggle's test set.
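The selection procedure described above could look roughly like the following scikit-learn sketch. It is not the notebook's code: the data are a small synthetic stand-in generated with `make_classification`, and the grid values are placeholders chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in (assumption): 286 features and 10 classes, like the
# real data, but far fewer rows so the search finishes quickly.
X, y = make_classification(
    n_samples=600,
    n_features=286,
    n_informative=12,
    n_classes=10,
    n_clusters_per_class=1,
    random_state=0,
)

# PCA and the classifier share one pipeline, so PCA is refit on the
# training part of every cross-validation fold.
pipeline = Pipeline([
    ("pca", PCA(random_state=0)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Placeholder grid: the number of components and the forest settings
# are tuned together.
param_grid = {
    "pca__n_components": [4, 8],
    "forest__n_estimators": [50],
    "forest__max_depth": [None, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping PCA inside the pipeline, rather than transforming the whole dataset up front, avoids leaking information from the validation folds into the projection.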

References

Installation and Dataset

The solution is provided as a Jupyter Notebook (.ipynb).

The dataset can be found on the Kaggle website at the following URL:

https://www.kaggle.com/c/tabular-playground-series-feb-2022/data

After downloading it, create a folder "/kaggle/input/tabular-playground-series-feb-2022/" and place the two files "train.csv" and "test.csv" inside it.

I suggest opening the notebook directly on the Kaggle website, where it's possible to use the dataset without downloading it.

Libraries

The notebook uses the following libraries:

  • os
  • numpy
  • pandas
  • sklearn
  • tensorflow
  • matplotlib.pyplot
