Bacteria Prediction

Predicting bacteria species based on repeated lossy measurements of DNA snippets.

Solution to Kaggle's Tabular Playground Series (Feb 2022) competition.

https://www.kaggle.com/c/tabular-playground-series-feb-2022/overview

The Task

The task is to classify 10 different bacteria species using data from a genomic analysis technique that involves some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts. In other words, the DNA segment ATATGGCCTT becomes A2T4G2C2. We want to accurately predict the bacteria species starting from this lossy information.
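As a small illustration of the lossy encoding described above, the following sketch collapses a DNA snippet into its base-count histogram. The function name `base_histogram` is hypothetical and is not taken from the notebook.

```python
from collections import Counter


def base_histogram(snippet: str) -> str:
    """Collapse a DNA snippet into its base-count label, e.g. A2T4G2C2.

    The order of the bases inside the snippet is discarded, which is
    exactly the information loss the task description refers to.
    """
    counts = Counter(snippet.upper())
    # Counter returns 0 for bases that never occur in the snippet.
    return "".join(f"{base}{counts[base]}" for base in "ATGC")


print(base_histogram("ATATGGCCTT"))  # A2T4G2C2
```

Any reordering of the same ten bases yields the same label, so distinct snippets can become indistinguishable after encoding.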

The Data

Each row of data contains a spectrum of histograms generated by repeated measurements of a sample: the output for all 286 possible histograms (e.g., A0T0G0C10 to A10T0G0C0), from which a bias spectrum (of totally random ATGC) has been subtracted. Both the training and test data also contain simulated measurement errors (at varying rates) for many of the samples, which makes the problem more challenging. The dataset is provided by Kaggle; the training set contains 200000 rows, each with 286 dimensions.
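The figure of 286 follows from counting the ways to split 10 bases among the four letters A, T, G and C. The sketch below enumerates them; the label format follows the examples in the text, while the variable names are illustrative only.

```python
from itertools import product
from math import comb

K = 10  # snippet length (10-mers)

# Pick counts for A, T and G; whatever remains of the K bases is C.
compositions = [
    (a, t, g, K - a - t - g)
    for a, t, g in product(range(K + 1), repeat=3)
    if a + t + g <= K
]
labels = [f"A{a}T{t}G{g}C{c}" for a, t, g, c in compositions]

print(len(labels))            # 286
print(labels[0], labels[-1])  # A0T0G0C10 A10T0G0C0

# Stars-and-bars check: C(K + 3, 3) ways to split K items into 4 bins.
assert len(labels) == comb(K + 3, 3)
```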

Solution Approach

An initial analysis of the dataset shows that we are working with very high-dimensional data, which can make the learning process more difficult. We start by mitigating this problem with a dimensionality reduction technique, PCA (Principal Component Analysis). We then use the transformed data to train different machine learning models in order to find the most appropriate one for the task.
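A minimal sketch of this approach with scikit-learn is shown below. It runs on synthetic data so that it is self-contained; the number of components, the choice of a random forest as the candidate model, and every parameter value are assumptions for illustration, not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the competition data: 286 features, 10 classes.
X, y = make_classification(
    n_samples=600,
    n_features=286,
    n_informative=20,
    n_classes=10,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Reduce the 286 features with PCA, then classify the projected data.
model = make_pipeline(
    PCA(n_components=8, random_state=0),  # assumed value
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Wrapping both steps in one pipeline keeps PCA fitted on the training split only, so no information from the held-out rows leaks into the projection.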

Results

After testing PCA with different numbers of components, we found that the problem can be solved efficiently after reducing the 286 dimensions of the data to fewer than 10. Cross-validation was used to select both the number of PCA components and the hyper-parameters of the tested ML models. The best model was a Random Forest, which achieved an accuracy of 0.88 on Kaggle's test set.
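One way to select the number of PCA components and the model hyper-parameters jointly by cross-validation is a grid search over a pipeline. The sketch below uses synthetic data, and the grid values and fold count are assumptions chosen to keep it fast; they are not the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the competition data: 286 features, 10 classes.
X, y = make_classification(
    n_samples=450,
    n_features=286,
    n_informative=20,
    n_classes=10,
    n_clusters_per_class=1,
    random_state=0,
)

pipeline = Pipeline(
    [
        ("pca", PCA(random_state=0)),
        ("forest", RandomForestClassifier(random_state=0)),
    ]
)

# Hypothetical search grid; "<step>__<param>" addresses a pipeline step.
param_grid = {
    "pca__n_components": [3, 6, 9],
    "forest__n_estimators": [50, 100],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

Because PCA sits inside the pipeline, it is refitted on each training fold, so the cross-validated score for every candidate reflects the whole preprocessing-plus-model chain.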

References

What is a K-mer: https://en.wikipedia.org/wiki/K-mer
The idea for this competition came from the following paper:
Wood et al., 2020, "Analysis of Identification Method for Bacterial Species and Antibiotic Resistance Genes Using Optical Data From DNA Oligomers", Frontiers in Microbiology, https://www.frontiersin.org/article/10.3389/fmicb.2020.00257

Installation and Dataset

The solution is provided as a Jupyter notebook (bacteria-prediction.ipynb).

The dataset can be found on the Kaggle website at the following URL:

https://www.kaggle.com/c/tabular-playground-series-feb-2022/data

After downloading it, create a folder "/kaggle/input/tabular-playground-series-feb-2022/" and place the two files "train.csv" and "test.csv" inside it.

I suggest opening the notebook directly on the Kaggle website, where it's possible to use the dataset without downloading it.
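With the files in that folder, the training data could be loaded roughly as follows. This is a sketch: the helper `load_train` is hypothetical, and the column names `target` and `row_id` are assumptions about the CSV layout that should be checked against the downloaded files.

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("/kaggle/input/tabular-playground-series-feb-2022")


def load_train(data_dir=DATA_DIR, target_column="target", id_column="row_id"):
    """Read train.csv and split it into a feature table and a label column.

    `target_column` and `id_column` are assumed names; adjust them if the
    downloaded file uses different headers.
    """
    train = pd.read_csv(Path(data_dir) / "train.csv")
    y = train[target_column]
    # errors="ignore" keeps this working if the id column is absent.
    X = train.drop(columns=[target_column, id_column], errors="ignore")
    return X, y


# Only attempt to load when the dataset is actually present.
if (DATA_DIR / "train.csv").exists():
    X, y = load_train()
    print(X.shape, y.nunique())
```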

Libraries

The notebook uses the following libraries:

os
numpy
pandas
sklearn
tensorflow
matplotlib.pyplot
