Predicting bacteria species based on repeated lossy measurements of DNA snippets.
Solution to Kaggle's Tabular Playground Series (Feb 2022) competition.
https://www.kaggle.com/c/tabular-playground-series-feb-2022/overview
The task is to classify 10 different bacteria species using data from a genomic analysis technique that involves some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts. In other words, the DNA segment ATATGGCCTT becomes A2T4G2C2: the order of the bases is lost and only their counts remain. We want to accurately predict the bacteria species starting from this lossy information.
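The compression step described above can be illustrated with a minimal sketch. The function name `compress_kmer` is hypothetical and is not taken from the notebook; it only reproduces the ATATGGCCTT example.

```python
from collections import Counter

BASES = "ATGC"  # base order used in the example A2T4G2C2


def compress_kmer(kmer: str) -> str:
    """Reduce a DNA snippet to its base-count histogram, discarding base order."""
    kmer = kmer.upper()
    unknown = set(kmer) - set(BASES)
    if unknown:
        raise ValueError(f"unexpected bases: {sorted(unknown)}")
    counts = Counter(kmer)  # missing bases count as 0
    return "".join(f"{base}{counts[base]}" for base in BASES)


print(compress_kmer("ATATGGCCTT"))  # A2T4G2C2
```

Any two snippets with the same base counts map to the same string, which is why the measurement is lossy.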
Each row of data contains a spectrum of histograms generated by repeated measurements of a sample. Every row holds the output for all 286 possible histograms (e.g., A0T0G0C10 to A10T0G0C0), from which a bias spectrum (that of totally random ATGC) has been subtracted. Both the train and test data also contain simulated measurement errors (of varying rates) for many of the samples, which makes the problem more challenging. The dataset is provided by Kaggle; the training set contains 200,000 rows, each with 286 dimensions.
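The figure of 286 can be checked by enumerating every way of splitting 10 bases among A, T, G and C. The sketch below also computes a bias value per histogram under the assumption that "totally random ATGC" means each base is drawn independently with probability 1/4 (a multinomial distribution); that interpretation, and the helper names, are assumptions rather than details confirmed by the source.

```python
from math import factorial

K = 10  # snippet length (10-mers)


def histogram_counts(k: int = K):
    """Yield every (a, t, g, c) tuple of non-negative counts summing to k."""
    for a in range(k + 1):
        for t in range(k + 1 - a):
            for g in range(k + 1 - a - t):
                yield a, t, g, k - a - t - g


def label(a: int, t: int, g: int, c: int) -> str:
    """Format counts in the style used above, e.g. A0T0G0C10."""
    return f"A{a}T{t}G{g}C{c}"


def uniform_bias(a: int, t: int, g: int, c: int) -> float:
    """Assumed bias: multinomial probability of the counts for uniform random bases."""
    k = a + t + g + c
    arrangements = factorial(k) // (
        factorial(a) * factorial(t) * factorial(g) * factorial(c)
    )
    return arrangements / 4**k


counts = list(histogram_counts())
labels = [label(*n) for n in counts]
print(len(labels), labels[0], labels[-1])  # 286 A0T0G0C10 A10T0G0C0
```

Under this assumption the bias values sum to 1 across the 286 histograms, so subtracting them centers each feature on what a purely random sequence would produce.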
After an initial analysis of the dataset, it is evident that we are working with high-dimensional data, which can make the learning process more difficult. We will start by trying to mitigate this problem with a dimensionality reduction technique such as PCA (Principal Component Analysis). Then we will use the transformed data to train different machine learning models in order to find the most appropriate one for the task.
After some testing using PCA with different numbers of components, it was found that the problem could be solved efficiently after reducing the 286 dimensions of the data to fewer than 10. Cross-validation was used to estimate both the best number of PCA components and the hyper-parameters of the tested ML models. The best model turned out to be a Random Forest, which achieved an accuracy of 0.88 on Kaggle's test set.
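The approach described above (PCA followed by a Random Forest, with both tuned by cross-validation) can be sketched with scikit-learn as follows. This is not the notebook's code: the data is a small synthetic stand-in generated with `make_classification`, and the grid values, the number of folds and the step names `pca` and `rf` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in with 286 features; the real data would come from train.csv.
X, y = make_classification(
    n_samples=300,
    n_features=286,
    n_informative=8,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# PCA sits inside the pipeline so it is refitted on each training fold only.
pipeline = Pipeline(
    [
        ("pca", PCA(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
    ]
)

# Hypothetical search grid: number of components and forest size.
param_grid = {
    "pca__n_components": [4, 8],
    "rf__n_estimators": [50, 100],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping PCA inside the pipeline means the number of components is selected jointly with the model's hyper-parameters, without the validation folds influencing the projection.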
-
What is a K-mer: https://en.wikipedia.org/wiki/K-mer
-
The idea for this competition came from the following paper:
Wood et al., 2020, "Analysis of Identification Method for Bacterial Species and Antibiotic Resistance Genes Using Optical Data From DNA Oligomers", Frontiers in Microbiology, https://www.frontiersin.org/article/10.3389/fmicb.2020.00257
The solution is provided as a Jupyter Notebook (.ipynb).
The dataset is available on the Kaggle website at the following URL:
https://www.kaggle.com/c/tabular-playground-series-feb-2022/data
After downloading it, create a folder "/kaggle/input/tabular-playground-series-feb-2022/" and place the two files "train.csv" and "test.csv" inside it.
I suggest opening the notebook file directly on the Kaggle website, where it's possible to use the dataset without downloading it.
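For a local run, the files can be read with pandas once they are in the folder named above. The helper `load_split` is a hypothetical name used for illustration; only the folder path and the two file names come from this description.

```python
import os

import pandas as pd

INPUT_DIR = "/kaggle/input/tabular-playground-series-feb-2022"


def load_split(name: str, input_dir: str = INPUT_DIR) -> pd.DataFrame:
    """Read '<input_dir>/<name>.csv', failing early if the file is missing."""
    path = os.path.join(input_dir, f"{name}.csv")
    if not os.path.exists(path):
        raise FileNotFoundError(f"expected the dataset file at {path}")
    return pd.read_csv(path)


# Only load when the folder exists, so the snippet is safe to run anywhere.
if os.path.isdir(INPUT_DIR):
    train = load_split("train")
    test = load_split("test")
    print(train.shape, test.shape)
```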
The notebook uses the following libraries:
- os
- numpy
- pandas
- sklearn
- tensorflow
- matplotlib.pyplot