diterpene_classification

For my Theoretical Foundations of Machine Learning course, I was given the following task: In NMR experiments, molecules are placed in a strong magnetic field, which causes their atomic nuclei to resonate at characteristic frequencies. These frequencies can then be used to infer information about the molecules' chemical structures. The given dataset comprises 1503 spectra of diterpenes, divided into 23 classes according to their skeleton structure. The goal is to predict the class of a given spectrum.

For a more detailed description of the classification task and the data, please refer to the original paper by Dzeroski et al. (I have included the paper in the repository).

The task seems straightforward at first, but as my team and I quickly noticed, there is a catch. In this repository, I show one possible solution that I implemented, which gave pretty good results.

The catch: does order matter?

Each example in the dataset is described by:

expert-designed features (we ignore these),
an ID,
a number of resonance frequencies, each with its "multiplicity",
a class label.

Usually, in machine learning, we work with vectors $x \in \mathbb{R}^k$ with $k$ features, where:

order does matter: the $i$-th attribute $x_i$ of $x$ corresponds to the $i$-th attribute $x_i'$ of $x'$

In this dataset, the order of the multiplicity-frequency pairs $(m, f)$ does not matter!
Here $m \in \{s, d, t, q\}$ (singlet, doublet, triplet, quartet) and $f \geq 0$.
So the classifier $f(x)$ has to be order-independent, i.e., permutation-invariant:

$x = ((d, 53), (d, 52), (q, 72))$ and
$x' = ((q, 72), (d, 52), (d, 53))$
should be treated exactly the same: $f(x) = f(x')$

Basically, we should treat $x$ as a set and not as a sequence.
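To make the set-versus-sequence point concrete, here is a minimal sketch using the two example spectra from above. The helper name `as_multiset` is made up for illustration and is not taken from the notebook. A multiset (rather than a plain set) is used so that repeated pairs would still be counted.

```python
from collections import Counter

# The two example spectra from the text: same pairs, different order.
x = [("d", 53), ("d", 52), ("q", 72)]
x_prime = [("q", 72), ("d", 52), ("d", 53)]


def as_multiset(spectrum):
    """Order-free view of a spectrum (hypothetical helper).

    Counts how often each (multiplicity, frequency) pair occurs,
    so the position of a pair in the list no longer matters.
    """
    return Counter(spectrum)


# As sequences the two spectra differ ...
print(x == x_prime)
# ... but as multisets they are identical.
print(as_multiset(x) == as_multiset(x_prime))
```

Any model that only sees the order-free view is permutation-invariant by construction.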

(my) Solution

To solve this, I used a permutation-invariant kernel built on the RBF kernel. Specifically, for two spectra you compare all pairs of frequencies that share the same multiplicity, apply the RBF kernel to each pair, and sum the results. Then I trained an SVM with this kernel. See the code for more details! Note: this is not the full assignment, just the part I worked on. It contains some test code and is not fully documented.
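The idea described above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not a copy of the notebook: the names `set_kernel` and `gram_matrix`, the value of `gamma`, and the toy spectra are all hypothetical.

```python
import math


def set_kernel(a, b, gamma=0.1):
    """Permutation-invariant kernel between two spectra (sketch).

    Each spectrum is a list of (multiplicity, frequency) pairs.
    Every pair from `a` is compared with every pair from `b`;
    pairs with matching multiplicity contribute an RBF term,
    all other pairs contribute nothing.
    """
    total = 0.0
    for m_a, f_a in a:
        for m_b, f_b in b:
            if m_a == m_b:
                total += math.exp(-gamma * (f_a - f_b) ** 2)
    return total


def gram_matrix(spectra_a, spectra_b, gamma=0.1):
    """Kernel values between all spectra in two collections."""
    return [[set_kernel(a, b, gamma) for b in spectra_b] for a in spectra_a]


# Toy spectra, made up for illustration only.
x = [("d", 53.0), ("d", 52.0), ("q", 72.0)]
x_prime = [("q", 72.0), ("d", 52.0), ("d", 53.0)]  # same pairs, reordered
other = [("d", 50.0), ("s", 10.0)]

# Reordering the pairs leaves the kernel value unchanged (up to rounding).
print(math.isclose(set_kernel(x, other), set_kernel(x_prime, other)))
```

Because the sum runs over all pairs, the order of the pairs inside a spectrum cannot affect the result. A Gram matrix computed this way can be handed to an SVM that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.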

