For my Theoretical Foundations of Machine Learning course, I was presented with the following task: In NMR experiments, molecules are placed in a strong magnetic field, which causes them to resonate at specific frequencies. These frequencies can then be used to infer information about the molecules' chemical structures. The given dataset comprises 1503 spectra of diterpenes, classified into 23 different classes according to their skeleton structure. The goal is to predict the class of a given spectrum.
For a more detailed description of the classification task and the data, please refer to the original paper by Dzeroski et al. (I have included the paper in the repository).
The task seems straightforward at first, but as my team and I quickly noticed, there is a catch. In this repository, I show one possible solution that I implemented, which achieved a pretty good result.
The data is described by:
- expert-designed features (we ignore these),
- an ID,
- a number of resonance frequencies, each with its "multiplicity",
- a class label.
Usually, in machine learning, we work with
-
order does matter:
the $i$-th attribute $x_i$ of $x$ corresponds to $x_i'$ of $x'$
On this dataset, the order of the multiplicity-frequency pairs does not matter.
So, the algorithm
-
$x = ((d, 53), (d, 52), (q, 72))$ and $x' = ((q, 72), (d, 52), (d, 53))$ - should be treated exactly the same:
$f(x) = f(x')$
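To make the requirement concrete, here is a small Python illustration using the two example spectra from above. It is only a sketch: the tuple representation and the variable names are assumptions chosen for illustration, not the format used in the dataset or in the repository code.

```python
from collections import Counter

# The two example spectra from the text: same peaks, different order.
x = (("d", 53), ("d", 52), ("q", 72))
x_prime = (("q", 72), ("d", 52), ("d", 53))

# Tuples are compared position by position, so the order matters here.
print(x == x_prime)

# A multiset (Counter) ignores the order of the pairs but keeps duplicates.
print(Counter(x) == Counter(x_prime))
```

The first comparison fails and the second succeeds, which is exactly the behaviour a classifier for this data needs: its output may depend on which pairs are present, but not on where they appear.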
Basically, we should treat
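To solve this, I used a permutation-invariant kernel built on the RBF kernel. Specifically, for two spectra you compare every pair of frequencies (one from each spectrum) that share the same multiplicity using the RBF kernel, and then sum up the results. Because the sum runs over all such pairs, reordering the pairs within a spectrum does not change the value. I then simply trained an SVM with this kernel. See the code for more details!

Note: this is not the full assignment, just the part I worked on. It contains some test code and is not fully documented.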
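The idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the function names `set_kernel` and `gram_matrix`, the value of `gamma`, and the toy spectra and labels are all assumptions made up for this example. It uses scikit-learn's `SVC` with `kernel="precomputed"`, which accepts a Gram matrix in place of feature vectors.

```python
import numpy as np
from sklearn.svm import SVC


def set_kernel(x, y, gamma=0.01):
    """Sum RBF kernel values over all cross pairs of peaks that share a multiplicity.

    x and y are sequences of (multiplicity, frequency) pairs.
    gamma is an assumed bandwidth, not a tuned value.
    """
    total = 0.0
    for mult_x, freq_x in x:
        for mult_y, freq_y in y:
            if mult_x == mult_y:
                total += np.exp(-gamma * (freq_x - freq_y) ** 2)
    return total


def gram_matrix(A, B, gamma=0.01):
    """Kernel values between every spectrum in A and every spectrum in B."""
    return np.array([[set_kernel(a, b, gamma) for b in B] for a in A])


# Toy spectra, invented for illustration only (not taken from the dataset).
spectra = [
    [("d", 53.0), ("d", 52.0), ("q", 72.0)],
    [("q", 72.0), ("d", 52.0), ("d", 53.0)],  # same peaks as above, reordered
    [("s", 140.0), ("t", 30.0), ("t", 31.0)],
    [("t", 29.0), ("s", 141.0), ("t", 32.0)],
]
labels = [0, 0, 1, 1]

# Train an SVM on the precomputed (n_train x n_train) Gram matrix.
K_train = gram_matrix(spectra, spectra)
clf = SVC(kernel="precomputed").fit(K_train, labels)

# Prediction needs the kernel between the query spectra and the training spectra.
query = [[("t", 31.0), ("t", 30.0), ("s", 139.0)]]
pred = clf.predict(gram_matrix(query, spectra))
print(pred)
```

Since the double loop visits every cross pair, shuffling the peaks within a spectrum only changes the order of the summands, so the kernel value stays the same up to floating-point rounding. The nested loops are quadratic in the number of peaks per spectrum, which is fine for a sketch but may be worth vectorising on the full dataset.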