The goal of this task was to make decisions about food similarity based on images and human judgments. The dataset consists of 10,000 dish images, a sample of which is shown below.
Together with the image dataset, a set of triplets (A, B, C) is provided, representing the human annotations: the annotator judged the taste of dish A to be more similar to the taste of dish B than to the taste of dish C. A sample of such triplets is shown below.
The task is to train a neural network that, given a previously unseen image triplet (A, B, C), predicts whether dish A tastes more similar to dish B than to dish C.
The solution is based on the Siamese neural network architecture, inspired by the approaches of Abbas and Moser (2021) and Wang et al. (2014). The network consists of three identical convolutional neural networks, each of which takes one image of the triplet as input. These three networks serve as feature extractors and share their weights; each is based on the pre-trained ResNet-18 model (He et al., 2015) with the final layer replaced to produce a 1024-dimensional embedding.
For training, we split the dataset into training and validation sets (90/10) and used the triplet loss function. For a triplet (A, B, C), the loss is

L(A, B, C) = max( ||f(A) − f(B)||² − ||f(A) − f(C)||² + α, 0 )

where f denotes the shared feature extractor, ||·|| is the Euclidean distance between embeddings, and α is the margin hyperparameter. The loss drives the embedding of A closer to that of B than to that of C by at least the margin.
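The loss and the resulting decision rule can be sketched in plain Python. The margin value and the use of squared Euclidean distance are assumptions for illustration; the report does not state its hyperparameters.

```python
def squared_distance(u, v):
    # Squared Euclidean distance between two embedding vectors.
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_loss(fa, fb, fc, margin=1.0):
    # max(d(A,B) - d(A,C) + margin, 0): the loss is zero once the
    # positive B is closer to the anchor A than the negative C is,
    # by at least the margin.
    return max(squared_distance(fa, fb) - squared_distance(fa, fc) + margin, 0.0)


def predict(fa, fb, fc):
    # Inference: 1 if dish A is judged more similar to B than to C.
    return 1 if squared_distance(fa, fb) < squared_distance(fa, fc) else 0
```

For example, with embeddings fa = [0, 0], fb = [1, 0], fc = [3, 0], the positive distance is 1 and the negative distance is 9, so the loss is max(1 − 9 + 1, 0) = 0 and the prediction is 1.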
- Benhur, S. (2020). A Friendly Introduction to Siamese Networks. Towards Data Science.
- Wang, J. et al. (2014). Learning Fine-Grained Image Similarity with Deep Ranking.
- Abbas, A., Moser, B. (2021). Siamese Network Training Using Artificial Triplets by Sampling and Image Transformation.
- He, K. et al. (2015). Deep Residual Learning for Image Recognition.