Before you get started, please head over to Moodle and download the theses.tsv (tab-separated values) data set, which contains about 3000 thesis titles along with their type (diploma, bachelor, master) and category (internal/external).
Here are two examples:
1995 extern Diplom Analyse und Leistungsvergleich von zwei Echtzeitsystemen für eingebettete Anwendungen
1995 intern Diplom Erfassung und automatische Zuordnung von Auftragsdaten für ein Dienstleistungsunternehmen mit Hilfe von Standardsoftware - Konzeption und Realisierung
As you can see, the format is
date<tab>{intern,extern}<tab>{Diplom,Bachelor,Master}<tab>Title...
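For reference, a minimal loader for this format might look like the sketch below; the function name, the dictionary keys and the assumption that the file is UTF-8 encoded without a header row are mine, not part of the assignment.

```python
# Sketch of a theses.tsv loader; the dictionary keys mirror the columns described
# above (year, intern/extern, thesis type, title) and are an assumed naming.
import csv

def load_theses(path="theses.tsv"):
    rows = []
    with open(path, encoding="utf-8") as f:
        for fields in csv.reader(f, delimiter="\t"):
            if len(fields) < 4:
                continue  # skip malformed or empty lines
            year, category, kind = fields[0], fields[1], fields[2]
            title = "\t".join(fields[3:])  # titles should not contain tabs, but just in case
            rows.append({"year": year, "category": category, "type": kind, "title": title})
    return rows
```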
Tomas Mikolov's original paper for word2vec is not very specific on how to actually compute the embedding matrices. Xin Rong provides a much more detailed walk-through of the math; I recommend you go through it before you continue with this assignment.
Now, while the original implementation was in C and estimated the matrices directly, in this assignment we want to use pytorch (and autograd) to train the matrices. There are plenty of example implementations and blog posts out there that show how to do it; I particularly recommend Mateusz Bednarski's version.
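To give an idea of how this can look, here is a minimal skip-gram sketch with a full softmax and full-batch training for brevity (in practice you would mini-batch and perhaps use negative sampling); the class and function names, embedding dimension, learning rate and epoch count are all assumptions on my part.

```python
# Sketch of skip-gram training with pytorch/autograd, following the two-matrix
# formulation (input and output embeddings). Hyper-parameters are illustrative.
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, emb_dim)         # "input" matrix W
        self.out = nn.Linear(emb_dim, vocab_size, bias=False)   # "output" matrix W'

    def forward(self, center):                  # center: (batch,) word indices
        return self.out(self.in_emb(center))    # unnormalised scores over the vocabulary

def skipgram_pairs(tokenised_titles, word2idx, window=5):
    """Yield (center, context) index pairs over a symmetric window."""
    for tokens in tokenised_titles:
        idx = [word2idx[t] for t in tokens]
        for i, center in enumerate(idx):
            for j in range(max(0, i - window), min(len(idx), i + window + 1)):
                if j != i:
                    yield center, idx[j]

def train_skipgram(pairs, vocab_size, epochs=20, lr=1e-2):
    pairs = list(pairs)
    centers = torch.tensor([c for c, _ in pairs])
    contexts = torch.tensor([x for _, x in pairs])
    model = SkipGram(vocab_size)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()             # softmax + NLL over the vocabulary
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(centers), contexts)
        loss.backward()
        opt.step()
    return model
```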
- Familiarize yourself with skip-grams and how to train them using pytorch.
- Use the titles from theses.tsv to compute word embeddings over a context of 5. Note: it may be helpful to lower-case the data.
- Analyze: What are the most similar words to "Konzeption", "Cloud" and "virtuelle"? (See the similarity sketch after this list.)
- Play: Using the computed embeddings, can you identify the most similar theses?
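For the analysis, nearest neighbours by cosine similarity over the input embedding matrix are one option; this sketch assumes the model and word2idx from the training sketch above, plus an inverse idx2word mapping.

```python
# Sketch: nearest neighbours in the embedding space by cosine similarity.
import torch
import torch.nn.functional as F

def most_similar(word, model, word2idx, idx2word, k=10):
    emb = model.in_emb.weight.detach()                   # (vocab_size, emb_dim)
    query = emb[word2idx[word]]
    sims = F.cosine_similarity(query.unsqueeze(0), emb)  # (vocab_size,)
    best = torch.topk(sims, k + 1).indices.tolist()      # +1 to skip the query word itself
    return [idx2word[i] for i in best if i != word2idx[word]][:k]

# e.g. most_similar("konzeption", model, word2idx, idx2word)
```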
Implement a basic (word-based) RNN-LM for the thesis titles. You can use either the embeddings from above or learn a dedicated embedding layer.
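A possible setup (a sketch under my own assumptions, not the required architecture) is an embedding layer, optionally initialised from the skip-gram vectors, feeding a recurrent layer and a linear output over the vocabulary:

```python
# Sketch of a word-based RNN-LM; the option to copy pre-trained skip-gram
# vectors and all hyper-parameters are assumptions.
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, pretrained=None):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:                 # optionally re-use the skip-gram vectors
            self.emb.weight.data.copy_(pretrained)
        self.rnn = nn.RNN(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x, h=None):                  # x: (batch, seq_len) word indices
        o, h = self.rnn(self.emb(x), h)
        return self.out(o), h                      # next-word scores at every position
```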
- Implement, evaluate: Using 5-fold cross-validation, what is the average perplexity? (A perplexity sketch follows this list.)
- Recall assignment 2: what perplexity does a regular 4-gram achieve on the same splits?
- Sample a few random thesis titles from the RNN-LM; are they any good, or better than those from assignment 2?
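Perplexity is the exponential of the average per-token cross-entropy on the held-out fold. The sketch below also includes a simple sampler; both are written against the RNNLM sketch above, and the batch shapes, `<bos>`/`<eos>` markers and `max_len` are my own conventions.

```python
# Sketch: perplexity = exp(average cross-entropy per predicted token), plus a
# word-by-word sampler for generating titles.
import math
import torch
import torch.nn as nn

def perplexity(model, batches):
    """batches yields (inputs, targets), both (batch, seq_len) index tensors."""
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    total_loss, total_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for inputs, targets in batches:
            scores, _ = model(inputs)                          # (batch, seq_len, vocab)
            total_loss += loss_fn(scores.reshape(-1, scores.size(-1)),
                                  targets.reshape(-1)).item()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)

def sample_title(model, word2idx, idx2word, bos="<bos>", eos="<eos>", max_len=20):
    """Draw one title by sampling from the model's next-word distribution."""
    model.eval()
    tokens, h = [word2idx[bos]], None
    with torch.no_grad():
        for _ in range(max_len):
            scores, h = model(torch.tensor([[tokens[-1]]]), h)
            probs = torch.softmax(scores[0, -1], dim=-1)
            nxt = torch.multinomial(probs, 1).item()
            if nxt == word2idx[eos]:
                break
            tokens.append(nxt)
    return " ".join(idx2word[i] for i in tokens[1:])
```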
The theses.tsv also contains type (diploma, bachelor, master) and category (internal/external) for each thesis.
In this part, we want to classify whether a thesis is bachelor or master, and whether it is internal or external.
Since pytorch provides most things more or less out of the box, compare the following architectures on a 5-fold cross-validation: (vanilla) RNN, GRU, LSTM and bi-LSTM. Which activations did you use?
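One way to keep the comparison fair is a single classifier class where the cell type is a parameter. This is a sketch only; the last-time-step pooling, hidden size and naming are assumptions, not requirements.

```python
# Sketch of a recurrent title classifier with a configurable cell type, so the
# same code can be run as vanilla RNN (tanh activation by default), GRU, LSTM
# and bi-LSTM. Hyper-parameters are illustrative.
import torch
import torch.nn as nn

class TitleClassifier(nn.Module):
    def __init__(self, vocab_size, n_classes, cell="lstm", bidirectional=False,
                 emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        rnn_cls = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}[cell]
        self.rnn = rnn_cls(emb_dim, hidden, batch_first=True,
                           bidirectional=bidirectional)
        self.out = nn.Linear(hidden * (2 if bidirectional else 1), n_classes)

    def forward(self, x):                    # x: (batch, seq_len) word indices
        o, _ = self.rnn(self.emb(x))
        return self.out(o[:, -1, :])         # logits from the last time step

# e.g. the four models to compare on each fold:
# TitleClassifier(V, 2, cell="rnn"), TitleClassifier(V, 2, cell="gru"),
# TitleClassifier(V, 2, cell="lstm"), TitleClassifier(V, 2, cell="lstm", bidirectional=True)
```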
- Filter out all diploma theses; they might be too easy to spot because they only cover "old" topics.
- Train and evaluate your models on a 5-fold cross-validation; as in RNN-LM, you can either learn the embeddings or re-use the ones from the skip-gram.
- Assemble a table: recall, precision and F1 measure for each of the recurrent model architectures listed above. Which one works best? (See the reporting sketch after this list.)
- Bonus: Apply your best classifier to the remaining diploma theses; are those on average more bachelor or master? :-)
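For assembling the table, scikit-learn's precision_recall_fscore_support is convenient (using it is my choice, not a requirement); the sketch assumes that, for each model name, true and predicted binary labels have been collected across the five folds.

```python
# Sketch: per-model precision/recall/F1 table from pooled cross-validation
# predictions; `results[name] = (y_true, y_pred)` with binary 0/1 labels is assumed.
from sklearn.metrics import precision_recall_fscore_support

def report(results):
    print(f"{'model':10s} {'precision':>9s} {'recall':>7s} {'F1':>6s}")
    for name, (y_true, y_pred) in results.items():
        p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                      average="binary")
        print(f"{name:10s} {p:9.3f} {r:7.3f} {f1:6.3f}")
```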