Material for the hands-on workshop at the "Applied Machine Learning Days at EPFL 2020".
Authors:
- Ioannis Partalas
- Georgios Balikas
- Eric Bruno
You should have the following Python 3 packages installed:
numpy
pandas
scikit-learn
torch
umap-learn
seaborn
xgboost
If you use Colab, all of these packages should already be available. If any are missing, you can install them from a notebook cell:
!pip install umap-learn torch seaborn
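Outside Colab, one way to check which of the packages listed above can be imported is the following minimal sketch. It relies only on the standard library. Note that the pip name and the import name differ for scikit-learn (sklearn) and umap-learn (umap); the helper name missing_packages is hypothetical and not part of the workshop code.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name used in `import`.
# The two differ for scikit-learn and umap-learn.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "torch": "torch",
    "umap-learn": "umap",
    "seaborn": "seaborn",
    "xgboost": "xgboost",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```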
The notebooks use the ConceptNet Numberbatch embeddings. We provide a script that downloads and extracts them. To run it:
bash download_conceptNet.sh
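Once the embeddings are extracted, they can be read with a few lines of Python. The sketch below is not the loader used in the notebooks. It assumes the extracted file is in the word2vec-style text format (a header line with the number of entries and the dimension, then one term and its vector per line) and that multilingual terms look like /c/en/cat; both points are assumptions to verify against the file the script produces. The function name load_embeddings is hypothetical.

```python
import io


def load_embeddings(lines, language=None):
    """Parse word2vec-style text lines into a {term: vector} dict.

    `lines` is any iterable of text lines, for example an open file.
    If `language` is given (e.g. "en"), keep only terms of the form
    /c/<language>/<term> and strip that prefix (assumed term format).
    """
    lines = iter(lines)
    # Header line: "<number of entries> <dimension>"
    _, dim = (int(x) for x in next(lines).split())
    prefix = f"/c/{language}/" if language else None
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        term, values = parts[0], parts[1:]
        if prefix:
            if not term.startswith(prefix):
                continue
            term = term[len(prefix):]
        vectors[term] = [float(v) for v in values]
    return vectors


# Tiny in-memory example with made-up 2-d vectors (illustration only).
sample = io.StringIO(
    "3 2\n"
    "/c/en/cat 0.1 0.2\n"
    "/c/fr/chat 0.1 0.3\n"
    "/c/en/dog 0.5 0.6\n"
)
english = load_embeddings(sample, language="en")
```

For the real file, pass an open file handle instead of the in-memory sample.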
You will also need to install the LASER library. To do so, run the following bash script:
bash install_laser.sh
Finally, download the data that we will use during the workshop, which comes from the SemEval 2016 competition on aspect-based sentiment analysis. You will need the following datasets:
- English, Dutch, Russian, Spanish and Turkish from the Restaurants domain, subtask 1.
- Arabic from the Hotels domain, subtask 1.
- The test data with the gold annotations for subtask 1.
These datasets are in XML format, but for this workshop we need them as CSV. You can convert them with the following script, which you will find in the src folder:
python semeval2csv.py --infile INFILE --outfile OUTFILE [--train]
where you specify the input file, the output file, and whether the input is a training set (pass --train) or a test set. Create a directory named datasets under data and put the generated CSV files there. You will have to adapt the file naming convention used for loading the files in the Dataset class accordingly.
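To give an idea of what such a conversion involves, here is a minimal sketch using only the standard library. It is not the semeval2csv.py script and may differ from it. It assumes the XML layout with sentence elements that carry an id attribute, a text child, and Opinion elements with category and polarity attributes; the output column names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def sentences_to_rows(xml_text):
    """Yield (sentence_id, text, category, polarity), one row per opinion."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        text = sentence.findtext("text", default="")
        for opinion in sentence.iter("Opinion"):
            yield (
                sentence.get("id"),
                text,
                opinion.get("category"),
                opinion.get("polarity"),
            )


def write_csv(rows, handle):
    """Write the rows with a header line to an open text handle."""
    writer = csv.writer(handle)
    # Hypothetical column names, chosen for this illustration only.
    writer.writerow(["sentence_id", "text", "category", "polarity"])
    writer.writerows(rows)


# Made-up one-sentence example in the assumed layout.
sample_xml = (
    '<Reviews><Review rid="r1"><sentences>'
    '<sentence id="r1:0"><text>Great food.</text><Opinions>'
    '<Opinion target="food" category="FOOD#QUALITY" '
    'polarity="positive" from="6" to="10"/>'
    "</Opinions></sentence>"
    "</sentences></Review></Reviews>"
)
rows = list(sentences_to_rows(sample_xml))
buffer = io.StringIO()
write_csv(rows, buffer)
```

Sentences without any Opinion element produce no rows in this sketch; whether they should be kept depends on the task setup.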
The workshop structure is as follows:
- Brief introduction to text classification: Intro
- Introduction to cross-lingual word embeddings: Cross-lingual word embeddings intro
- Cross-lingual document classification:
  - Zero-shot learning.
  - Few-shot learning and fine-tuning.
- Repeat the experiments for various language pairs
- Add hyper-parameter tuning
- Repeat the experiments with other embeddings (e.g., MUSE, Ferreira et al.)
- For the target language use another domain
- Explore the world of Transformers (BERT etc.). You can take a look at Hugging Face.
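
As an illustration of the zero-shot setting listed in the workshop structure, the sketch below trains on labelled documents in one language and predicts on another language with no labelled target data, relying on both languages sharing one embedding space. The nearest-centroid classifier, the averaging of word vectors, and the toy 2-d vectors are all assumptions made for this example; they are not the models or data used in the notebooks.

```python
from collections import defaultdict


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed(tokens, embeddings):
    """Average the vectors of known tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    return average(known) if known else None


def fit_centroids(docs, labels, embeddings):
    """Compute one centroid per label from source-language documents."""
    by_label = defaultdict(list)
    for tokens, label in zip(docs, labels):
        vec = embed(tokens, embeddings)
        if vec is not None:
            by_label[label].append(vec)
    return {label: average(vecs) for label, vecs in by_label.items()}


def predict(tokens, centroids, embeddings):
    """Return the label of the closest centroid (squared Euclidean)."""
    vec = embed(tokens, embeddings)
    if vec is None:
        return None

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))


# Toy shared space with made-up vectors (illustration only).
toy = {
    "good": [1.0, 0.0], "bad": [-1.0, 0.0],    # English
    "bueno": [0.9, 0.1], "malo": [-0.9, 0.1],  # Spanish
}

# Train on English only, then predict on Spanish.
centroids = fit_centroids([["good"], ["bad"]], ["positive", "negative"], toy)
prediction = predict(["bueno"], centroids, toy)
```

In the few-shot setting, a small number of labelled target-language documents would be added to the training data before fitting.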