CrossLingual-NLP-AMLD2020

Material for the hands-on workshop at the Applied Machine Learning Days (AMLD) 2020 at EPFL.

Authors:

Setup

Install the following Python 3 packages:

numpy
pandas
scikit-learn
torch
umap-learn
seaborn
xgboost

If you use Colab, all of these packages should already be available. If any are missing, install them from a notebook cell with pip:

!pip install umap-learn torch seaborn
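To see which packages still need installing, a small check like the one below may help. It is a minimal sketch, not part of the workshop material: the function name missing_packages is hypothetical, and the mapping reflects the fact that scikit-learn and umap-learn are imported as sklearn and umap.

```python
from importlib.util import find_spec

# Map each pip package name to the module name used in `import` statements.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "torch": "torch",
    "umap-learn": "umap",
    "seaborn": "seaborn",
    "xgboost": "xgboost",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```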

The notebooks use the ConceptNet Numberbatch embeddings. We provide a script that downloads and extracts them:

bash download_conceptNet.sh
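Once the embeddings are on disk, they can be read into a dictionary. The sketch below assumes a word2vec-style text layout (a header line with vocabulary size and dimension, then one token per line followed by its vector) and keys of the form /c/<lang>/<term>, as in the multilingual Numberbatch release. The files produced by the scripts in this repository may be laid out differently, and load_text_embeddings is a hypothetical helper, not a function from the repository.

```python
import io


def load_text_embeddings(lines, lang_prefix=None):
    """Parse word2vec-style text lines into a {token: vector} dict.

    `lines` is any iterable of strings (an open file works). If
    `lang_prefix` is given (e.g. "/c/en/"), only matching keys are kept.
    """
    iterator = iter(lines)
    header = next(iterator).split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        token, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip blank or malformed rows
        if lang_prefix is not None and not token.startswith(lang_prefix):
            continue
        vectors[token] = [float(v) for v in values]
    return vectors


# Tiny in-memory sample standing in for the real file (made-up values).
sample = io.StringIO(
    "3 2\n"
    "/c/en/good 0.1 0.2\n"
    "/c/es/bueno 0.1 0.3\n"
    "/c/en/bad -0.4 0.0\n"
)
english = load_text_embeddings(sample, lang_prefix="/c/en/")
print(sorted(english))
```

Filtering by language prefix while reading keeps memory use down, since the multilingual file covers many more languages than a single experiment needs.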

You will need to install the LASER library. To do so you can just run the following bash script:

bash install_laser.sh

Finally, download the datasets that we will use during the workshop. They come from the SemEval 2016 competition on aspect-based sentiment analysis. You will need the following:

English, Dutch, Russian, Spanish and Turkish from the restaurants domain, subtask 1.
Arabic from the hotels domain, subtask 1.
The test data with the gold annotations for subtask 1.

These datasets are in XML format, but the workshop needs them as CSV. To convert them, run the following script, which you will find in the src folder:

python semeval2csv.py --infile INFILE --outfile OUTFILE [--train]

Here INFILE and OUTFILE are the input and output files, and the optional --train flag marks the input as a training set rather than a test set. Create a directory named datasets under data and put the generated CSV files there. You will have to adapt the file naming convention used for loading the files in the Dataset class accordingly.
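For orientation, the conversion can be pictured roughly as below. This is only a sketch and not the code of semeval2csv.py: the element and attribute names (sentence, text, Opinion, polarity) are assumptions about the SemEval 2016 XML layout, and the CSV column names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def semeval_sentences(xml_text):
    """Yield (sentence_id, text, polarity) for each annotated opinion."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        text = sentence.findtext("text", default="").strip()
        for opinion in sentence.iter("Opinion"):
            yield sentence.get("id"), text, opinion.get("polarity")


def write_csv(rows, handle):
    """Write the rows to an open text handle with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["id", "text", "polarity"])  # hypothetical column names
    writer.writerows(rows)


# Hand-written sample in the assumed layout, not taken from the real data.
sample_xml = """<Reviews>
  <Review rid="1">
    <sentences>
      <sentence id="1:0">
        <text>The food was great.</text>
        <Opinions>
          <Opinion target="food" category="FOOD#QUALITY" polarity="positive" from="4" to="8"/>
        </Opinions>
      </sentence>
    </sentences>
  </Review>
</Reviews>"""

buffer = io.StringIO()
write_csv(semeval_sentences(sample_xml), buffer)
print(buffer.getvalue())
```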

Structure of the workshop

The workshop structure is as follows:

Brief introduction to text classification: Intro
Introduction to cross-lingual word embeddings: Cross-lingual word embeddings intro
Cross-lingual document classification:
1. Zero-shot learning.
2. Few-shot learning and fine-tuning.
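
The idea behind the zero-shot setting can be sketched in a few lines: documents from both languages are mapped into one shared embedding space, a classifier is trained on the source language only, and it is then applied unchanged to the target language. The example below is an illustration under stated assumptions, not the code used in the notebooks: the vectors are made up, documents are represented by averaged word vectors, and a simple nearest-centroid classifier stands in for whatever model the notebooks train.

```python
import math

# Toy shared embedding space (made-up 2-d vectors; real ones are much larger).
EMBEDDINGS = {
    "/c/en/good": [0.9, 0.1], "/c/en/tasty": [0.8, 0.2],
    "/c/en/bad": [-0.9, 0.1], "/c/en/slow": [-0.7, 0.3],
    "/c/es/bueno": [0.85, 0.15], "/c/es/malo": [-0.85, 0.15],
}


def embed(tokens, lang):
    """Average the vectors of the known tokens; None if none are known."""
    known = [EMBEDDINGS[f"/c/{lang}/{t}"] for t in tokens
             if f"/c/{lang}/{t}" in EMBEDDINGS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_centroids(docs, labels, lang):
    """Compute one mean document vector per label from source-language data."""
    grouped = {}
    for tokens, label in zip(docs, labels):
        vec = embed(tokens, lang)
        if vec is not None:
            grouped.setdefault(label, []).append(vec)
    return {label: [sum(d) / len(vecs) for d in zip(*vecs)]
            for label, vecs in grouped.items()}


def predict(centroids, tokens, lang):
    """Return the label of the most similar centroid, or None if no token is known."""
    vec = embed(tokens, lang)
    if vec is None:
        return None
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Train on English only, then classify Spanish without any Spanish labels.
centroids = fit_centroids(
    [["good", "tasty"], ["bad", "slow"]], ["positive", "negative"], lang="en")
print(predict(centroids, ["bueno"], lang="es"))
print(predict(centroids, ["malo"], lang="es"))
```

In the few-shot setting, a handful of labeled target-language documents would additionally be used to adapt the model trained on the source language.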

Exercises

Repeat the experiments for various language pairs
Add hyper-parameter tuning
Repeat the experiments with other embeddings (e.g., MUSE, Ferreira et al.)
Use another domain for the target language
Explore the world of Transformers (BERT etc.). You can take a look at Hugging Face

ioannispartalas/CrossLingual-NLP-AMLD2020
