🪐 Weasel Project: Comparing embedding layers in spaCy

This project contains the code to reproduce the results of the technical report "Multi hash embeddings in spaCy" by Explosion. The project.yml provides commands to download and preprocess the data sets and to run the training and evaluation procedures. Different configurations of vars correspond to different experiments in the report.

A few scripts are included that were used while writing the technical report to run experiments in bulk and summarize the results. scripts/run_experiments.py runs multiple experiments one after the other by constructing and running spacy project run commands. scripts/collate_results.py summarizes the results of the same trial run with multiple seeds. Finally, scripts/plot_results.py was used to produce the visualizations in the report. These are all small command line apps, and each one documents its usage under the --help flag.
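The bulk-run idea described above can be sketched as follows. This is only an illustration and not the code in scripts/run_experiments.py: the variable names dataset and seed, the data set name, and the assumption that variables can be overridden with --vars.name value arguments are all placeholders that should be checked against project.yml and the script's --help output.

```python
import shlex
from itertools import product


def build_commands(command, seeds, datasets, project_dir="."):
    """Build one `spacy project run` invocation per (dataset, seed) pair."""
    commands = []
    for dataset, seed in product(datasets, seeds):
        commands.append([
            "python", "-m", "spacy", "project", "run", command, project_dir,
            "--vars.dataset", dataset,  # hypothetical variable name
            "--vars.seed", str(seed),   # hypothetical variable name
        ])
    return commands


if __name__ == "__main__":
    # "conll" is a placeholder data set name, not taken from the project.
    for cmd in build_commands("train", seeds=[0, 1, 2], datasets=["conll"]):
        # Print instead of executing; pass `cmd` to subprocess.run to run it.
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments rather than a single string avoids most shell quoting problems when the commands are eventually passed to subprocess.run.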

The rows argument for the train-adjust-rows command is provided as a list, which may lead to errors on Windows machines. As a result, it might not be possible to reproduce the MultiHashEmbed (adjusted) experiments from the report on Windows using run_experiments.py. This is due to a known issue with handling quotes on Windows and is something we are looking into. As a workaround, the config files can be edited manually, or by any other means, to adjust the number of rows for the hash embedding layers. We apologize for the inconvenience.
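One way to apply the workaround without passing a list on the command line is to rewrite the rows setting in the config text directly. The sketch below assumes the config contains a line of the form rows = [...]; the section name and the values in the usage example are made up for illustration, so inspect the actual config file before relying on it.

```python
import re

# Matches a whole line such as "rows = [5000,2500]" and captures the "rows = " prefix.
ROWS_LINE = re.compile(r"^([ \t]*rows[ \t]*=[ \t]*)\[[^\]]*\][ \t]*$", re.MULTILINE)


def set_rows(config_text, rows):
    """Return config_text with every `rows = [...]` line set to the given list."""
    new_value = "[" + ",".join(str(r) for r in rows) + "]"
    updated, count = ROWS_LINE.subn(lambda m: m.group(1) + new_value, config_text)
    if count == 0:
        raise ValueError("no 'rows = [...]' line found in the config text")
    return updated


if __name__ == "__main__":
    # Hypothetical config fragment; the real section name may differ.
    sample = "[components.tok2vec.model.embed]\nrows = [5000,2500]\n"
    print(set_rows(sample, [2000, 1000]))
```

Note that the function replaces every matching rows line. If a config defines more than one embedding layer, narrow the pattern or edit the file by hand.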

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using weasel run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| prepare-datasets | Download and preprocess all available data sets using the span-labeling-datasets project. |
| download-models | Download spaCy models for their word-embeddings. |
| init-fasttext | Initialize the FastText vectors. |
| make-tables | Pre-compute token-to-id tables for MultiEmbed. |
| init-labels | Initialize labels before training. |
| train | Train NER model. |
| train-adjust-rows | Train NER model with adjustable number of rows. |
| train-hash | Train NER model with different number of hash functions (only works with multifewerhashembed.cfg). |
| evaluate | Evaluate NER model. |
| evaluate-seen-only | Evaluate NER model on the dev and test sets, considering only entities that appear in the training set. |
| evaluate-unseen-only | Evaluate NER model on the dev and test sets, considering only entities that did not appear in the training set. |

⏭ Workflows

The following workflows are defined by the project. They can be executed using weasel run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| setup | download-models → init-fasttext → prepare-datasets → make-tables |
| trial | init-labels → train → evaluate → evaluate-seen-only → evaluate-unseen-only |
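To make the ordering explicit, the two workflows can be written out as the sequence of single weasel run invocations they stand for. This is a small sketch based only on the workflow listing above; it does not reproduce Weasel's own behaviour of skipping commands whose inputs are unchanged.

```python
# Workflow steps copied from the workflow listing above.
WORKFLOWS = {
    "setup": ["download-models", "init-fasttext", "prepare-datasets", "make-tables"],
    "trial": [
        "init-labels",
        "train",
        "evaluate",
        "evaluate-seen-only",
        "evaluate-unseen-only",
    ],
}


def expand_workflow(name):
    """Return the per-step `weasel run` commands for a workflow, in order."""
    return [["weasel", "run", step] for step in WORKFLOWS[name]]


if __name__ == "__main__":
    for cmd in expand_workflow("trial"):
        print(" ".join(cmd))
```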

🗂 Assets

The following assets are defined by the project. They can be fetched by running weasel assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/fasttext.en.gz | URL | English fastText vectors. |
| assets/fasttext.es.gz | URL | Spanish fastText vectors. |
| assets/fasttext.nl.gz | URL | Dutch fastText vectors. |
| span-labeling-datasets | Git | |