Baseline methods for conversational response selection

This directory provides several baseline methods for conversational response selection, which help to benchmark performance on the various datasets.

Baselines are intended to be relatively fast to run locally, and are not intended to be highly competitive with state-of-the-art methods. As such, they are limited to using a small portion of the training set, typically ten thousand randomly sampled examples.

Note that the baselines use only the context feature to rank responses; they do not take the extra_contexts feature into account.

Keyword-based

The keyword-based baselines use keyword similarity metrics to rank responses given a context. The TF_IDF method computes inverse document frequency statistics on the training set. Responses are scored using their tf-idf cosine similarity to the context.
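
For illustration, here is a minimal sketch of tf-idf scoring using scikit-learn. This is not the repository's implementation, and the toy training texts, context, and responses below are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit inverse document frequency statistics on the training set.
train_texts = [
    "what is your favourite film",
    "i like science fiction movies",
    "the weather is nice today",
]
vectorizer = TfidfVectorizer().fit(train_texts)

# Score candidate responses by tf-idf cosine similarity to the context.
context = vectorizer.transform(["any good film recommendations"])
responses = vectorizer.transform([
    "try some science fiction movies",
    "it might rain later",
])
scores = cosine_similarity(context, responses)[0]
ranking = scores.argsort()[::-1]  # best-scoring response first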

The BM25 method builds on top of the tf-idf similarity, applying an adjustment to the term weights. See Okapi BM25: a non-binary model for further discussion of the approach.
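
As a sketch of the adjustment, the standard Okapi BM25 weight for a term replaces the raw tf-idf weight with a saturated, length-normalized version. The parameter defaults below (k1, b) are typical textbook values, not necessarily those used by the baseline:

def bm25_weight(tf, doc_len, avg_doc_len, idf, k1=1.5, b=0.75):
    """Okapi BM25 weight for one term occurring tf times in a document.

    k1 controls term-frequency saturation, b controls document length
    normalization, and idf is the term's inverse document frequency.
    """
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)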

Vector-based

The vector-based methods use publicly available neural network embedding models to embed contexts and responses into a vector space. The models implemented currently are:

  • USE - the Universal Sentence Encoder
  • USE_LARGE - a larger version of the Universal Sentence Encoder
  • ELMO - the Embeddings from Language Models approach
  • BERT_SMALL - the Bidirectional Encoder Representations from Transformers approach
  • BERT_LARGE - a larger version of BERT

all of which are loaded from TensorFlow Hub.
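
As an example of how such a model is obtained, a TF1-style hub module can be loaded and applied as follows. This is a minimal sketch: the module URL shown is the public Universal Sentence Encoder, and the exact module versions may differ from those used by the baselines:

import tensorflow as tf
import tensorflow_hub as hub

# Load a sentence embedding module from TensorFlow Hub.
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")

sentences = tf.placeholder(tf.string, shape=[None])
embeddings = embed(sentences)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = session.run(
        embeddings,
        feed_dict={sentences: ["hello there", "how are you"]})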

There are two vector-based baseline methods, one for each of the above models. The SIM method ranks responses according to their cosine similarity with the context vector. This method does not use the training set at all.
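
A minimal sketch of the SIM ranking step, assuming contexts and responses have already been embedded (NumPy; the function name is illustrative):

import numpy as np

def rank_responses_by_sim(context_vec, response_vecs):
    """Return response indices sorted by cosine similarity, best first."""
    context_vec = context_vec / np.linalg.norm(context_vec)
    response_vecs = response_vecs / np.linalg.norm(
        response_vecs, axis=1, keepdims=True)
    scores = response_vecs @ context_vec
    return np.argsort(-scores)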

The MAP method learns a linear mapping on top of the response vector. The final score of a response with vector y given a context with vector x is the cosine similarity of the context vector with the mapped response vector:

cos(x, y')

where

y' = Wy + αy

and W and α are learned parameters. This allows for learning an arbitrary linear mapping on the response side, while making it easy for the model to interpolate with the SIM baseline via the residual connection gated by α. Vectors are L2-normalized before being fed to the MAP method, so that the method is invariant to scaling.
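
In code, the MAP scoring step might look like the following sketch (NumPy; W is a square matrix and α a scalar, both learned; the function name is illustrative):

import numpy as np

def map_score(x, y, W, alpha):
    """Cosine similarity of context vector x with the mapped response y.

    x and y are assumed to be L2-normalized already, as described above.
    """
    y_mapped = W @ y + alpha * y  # residual connection gated by alpha
    return x @ y_mapped / (np.linalg.norm(x) * np.linalg.norm(y_mapped))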

The parameters are learned on the training set, using the dot product loss from Henderson et al. (2017). A sweep over learning rate and regularization parameters is performed using a held-out dev set. The final learned parameters are then used on the evaluation set.
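
The dot product loss treats the other responses in a batch as negatives: with score matrix S where S[i, j] is the dot product of context i and response j, the loss is the mean softmax cross-entropy of the matching pairs. A sketch, assuming batches of already-mapped vectors:

import numpy as np

def dot_product_loss(context_vecs, mapped_response_vecs):
    """Batch softmax loss with in-batch negatives.

    Row i of the score matrix holds the dot products of context i with
    every response in the batch; response i is the positive example.
    A numerically stable log-sum-exp would be used in practice.
    """
    scores = context_vecs @ mapped_response_vecs.T  # [batch, batch]
    log_norm = np.log(np.exp(scores).sum(axis=1))
    return -(np.diag(scores) - log_norm).mean()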

The combination of the five embedding models with the two vector-based methods gives ten baseline methods: USE_SIM, USE_MAP, USE_LARGE_SIM, USE_LARGE_MAP, ELMO_SIM, ELMO_MAP, BERT_SMALL_SIM, BERT_SMALL_MAP, BERT_LARGE_SIM and BERT_LARGE_MAP.

Running the baselines

Copy the data locally

Before running the baselines, it may be beneficial to copy the data locally. This speeds up reading the data, but is not necessary if you are running on a Google Cloud machine in the correct region.

If the dataset is stored in $DATADIR, a Google Cloud Storage path such as gs://your-bucket/reddit/YYYYMMDD, then copy a shard of the training set and a shard of the test set:

mkdir data
gsutil cp ${DATADIR?}/train-00001-* data/
gsutil cp ${DATADIR?}/test-00001-* data/

For Amazon QA data, you will need to copy two shards of the test set to get enough examples.

This provides a random subset of the train and test set to use for the baselines. Recall that conversational datasets are always randomly shuffled and sharded.

Run the baselines

When running the vector-based methods, make use of TensorFlow Hub's caching to speed up repeated runs:

export TFHUB_CACHE_DIR=~/.tfhub_cache

Then run an individual baseline with:

python baselines/run_baseline.py \
  --method TF_IDF \
  --train_dataset data/train-* \
  --test_dataset data/test-*

Note that the USE_LARGE, ELMO, and BERT-based baselines are slow, and may benefit from faster hardware. For these methods, set --eval_num_batches 100.
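
For example, to run one of the slower baselines with a reduced number of evaluation batches:

python baselines/run_baseline.py \
  --method BERT_LARGE_SIM \
  --train_dataset data/train-* \
  --test_dataset data/test-* \
  --eval_num_batches 100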

The run_baselines.sh script runs all the baselines on all the datasets.