# Evaluating Language Models

EasyLM has built-in support for evaluating language models on a variety of tasks. Once the trained language model is served with LMServer, it can be evaluated against various benchmarks in few-shot and zero-shot settings.
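The typical workflow is to start the server first and then point the evaluation scripts at its URL. The sketch below is illustrative only: the serving module path and its flags are assumptions that depend on the model you trained, so consult the LMServer documentation for the exact command.

```sh
# Illustrative sketch only: the module path and flags below are assumptions
# and depend on your model; see the LMServer documentation for real options.
python -m EasyLM.models.llama.llama_serve \
    --load_checkpoint='params::/path/to/checkpoint' \
    --lm_server.port=5007
```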

## LM Evaluation Harness

EasyLM comes with built-in support for lm-eval-harness, which can evaluate the language model on a variety of tasks. For example, you can use the following command to evaluate the language model served with the HTTP server:

```sh
# Zero-shot evaluation of the served model on five lm-eval-harness tasks.
python -m EasyLM.scripts.lm_eval_harness \
    --lm_client.url='http://localhost:5007/' \
    --tasks='wsc,piqa,winogrande,openbookqa,logiqa' \
    --shots=0
```

The lm_eval_harness script supports the following command-line options (a combined example follows the list):

- `tasks`: a comma-separated list of tasks to evaluate the language model on. The supported tasks are listed in the lm-eval-harness task table.
- `shots`: the number of shots to use for the evaluation.
- `batch_size`: the batch size to use for each HTTP request. Too large a batch size may cause the request to time out. Defaults to 1.
- `lm_client`: the configurations for LMClient. See the LMClient documentation for more details.
- `logger`: the configurations for the logger. See the logger documentation for more details.
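For example, these options can be combined as below; the task list and values are only illustrative, not a recommended configuration:

```sh
# 5-shot evaluation with a larger per-request batch; values are illustrative.
python -m EasyLM.scripts.lm_eval_harness \
    --lm_client.url='http://localhost:5007/' \
    --tasks='piqa,openbookqa' \
    --shots=5 \
    --batch_size=4
```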

## Evaluating on MMLU

The served language model can also be evaluated on the MMLU benchmark. To run the evaluation, you'll need to use my fork of MMLU, which supports the EasyLM LMServer.

```sh
# Clone the EasyLM-compatible MMLU fork and run a 5-shot evaluation
# against the served model.
git clone https://github.com/young-geng/mmlu_easylm.git
cd mmlu_easylm
python evaluate_easylm.py \
    --name='llama' \
    --lm_server_url='http://localhost:5007' \
    --ntrain=5
```
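Here `--ntrain=5` corresponds to the standard 5-shot MMLU setup, and `--lm_server_url` should point at the address of the running LMServer; `--name` is presumably just a label used for the run's outputs.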