
LongBench Inference Script

LongBench is a multi-task, bilingual (Chinese and English) benchmark aimed at evaluating the long-text understanding capabilities of large language models, covering multiple long-text application scenarios. Below is an introduction to the LongBench evaluation method.

Environment Setup

Configure your environment according to the official requirements.txt. Here are the necessary dependencies:

datasets
tqdm
rouge
jieba
fuzzywuzzy
einops
torch>=2.0.1
transformers>=4.40.0
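
For reference, these can be installed in one step with pip:

pip install datasets tqdm rouge jieba fuzzywuzzy einops "torch>=2.0.1" "transformers>=4.40.0"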

Data Preparation

There is no need to download data separately. The prediction script will automatically download the required data from 🤗 Datasets.
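
Under the hood, each task is pulled from the THUDM/LongBench dataset on 🤗 Datasets. A rough sketch of what the automatic download amounts to (dureader is just one example task; depending on your datasets version, trust_remote_code=True may also be required for this script-based dataset):

from datasets import load_dataset

# Each LongBench task is a separate config; "dureader" is one of the Chinese tasks.
data = load_dataset("THUDM/LongBench", "dureader", split="test")
sample = data[0]
print(sample["input"])    # the question / instruction for this sample
print(sample["answers"])  # the reference answers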

Run the Prediction Script

Execute the following script for inference:

model_path=path/to/llama-3-chinese
output_dir=path/to/output_dir
data_class=zh
with_inst="auto"
max_length=7680

cd scripts/longbench
python pred_llama3.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --max_length ${max_length} \
    --use_flash_attention_2 \
    --with_inst ${with_inst}
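
For instance, to predict on all task sets over the LongBench-E split using the first two GPUs (see the parameter explanations below), the invocation might look like:

cd scripts/longbench
python pred_llama3.py \
    --model_path ${model_path} \
    --predict_on en,zh,code \
    --output_dir ${output_dir} \
    --max_length ${max_length} \
    --use_flash_attention_2 \
    --gpus 0,1 \
    --e \
    --with_inst auto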

Parameter Explanation

  • --model_path ${model_path}: Directory containing the model to evaluate (a complete Llama-3-Chinese or Llama-3-Chinese-Instruct model, not a LoRA adapter)

  • --predict_on ${data_class}: Specifies the task sets for prediction; can be en, zh, code, or a comma-separated combination thereof, e.g., en,zh,code

  • --output_dir ${output_dir}: Directory for storing evaluation results

  • --max_length ${max_length}: Maximum length of the prompt; longer inputs are truncated (see the sketch after this list). Note that this length does not include the task-specific prompt

  • --gpus ${gpus}: To specify particular GPUs, use this parameter, e.g., 0,1

  • --e: Predict on the LongBench-E dataset. Refer to the official LongBench documentation for detailed explanations of LongBench-E.

  • --with_inst ${with_inst}: Whether to use the system prompt and template of Llama-3-Chinese-8B-Instruct when constructing the instructions:

    • true: use the system prompt and template on all tasks
    • false: use the system prompt and template on none of the tasks
    • auto: use the system prompt and template on some tasks (the default strategy of the official LongBench code)

    We suggest setting --with_inst to auto when testing Llama-3-Chinese-8B-Instruct, and to false when testing Llama-3-Chinese-8B.

  • --use_flash_attention_2: Use Flash-Attention 2 for accelerated inference; otherwise, SDPA is used
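
Note that prompts longer than --max_length must be cut down before inference. The official LongBench code truncates from the middle, keeping the head and tail of the tokenized prompt, since key information tends to sit at both ends. A minimal sketch of that strategy (tokenizer, prompt, and max_length are stand-ins for the script's actual variables):

input_ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
if len(input_ids) > max_length:
    # Keep the first and last halves of the tokens and drop the middle.
    half = max_length // 2
    prompt = tokenizer.decode(input_ids[:half], skip_special_tokens=True) \
             + tokenizer.decode(input_ids[-half:], skip_special_tokens=True)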

After the model completes its run, prediction files (in jsonl format) for the respective tasks will be generated in ${output_dir}/pred/ or ${output_dir}/pred_e/ (depending on whether you tested on LongBench-E, i.e., whether --e was used). To calculate the performance metrics, execute the following command:

python eval.py --output_dir ${output_dir}

If --e was used during prediction, pass --e during evaluation as well:

python eval.py --output_dir ${output_dir} --e

The results will be stored under ${output_dir}/pred/result.json or ${output_dir}/pred_e/result.json. For example:

{
    "lsht": 42.0,
    "multifieldqa_zh": 50.28,
    "passage_retrieval_zh": 89.5,
    "vcsum": 16.41,
    "dureader": 34.15
}
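
eval.py maps each task to its official LongBench metric. For Chinese tasks scored with ROUGE (e.g., vcsum), the text is segmented with jieba before scoring. A simplified sketch of the official rouge_zh_score, using the rouge and jieba packages from the dependency list:

import jieba
from rouge import Rouge

def rouge_zh_score(prediction, ground_truth):
    # The rouge package expects space-separated tokens, so segment the Chinese text first.
    pred = " ".join(jieba.cut(prediction))
    ref = " ".join(jieba.cut(ground_truth))
    scores = Rouge().get_scores([pred], [ref], avg=True)
    return scores["rouge-l"]["f"]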