longbench_en
LongBench is a multi-task, bilingual (Chinese and English) benchmark aimed at evaluating the long-text understanding capabilities of large language models, covering multiple long-text application scenarios. Below is an introduction to the LongBench evaluation method.
Configure your environment according to the official `requirements.txt`. The necessary dependencies are:

```
datasets
tqdm
rouge
jieba
fuzzywuzzy
einops
torch>=2.0.1
transformers>=4.40.0
```
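If you prefer to install the packages directly instead of using `requirements.txt`, a minimal setup sketch (version pins other than the `torch` and `transformers` minimums above are flexible):

```bash
# Install the dependencies listed above.
pip install datasets tqdm rouge jieba fuzzywuzzy einops \
    "torch>=2.0.1" "transformers>=4.40.0"

# Only needed if you plan to pass --use_flash_attention_2:
# Flash Attention 2 support in transformers requires the flash-attn package.
pip install flash-attn --no-build-isolation
```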
There is no need to download data separately. The prediction script will automatically download the required data from 🤗 Datasets.
Execute the following script for inference:
```bash
model_path=path/to/llama-3-chinese
output_dir=path/to/output_dir
data_class=zh
with_inst="auto"
max_length=7680

cd scripts/longbench
python pred_llama3.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --max_length ${max_length} \
    --use_flash_attention_2 \
    --with_inst ${with_inst}
```
- `--model_path ${model_path}`: Directory where the model to be evaluated is located (a complete Llama-3-Chinese or Llama-3-Chinese-Instruct model, not LoRA)
- `--predict_on ${data_class}`: Task sets to predict on; can be `en`, `zh`, `code`, or a comma-separated combination thereof, e.g., `en,zh,code`
- `--output_dir ${output_dir}`: Directory for storing the evaluation results
- `--max_length ${max_length}`: Maximum prompt length. Note that this length does not include the task-specific prompt
- `--gpus ${gpus}`: To run on specific GPUs, use this parameter, e.g., `0,1`
- `--e`: Predict on the LongBench-E dataset. Refer to the official LongBench documentation for a detailed explanation of LongBench-E
- `--with_inst ${with_inst}`: Whether to use the system prompt and template of Llama-3-Chinese-8B-Instruct when constructing the instructions:
  - `true`: use the system prompt and template on all tasks
  - `false`: use the system prompt and template on none of the tasks
  - `auto`: use the system prompt and template on some tasks (the default strategy of the official LongBench code)

  We suggest setting `--with_inst` to `auto` when testing Llama-3-Chinese-8B-Instruct, and to `false` when testing Llama-3-Chinese-8B; see the example after this list.
- `--use_flash_attention_2`: Use Flash Attention 2 for accelerated inference; otherwise SDPA is used
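For instance, a run of the base (non-Instruct) model on the Chinese and English task sets might look like the sketch below; the paths are placeholders and the GPU ids are illustrative:

```bash
# Hypothetical invocation: base model, zh+en tasks, GPUs 0 and 1.
# --with_inst false because the base model was not trained with a chat template.
python pred_llama3.py \
    --model_path path/to/llama-3-chinese \
    --predict_on zh,en \
    --output_dir path/to/output_dir \
    --max_length 7680 \
    --gpus 0,1 \
    --use_flash_attention_2 \
    --with_inst false
```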
After the model completes its run, prediction files (in JSONL format) for the respective tasks will be generated under `${output_dir}/pred/` or `${output_dir}/pred_e/` (depending on whether you tested on LongBench-E, i.e., whether `--e` was used). To calculate the performance metrics, execute the following command:
```bash
python eval.py --output_dir ${output_dir}
```
If `--e` was used during prediction, pass the `--e` flag during evaluation as well:

```bash
python eval.py --output_dir ${output_dir} --e
```
The results will be stored in `${output_dir}/pred/result.json` or `${output_dir}/pred_e/result.json`. For example:
```json
{
    "lsht": 42.0,
    "multifieldqa_zh": 50.28,
    "passage_retrieval_zh": 89.5,
    "vcsum": 16.41,
    "dureader": 34.15
}
```
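To spot-check individual predictions, you can inspect the per-task JSONL files directly. A quick sketch, assuming the score against the official task names (the file name `dureader.jsonl` below is only an illustrative assumption; list the directory to see what was actually generated):

```bash
# List the generated prediction files, then pretty-print the
# first record of one task file (file name is hypothetical).
ls ${output_dir}/pred/
head -n 1 ${output_dir}/pred/dureader.jsonl | python -m json.tool
```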