longbench_en
LongBench is a multi-task, bilingual (Chinese and English) benchmark aimed at evaluating the long-text understanding capabilities of large language models, covering multiple long-text application scenarios. Below is an introduction to the LongBench evaluation method.
Configure your environment according to the official `requirements.txt`. The necessary dependencies are:

```
datasets
tqdm
rouge
jieba
fuzzywuzzy
einops
torch>=2.0.1
transformers>=4.40.0
```
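If you prefer to install the packages directly instead of using `requirements.txt`, a minimal setup sketch (version pins other than the `torch` and `transformers` minimums above are flexible):

```bash
# Install the dependencies listed above.
pip install datasets tqdm rouge jieba fuzzywuzzy einops \
    "torch>=2.0.1" "transformers>=4.40.0"

# Only needed if you plan to pass --use_flash_attention_2:
# Flash Attention 2 support in transformers requires the flash-attn package.
pip install flash-attn --no-build-isolation
```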
There is no need to download data separately. The prediction script will automatically download the required data from 🤗 Datasets.
Execute the following script for inference:
```bash
model_path=path/to/llama-3-chinese
output_dir=path/to/output_dir
data_class=zh
with_inst="auto"
max_length=7680

cd scripts/longbench
python pred_llama3.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --max_length ${max_length} \
    --use_flash_attention_2 \
    --with_inst ${with_inst}
```
- `--model_path ${model_path}`: Directory where the model to be evaluated is located (a complete Llama-3-Chinese or Llama-3-Chinese-Instruct model, not LoRA)
- `--predict_on ${data_class}`: Task sets to predict on; can be `en`, `zh`, `code`, or a comma-separated combination thereof, e.g., `en,zh,code`
- `--output_dir ${output_dir}`: Directory for storing the evaluation results
- `--max_length ${max_length}`: Maximum prompt length. Note that this length does not include the task-specific prompt
- `--gpus ${gpus}`: To run on specific GPUs, use this parameter, e.g., `0,1`
- `--e`: Predict on the LongBench-E dataset. Refer to the official LongBench documentation for a detailed explanation of LongBench-E
- `--with_inst ${with_inst}`: Whether to use the system prompt and template of Llama-3-Chinese-8B-Instruct when constructing the instructions:
  - `true`: use the system prompt and template on all tasks
  - `false`: use the system prompt and template on none of the tasks
  - `auto`: use the system prompt and template on some tasks (the default strategy of the official LongBench code)

  We suggest setting `--with_inst` to `auto` when testing Llama-3-Chinese-8B-Instruct, and to `false` when testing Llama-3-Chinese-8B; see the example after this list.
- `--use_flash_attention_2`: Use Flash Attention 2 for accelerated inference; otherwise SDPA is used
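For instance, a run of the base (non-Instruct) model on the Chinese and English task sets might look like the sketch below; the paths are placeholders and the GPU ids are illustrative:

```bash
# Hypothetical invocation: base model, zh+en tasks, GPUs 0 and 1.
# --with_inst false because the base model was not trained with a chat template.
python pred_llama3.py \
    --model_path path/to/llama-3-chinese \
    --predict_on zh,en \
    --output_dir path/to/output_dir \
    --max_length 7680 \
    --gpus 0,1 \
    --use_flash_attention_2 \
    --with_inst false
```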
After the model completes its run, prediction files (in JSONL format) for the respective tasks will be generated under `${output_dir}/pred/` or `${output_dir}/pred_e/` (depending on whether you tested on LongBench-E, i.e., whether `--e` was used). To calculate the performance metrics, execute the following command:
```bash
python eval.py --output_dir ${output_dir}
```
If `--e` was used during prediction, pass the `--e` flag during evaluation as well:

```bash
python eval.py --output_dir ${output_dir} --e
```
The results will be stored in `${output_dir}/pred/result.json` or `${output_dir}/pred_e/result.json`. For example:
```json
{
    "lsht": 42.0,
    "multifieldqa_zh": 50.28,
    "passage_retrieval_zh": 89.5,
    "vcsum": 16.41,
    "dureader": 34.15
}
```
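To spot-check individual predictions, you can inspect the per-task JSONL files directly. A quick sketch, assuming the score against the official task names (the file name `dureader.jsonl` below is only an illustrative assumption; list the directory to see what was actually generated):

```bash
# List the generated prediction files, then pretty-print the
# first record of one task file (file name is hypothetical).
ls ${output_dir}/pred/
head -n 1 ${output_dir}/pred/dureader.jsonl | python -m json.tool
```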