
Evaluation Problem #28

Open
LIP773 opened this issue Nov 22, 2024 · 1 comment

Comments

LIP773 commented Nov 22, 2024

Hi, based on your guidance, I trained my model on Qwen1.5-1.8B.
While running the evaluation, I noticed issues with the SQA and MMBench evaluations: the results are quite low, and the evaluation takes extremely long (12 hours or more). The problem seems to occur only when the log shows "Setting pad_token_id to eos_token_id:151643 for open-end generation."

Here is my evaluation script:

```bash
export IMP_SILIENT_OTHERS=true

gpu_list="${CUDA_VISIBLE_DEVICES:-0}"
IFS=',' read -ra GPULIST <<< "$gpu_list"

CHUNKS=${#GPULIST[@]}

SPLIT="llava_scienceqa"

MODEL_CKPT="imp-v1-2b-stage2-lora"
EVAL_CKPT="${MODEL_CKPT////_}_1"
MODEL_BASE=checkpoints/base/Qwen1.5-1.8B

for IDX in $(seq 0 $((CHUNKS-1))); do
    CUDA_VISIBLE_DEVICES=${GPULIST[$IDX]} python -m imp_llava.eval.model_vqa_science \
        --model-path ./checkpoints/$MODEL_CKPT \
        --model-base $MODEL_BASE \
        --question-file ./eval_dataset/scienceqa/llava_test_CQM-A.json \
        --image-folder ./eval_dataset/scienceqa/images/test \
        --answers-file ./eval_dataset/scienceqa/answers/$SPLIT/$EVAL_CKPT/${CHUNKS}_${IDX}.jsonl \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --temperature 0 \
        --conv-mode qwen2 &
done

wait

output_file=./eval_dataset/scienceqa/answers/$SPLIT/$EVAL_CKPT/merge.jsonl

# Clear out the output file if it exists.
> "$output_file"

# Loop through the indices and concatenate each file.
for IDX in $(seq 0 $((CHUNKS-1))); do
    cat ./eval_dataset/scienceqa/answers/$SPLIT/$EVAL_CKPT/${CHUNKS}_${IDX}.jsonl >> "$output_file"
done

python imp_llava/eval/eval_science_qa.py \
    --base-dir ./eval_dataset/scienceqa \
    --result-file $output_file \
    --output-file ./eval_dataset/scienceqa/answers/output.jsonl \
    --output-result ./eval_dataset/scienceqa/answers/result.json
```
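
A quick way to confirm whether the generations are actually running long is to look at the answer lengths in the merged file. This is only a sketch: it assumes the answers are stored under the `text` key, as in LLaVA-style answer files, and uses the merge path produced by the script above.

```python
# Sanity check (assumption: each line is a JSON object with the answer under "text",
# as in LLaVA-style answer files). Very long answers would explain both the slow
# evaluation and the low scores on multiple-choice benchmarks.
import json

merge_path = "eval_dataset/scienceqa/answers/llava_scienceqa/imp-v1-2b-stage2-lora_1/merge.jsonl"

lengths = []
with open(merge_path) as f:
    for line in f:
        ans = json.loads(line)
        lengths.append(len(ans["text"].split()))

print("answers:", len(lengths))
print("mean length (words):", sum(lengths) / max(len(lengths), 1))
print("longest answer (words):", max(lengths, default=0))
```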

Other than that, the rest of the benchmark evaluations seem normal. Have you encountered similar issues, and if so, how did you solve them?

@romrawinjp

I faced the same problem. The long inference time could be because `max_new_tokens` is set to 1024. I think the `stopping_criteria` gets confused when using a Qwen-family tokenizer, so generation keeps going until it hits `max_new_tokens`.
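
Something along these lines in the eval script should cut the generation time drastically. This is only a sketch based on the LLaVA-style `model.generate(...)` call, so the exact variable and argument names may differ in this repo:

```python
# Sketch only -- based on the LLaVA-style generate call; names may differ in imp_llava.
# SQA/MMBench answers are a single option letter, so a small max_new_tokens is enough,
# and passing pad_token_id explicitly silences the
# "Setting pad_token_id to eos_token_id:151643" warning.
output_ids = model.generate(
    input_ids,
    images=image_tensor.unsqueeze(0).half().cuda(),
    do_sample=False,                        # matches --temperature 0
    max_new_tokens=64,                      # instead of 1024
    eos_token_id=tokenizer.eos_token_id,    # 151643 for Qwen1.5
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
```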
