
Very low pass@1 #13

Closed · marianna13 opened this issue Jul 2, 2024 · 15 comments
@marianna13

Issue

Hey everyone,

I was trying to evaluate some models on BigCodeBench, but I get a very low pass@1 (far lower than what has been reported for this model) and this warning:

BigCodeBench-Complete-calibrated
Groundtruth pass rate: 0.000
Please be cautious!
pass@1: 0.033

For reproduction

I tried granite-3b-code-base in this setup, but the other models I tried (stablelm-1.6b, granite-8b-code-base) showed the same behavior.
For both Apptainer images I used the Docker images mentioned in this repo, both at the latest versions.

My evaluation command:

IMAGE="/p/scratch/ccstdl/marianna/bigcodebench-evaluate_latest.sif"
SUBSET="complete"
SAVE_PATH="/p/scratch/ccstdl/marianna/bigcodebench_results/ibm-granite/granite-3b-code-base_bigcodebench_complete_0.0_1_vllm-sanitized-calibrated.jsonl"

CMD="apptainer -v run --bind $CONTAINER_HOME:/app,/tmp $IMAGE \
    --subset $SUBSET \
    --max-data-limit 16000 \
    --samples $SAVE_PATH "

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD

My generation command:

IMAGE="/p/scratch/ccstdl/marianna/bigcodebench-generate_latest.sif"
MODEL="ibm-granite/granite-3b-code-base"
MODELS_DIR="/marianna/models/"
SUBSET="complete"
BS=1
TEMPERATURE=0.0
N_SAMPLES=1
NUM_GPUS=4
SAVE_DIR="/p/scratch/ccstdl/marianna/bigcodebench_results"
BACKEND="vllm"
SAVE_PATH="${SAVE_DIR}/${MODEL}_bigcodebench_${SUBSET}_${TEMPERATURE}_${N_SAMPLES}_${BACKEND}.jsonl"


CMD="apptainer -v run --nv --bind $(pwd):/app $IMAGE \
        --subset $SUBSET \
        --model $MODELS_DIR/$MODEL \
        --greedy \
        --temperature $TEMPERATURE \
        --n_samples $N_SAMPLES \
        --backend $BACKEND \
        --tp $NUM_GPUS \
        --trust_remote_code \
        --resume \
        --save_path $SAVE_PATH"

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD

Please let me know if it's an issue on my side or what I can do to solve it! Thanks in advance!

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Hi @marianna13, thanks for reporting the issue!

Could you check whether you had the same issue as the one mentioned in #8 (comment)? No one else has reported this issue yet, and I doubt it is due to broken Docker images.

Your ground-truth pass rate is 0%, which should not happen with a correct setup.

@marianna13
Author

Hey Terry, thanks for the quick response!
If you mean whether I had memory-related issues: I did, but they disappeared once I added --max-data-limit 16000 (I am referring to the ImportError with matplotlib that is also mentioned in the repo).

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Could you check the eval_results.json that should be generated by the Docker container? I'd like to see the detailed failures for some tasks to understand what happened.

If nothing else was raised there, it could be some other issue inside the environment.

@marianna13
Author

I uploaded eval_results.json for this run here. Thanks!

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Oh, it seems that your input file doesn't contain any proper generations. The solutions basically just repeat the complete prompt. Were there any issues during generation?
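
For illustration, a quick way to check whether a generated sample merely echoes the prompt — a sketch only; the "task_id" and "solution" field names are assumptions about the layout of the generation JSONL:

# Print the first record of the generations file so the solution can be
# compared against the original prompt (field names assumed, see above)
head -n 1 "$SAVE_PATH" | python -c 'import json, sys; d = json.loads(sys.stdin.read()); print(d.get("task_id")); print(d.get("solution"))'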

@terryyz terryyz self-assigned this Jul 2, 2024
@marianna13
Author

No, there's no error. I only see outputs like

Codegen: BigCodeBench_0 @ 
/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base

for all 1140 tasks.
I also have some warning messages:

Greedy decoding ON (--greedy): setting bs=1, n_samples=1, temperature=0
Rank: 0
Initializing a decoder model: /p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base ...
INFO 07-02 14:57:06 config.py:623] Defaulting to use mp for distributed inference
INFO 07-02 14:57:06 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base', speculative_config=None, tokenizer='/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base)
INFO 07-02 14:57:07 selector.py:164] Cannot use FlashAttention-2 backend for head size 80.
INFO 07-02 14:57:07 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=17985/17986/17987) INFO 07-02 14:57:10 selector.py:164] Cannot use FlashAttention-2 backend for head size 80.
(VllmWorkerProcess pid=17985/17986/17987) INFO 07-02 14:57:10 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=17985/17986/17987) INFO 07-02 14:57:11 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-02 14:57:11 utils.py:637] Found nccl from library libnccl.so.2
INFO 07-02 14:57:11 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-02 14:57:13 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /p/home/jusers/nezhurina1/jureca/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
INFO 07-02 14:57:13 selector.py:164] Cannot use FlashAttention-2 backend for head size 80.
INFO 07-02 14:57:13 selector.py:51] Using XFormers backend.
INFO 07-02 14:57:31 model_runner.py:160] Loading model weights took 1.6545 GB
INFO 07-02 14:57:34 distributed_gpu_executor.py:56] # GPU blocks: 25599, # CPU blocks: 3276
INFO 07-02 14:57:38 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-02 14:57:38 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-02 14:57:52 custom_all_reduce.py:267] Registering 2275 cuda graph addresses
INFO 07-02 14:57:52 model_runner.py:965] Graph capturing finished in 14 secs.
(each VllmWorkerProcess worker, pids 17985-17987, prints the same nccl, P2P cache, attention-backend, weight-loading, and CUDA graph messages)

But that's it, no other errors or warnings.

@marianna13
Author

Ah wait, I also found this error message (it was in the other log file):

Traceback (most recent call last):
  File "/Miniforge/envs/BigCodeBench/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_6636fb37'
Traceback (most recent call last):
  File "/Miniforge/envs/BigCodeBench/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_6636fb37'
Traceback (most recent call last):
  File "/Miniforge/envs/BigCodeBench/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_6636fb37'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

@terryyz
Collaborator

terryyz commented Jul 2, 2024

BTW, I forgot that you used ibm-granite/granite-3b-code-base. The StarCoder2 and Granite-Code models have training flaws (pages 34 & 35 of the paper): they cannot generate the code correctly unless the trailing newlines are stripped from the prompt. To fix this, you need to pass --strip_newlines, as handled in https://github.com/bigcode-project/bigcodebench/blob/a02256ff12cd8a30e9b87de5ebb5e7804010d228/bigcodebench/generate.py#L114C26-L114C42.
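
For illustration, a minimal sketch of re-running the generation with --strip_newlines added to the earlier Apptainer invocation (the flag position is arbitrary; whether --resume skips tasks already present in the old save file is an assumption, so deleting the previous .jsonl first may be safest):

# Same generation command as before, with --strip_newlines added
CMD="apptainer -v run --nv --bind $(pwd):/app $IMAGE \
        --subset $SUBSET \
        --model $MODELS_DIR/$MODEL \
        --greedy \
        --temperature $TEMPERATURE \
        --n_samples $N_SAMPLES \
        --backend $BACKEND \
        --tp $NUM_GPUS \
        --trust_remote_code \
        --strip_newlines \
        --resume \
        --save_path $SAVE_PATH"

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD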

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Not sure what happened on the CUDA side 🤔 Could you check whether you can generate successfully without Docker?

@terryyz
Collaborator

terryyz commented Jul 2, 2024

(quoting the resource_tracker traceback and the CUDA IPC warning from the previous comment)

BTW, I double-checked this log; it appears to be just a warning. I believe the ~0 pass rate is mainly due to the trailing newlines, so you can try running the generation again with the newlines stripped.
The 0% ground-truth pass rate may result from the earlier failed assertion when your memory was low. You can try removing the cached ~/.cache/bigcodebench and re-running after increasing the memory to see whether that works.
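
For illustration, a minimal sketch of clearing the cached ground-truth results and re-running the evaluation with more memory (the cache path follows the comment above; the --mem value is a placeholder, and the amount actually needed is an assumption about your cluster allocation):

# Remove the cached ground-truth evaluation results so they are recomputed
rm -rf ~/.cache/bigcodebench

# Re-run the evaluation command defined earlier with more memory for the job
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=64G $CMD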

@marianna13
Author

No, unfortunately with --strip_newlines it still just repeats the problem text.

@terryyz
Collaborator

terryyz commented Jul 2, 2024

And also without docker? 👀

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Oh wait, I noticed that not all of them have empty completions. My bad.

If you strip the newlines, I'd guess the pass rate should be higher?

Maybe I'm wrong and the Granite base model requires additional newlines rather than no trailing newlines. You can also check the generations to see if you get similar results.

@marianna13
Author

Without Docker I get some FlashAttention error; I guess something is wrong with my environment, so I will try a clean conda env.

@marianna13
Author

I tried again with more memory + --strip_newlines and it seems to work! I got 19.9 for ibm-granite/granite-3b-code-base (the leaderboard says 20). Thank you very much for your help! Closing this issue.
