
Very low pass@1 #13

Closed · marianna13 opened this issue Jul 2, 2024 · 15 comments
@marianna13

Issue

Hey everyone,

I was trying to evaluate some models on BigCodeBench, but I get a very low pass@1 (far lower than what has been reported for this model) and this warning:

BigCodeBench-Complete-calibrated
Groundtruth pass rate: 0.000
Please be cautious!
pass@1: 0.033

For reproduction

I tried granite-3b-code-base in this setup, but the other models I tried (stablelm-1.6b, granite-8b-code-base) showed the same behavior.
For both Apptainer images I used the Docker images mentioned in this repo, both at the latest versions.

My evaluation command:

IMAGE="/p/scratch/ccstdl/marianna/bigcodebench-evaluate_latest.sif"
SUBSET="complete"
SAVE_PATH="/p/scratch/ccstdl/marianna/bigcodebench_results/ibm-granite/granite-3b-code-base_bigcodebench_complete_0.0_1_vllm-sanitized-calibrated.jsonl"

CMD="apptainer -v run --bind $CONTAINER_HOME:/app,/tmp $IMAGE \
    --subset $SUBSET \
    --max-data-limit 16000 \
    --samples $SAVE_PATH "

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD

My generation command:

IMAGE="/p/scratch/ccstdl/marianna/bigcodebench-generate_latest.sif"
MODEL="ibm-granite/granite-3b-code-base"
MODELS_DIR="/marianna/models/"
SUBSET="complete"
BS=1
TEMPERATURE=0.0
N_SAMPLES=1
NUM_GPUS=4
SAVE_DIR="/p/scratch/ccstdl/marianna/bigcodebench_results"
BACKEND="vllm"
SAVE_PATH="${SAVE_DIR}/${MODEL}_bigcodebench_${SUBSET}_${TEMPERATURE}_${N_SAMPLES}_${BACKEND}.jsonl"


CMD="apptainer -v run --nv --bind $(pwd):/app $IMAGE \
        --subset $SUBSET \
        --model $MODELS_DIR/$MODEL \
        --greedy \
        --temperature $TEMPERATURE \
        --n_samples $N_SAMPLES \
        --backend $BACKEND \
        --tp $NUM_GPUS \
        --trust_remote_code \
        --resume \
        --save_path $SAVE_PATH"

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD

Please let me know if it's an issue on my side or what I can do to solve it! Thanks in advance!

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Hi @marianna13, thanks for reporting the issue!

Could you check whether you had the same issue as the one mentioned in #8 (comment)? No one else has reported this issue yet, and I doubt it is due to broken Docker images.

Your ground-truth pass rate is 0%, which should not happen with a correct setup.

@marianna13
Author

Hey Terry, thanks for the quick response!
If you mean whether I had memory-related issues: I did, but they disappeared once I added --max-data-limit 16000 (I am referring to the ImportError with matplotlib that is also mentioned in the repo).

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Could you check the eval_results.json that should be generated by the Docker container? I'd like to see the detailed failures for some tasks to understand what happened.

If nothing else was raised there, it could be some other issue inside the environment.

@marianna13
Author

I uploaded eval_results.json for this run here. Thanks!

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Oh, it seems that your input file doesn't contain any proper generations. The solutions basically just repeat the complete prompt. Were there any issues during generation?
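
For illustration, a quick way to check whether a generated sample merely echoes the prompt — a sketch only; the "task_id" and "solution" field names are assumptions about the layout of the generation JSONL:

# Print the first record of the generations file so the solution can be
# compared against the original prompt (field names assumed, see above)
head -n 1 "$SAVE_PATH" | python -c 'import json, sys; d = json.loads(sys.stdin.read()); print(d.get("task_id")); print(d.get("solution"))'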

@terryyz terryyz self-assigned this Jul 2, 2024
@marianna13
Author

No, there's no error. I only see outputs like

Codegen: BigCodeBench_0 @ 
/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base

for all 1140 tasks.
I also have some warning messages:

Greedy decoding ON (--greedy): setting bs=1, n_samples=1, temperature=0
Rank: 0
Initializing a decoder model: /p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base ...
INFO 07-02 14:57:06 config.py:623] Defaulting to use mp for distributed inference
INFO 07-02 14:57:06 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base', speculative_config=None, tokenizer='/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/p/data1/mmlaion/marianna/models//ibm-granite/granite-3b-code-base)
INFO 07-02 14:57:07 selector.py:164] Cannot use FlashAttention-2 backend for head size 80.
INFO 07-02 14:57:07 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=17985/17986/17987) INFO 07-02 14:57:10 selector.py:164] Cannot use FlashAttention-2 backend for head size 80.
(VllmWorkerProcess pid=17985/17986/17987) INFO 07-02 14:57:10 selector.py:51] Using XFormers backend.
(VllmWorkerProcess pid=17985/17986/17987) INFO 07-02 14:57:11 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 07-02 14:57:11 utils.py:637] Found nccl from library libnccl.so.2
INFO 07-02 14:57:11 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 07-02 14:57:13 custom_all_reduce_utils.py:179] reading GPU P2P access cache from /p/home/jusers/nezhurina1/jureca/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
INFO 07-02 14:57:13 selector.py:164] Cannot use FlashAttention-2 backend for head size 80.
INFO 07-02 14:57:13 selector.py:51] Using XFormers backend.
INFO 07-02 14:57:31 model_runner.py:160] Loading model weights took 1.6545 GB
INFO 07-02 14:57:34 distributed_gpu_executor.py:56] # GPU blocks: 25599, # CPU blocks: 3276
INFO 07-02 14:57:38 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-02 14:57:38 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-02 14:57:52 custom_all_reduce.py:267] Registering 2275 cuda graph addresses
INFO 07-02 14:57:52 model_runner.py:965] Graph capturing finished in 14 secs.
(each VllmWorkerProcess worker, pids 17985-17987, prints the same nccl, P2P cache, attention-backend, weight-loading, and CUDA graph messages)

But that's it, no other errors or warnings.

@marianna13
Author

Ah wait, I also found this error message (it was in the other log file):

Traceback (most recent call last):
  File "/Miniforge/envs/BigCodeBench/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_6636fb37'
Traceback (most recent call last):
  File "/Miniforge/envs/BigCodeBench/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_6636fb37'
Traceback (most recent call last):
  File "/Miniforge/envs/BigCodeBench/lib/python3.11/multiprocessing/resource_tracker.py", line 239, in main
    cache[rtype].remove(name)
KeyError: '/psm_6636fb37'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

@terryyz
Collaborator

terryyz commented Jul 2, 2024

BTW, I forgot that you used ibm-granite/granite-3b-code-base. The StarCoder2 and Granite-Code models have training flaws (pages 34 & 35 of the paper): they cannot generate the code correctly unless the trailing newlines are stripped from the prompt. To fix this, you need to pass --strip_newlines, as handled in https://github.com/bigcode-project/bigcodebench/blob/a02256ff12cd8a30e9b87de5ebb5e7804010d228/bigcodebench/generate.py#L114C26-L114C42.
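
For illustration, a minimal sketch of re-running the generation with --strip_newlines added to the earlier Apptainer invocation (the flag position is arbitrary; whether --resume skips tasks already present in the old save file is an assumption, so deleting the previous .jsonl first may be safest):

# Same generation command as before, with --strip_newlines added
CMD="apptainer -v run --nv --bind $(pwd):/app $IMAGE \
        --subset $SUBSET \
        --model $MODELS_DIR/$MODEL \
        --greedy \
        --temperature $TEMPERATURE \
        --n_samples $N_SAMPLES \
        --backend $BACKEND \
        --tp $NUM_GPUS \
        --trust_remote_code \
        --strip_newlines \
        --resume \
        --save_path $SAVE_PATH"

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD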

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Not sure what happened on the CUDA side 🤔 Could you check whether you can generate successfully without Docker?

@terryyz
Collaborator

terryyz commented Jul 2, 2024

(quoting the resource_tracker traceback and the CUDA IPC warning from the previous comment)

BTW, I double-checked this log; it appears to be just a warning. I believe the ~0 pass rate is mainly due to the trailing newlines, so you can try running the generation again with the newlines stripped.
The 0% ground-truth pass rate may result from the earlier failed assertion when your memory was low. You can try removing the cached ~/.cache/bigcodebench and re-running after increasing the memory to see whether that works.
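
For illustration, a minimal sketch of clearing the cached ground-truth results and re-running the evaluation with more memory (the cache path follows the comment above; the --mem value is a placeholder, and the amount actually needed is an assumption about your cluster allocation):

# Remove the cached ground-truth evaluation results so they are recomputed
rm -rf ~/.cache/bigcodebench

# Re-run the evaluation command defined earlier with more memory for the job
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=64G $CMD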

@marianna13
Author

No, unfortunately with --strip_newlines it still just repeats the problem text.

@terryyz
Collaborator

terryyz commented Jul 2, 2024

And also without docker? 👀

@terryyz
Collaborator

terryyz commented Jul 2, 2024

Oh wait, I noticed that not all of them have empty completions. My bad.

If you strip the newlines, I'd guess the pass rate should be higher?

Maybe I'm wrong and the Granite base model requires additional newlines rather than no trailing newlines. You can also check the generations to see if you get similar results.

@marianna13
Author

Without Docker I get some FlashAttention error; I guess something is wrong with my environment, so I will try a clean conda env.

@marianna13
Author

I tried again with more memory + --strip_newlines and it seems to work! I got 19.9 for ibm-granite/granite-3b-code-base (the leaderboard says 20). Thank you very much for your help! Closing this issue.
