
Why tokenizer.pad_token == args.pad (i.e., 50256)?? #57

Open
9yte opened this issue May 2, 2023 · 0 comments
9yte commented May 2, 2023

Hi,

For my project, I'm trying to fine-tune CodeGen models on my dataset and evaluate the resulting fine-tuned model on the HumanEval benchmark. I have a few questions I'd appreciate your help with.

  1. First, why does the sampling code, at line 234, have `tokenizer.pad_token == args.pad`, where `args.pad` is 50256? Shouldn't `pad_token` be set to `eos_token` (the token string), not 50256 (which is the `eos_token_id`)? I'm confused by this. At line 240 you also pass the parameter `pad_token_id=args.pad`, so in your sampling code both `pad_token` and `pad_token_id` end up as 50256 (see the sketch after this list for the distinction I mean). Could you please elaborate on this? That would be super helpful.

  2. As a baseline, I need to replicate your single-turn HumanEval benchmark results, but unfortunately I'm getting surprisingly low results compared to what is reported in the paper, and I'm almost certain I'm missing something. To reproduce the Table 1 results in the paper, did you use exactly the same sampling procedure as `sample.py`?
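
To illustrate what I mean in question 1, here is a minimal sketch of the `pad_token` / `pad_token_id` distinction in the HuggingFace `transformers` API. The checkpoint name is my own choice for illustration, not taken from `sample.py`:

```python
from transformers import AutoTokenizer

# Any CodeGen checkpoint works here; this one is an assumption for the example.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

# pad_token holds the token *string*; pad_token_id holds its integer id.
# CodeGen reuses the GPT-2 BPE vocabulary, where <|endoftext|> has id 50256.
tokenizer.pad_token = tokenizer.eos_token  # assigns the string "<|endoftext|>"
print(tokenizer.pad_token)     # "<|endoftext|>"
print(tokenizer.pad_token_id)  # 50256, resolved automatically from the string

# By contrast, assigning the raw id to pad_token would store an integer where
# a string is expected, even though 50256 happens to be the eos_token_id:
# tokenizer.pad_token = 50256  # an int, not the "<|endoftext|>" string
```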

Thanks a lot for your time.
