fix: Multiple-E dataset fix go_test.go path for test execution #225

Open · wants to merge 1 commit into main
Conversation

@hitesh-1997 commented on Apr 20, 2024

Context, along with a reproducer, is described in the issue:
Resolves #224

When I checked the different values the language field can take in the dataset for the languages supported in multiple.py, it seems all of them match the language name except go (adding the repro and screenshot below for this statement).

from datasets import load_dataset

LANGUAGES = ["py", "sh", "cpp", "cs", "d", "go", "java", "js", "jl", "lua", "pl", "php", "r", "rkt", "rb", "rs", "scala", "swift", "ts"]
for lang in LANGUAGES:
    data = load_dataset('nuprl/MultiPL-E', f'humaneval-{lang}', split='test', revision="d23b094346c5dbda1080a74bb2a24c18adbf7409")
    print(f"languages in MultiPL-E for {lang}: {set(dt['language'] for dt in data)}")
[screenshot: printed output of the snippet above]

I tried to trace the flow of the problem['language'] field that this PR changes, to make sure it doesn't affect any other language. It seems the field is only used in containerized_eval.py to decide which evaluation script to execute and which file extension to use, so it shouldn't affect other languages (a rough sketch below shows the dispatch I mean).
I guess the additional fields and the separate handling of go_test.go were added for different dataset revisions?
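For clarity, here is a minimal sketch of that dispatch; the names EVAL_SCRIPTS, EXTENSIONS, and dispatch are my own illustration, not the actual containerized_eval.py code:

# Illustrative names only, not the real containerized_eval.py API.
# The point is that problem['language'] is used purely as a lookup key
# for the evaluation entry point and the file extension.
EVAL_SCRIPTS = {"py": "eval_python.py", "go": "eval_go.py"}
EXTENSIONS = {"py": ".py", "go": ".go"}

def dispatch(problem: dict) -> tuple[str, str]:
    lang = problem["language"]
    return EVAL_SCRIPTS[lang], EXTENSIONS[lang]

This is only meant to show why normalizing the go value fixes evaluation for go without touching any other language.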

Please let me know if the flow looks good.

Thanks !!

@arjunguha (Contributor) commented:

I haven't run this, but this looks accurate. It reflects the structure of the original MultiPL-E code.

@hitesh-1997 (Author) replied:

> I haven't run this, but this looks accurate. It reflects the structure of the original MultiPL-E code.

Thanks for the response !!
Please let me know if I should add any additional checks or sanity tests to verify the changes.
Also, please let me know if I should tag someone else to take a look at the PR :)

@loubnabnl (Collaborator) left a comment:
LGTM, thanks for fixing this!

@arjunguha I think you rebuilt the MultiPL-E docker images with some recent fixes, but we never published them. Do you want to publish new images with this fix too?

@hitesh-1997 (Author) commented:

Thanks a lot for your response !!
@arjunguha @loubnabnl I have some additional questions about the way we are post-processing the outputs. It would really help if you could take some time to answer them :)

I see that we are using the stop tokens present in the dataset to extract the final generated code. I noticed a few cases:

  • If the LLM outputs a helper function such as isPrime, the test fails, but in reality just including that function would make the test pass. I see similar outputs in the bigcode/MultiPL-E-completions dataset as well. Should we update the post-processing to handle such cases?
  • If the LLM uses additional imports, such as strings in Go or Counter in Python, to complete the function, those imports are not included in the final completion code, leading to test failures.
  • Finally, some languages like Ruby have very generic stop tokens such as \n\n. So if the LLM outputs one extra blank line inside the function body, the completion is truncated in the wrong place (a small sketch of this follows the list). After removing such tokens, I find the pass rate matches for GPT-4. Just wondering whether this should be updated, or whether the generations could be wrong on my end.
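To make the last point concrete, here is a minimal illustration of the stop-token truncation I mean (my own sketch, not the harness's actual code):

# My own illustration of stop-token truncation, not the harness's code.
def truncate(completion: str, stop_tokens: list[str]) -> str:
    cut = len(completion)
    for tok in stop_tokens:
        idx = completion.find(tok)
        if idx != -1:
            cut = min(cut, idx)  # cut at the earliest stop token
    return completion[:cut]

# A blank line inside a Ruby method body trips the "\n\n" stop token,
# so the rest of the body (and the closing "end") is dropped:
ruby_body = "  if n < 2\n    return false\n  end\n\n  (2...n).none? { |i| n % i == 0 }\nend"
print(truncate(ruby_body, ["\n\n"]))  # prints only the first three lines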

Successfully merging this pull request may close these issues:
Multiple-E Go test file name suffix does not contain _test.go (#224)