Fork with more benchmarks and features: Merge some of them? #7
matthiasgeihs
started this conversation in
Show and tell
Replies: 1 comment
-
If canonical solutions are already in the training dataset, the results should be invalidated. That seems like a big problem for reproducibility :(
-
Hey @abacaj, I really like your repository. I always found it a bit puzzling how these benchmark results were created, because pre- and post-processing do play a role and are often not clearly documented. Your repository solves that!
I am the maintainer of the fork torusresearch/code-eval. There we added some things:
validate.py
which analyzes this overlap and added a corresponding column in the table.
I am happy to contribute (a subset of) these features to this repository. Let me know.
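To give a rough idea of what an overlap check of this kind could look like, here is a minimal sketch. It is not the actual `validate.py`. It assumes that "overlap" means model completions that are identical to a benchmark's canonical solution once whitespace is normalized, and the names `normalize` and `overlap_rate` are hypothetical.

```python
import re


def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", code).strip()


def overlap_rate(completions: dict[str, str], canonical: dict[str, str]) -> float:
    """Fraction of shared task ids whose completion equals the canonical solution.

    Assumption: both arguments map a task id to source text. Only task ids
    present in both mappings are compared.
    """
    shared = completions.keys() & canonical.keys()
    if not shared:
        return 0.0
    matches = sum(
        normalize(completions[task]) == normalize(canonical[task]) for task in shared
    )
    return matches / len(shared)
```

A high rate from a check like this would only hint that canonical solutions may have been seen during training. Exact matching misses renamed variables and reordered statements, so a real analysis would likely need a fuzzier comparison.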