Fork with more benchmarks and features: Merge some of them? #7

matthiasgeihs · 2023-07-13T07:35:03Z

Hey @abacaj, I really like your repository. I always found it a bit puzzling how these benchmark results are produced, because pre- and post-processing do play a role and are often not clearly documented. Your repository solves that!

I am the maintainer of the fork torusresearch/code-eval. There we added a few things:

- New models: I've been tinkering with finetuning Replit Code 3B for HumanEval performance. My best model, matorus/replit-coder, now outperforms Teknium's 3B finetuned model (28.7% vs 25.8%).
- Results for MPT 30B: I ran the evaluation for MPT 30B and added the results.
- Autoresume functionality: I modified the evaluation to write each result immediately and to resume automatically if an evaluation is interrupted.
- Overlap validation: When looking at the generated completions, I noticed that for some models there seems to be a significant overlap between completions and canonical solutions from the training dataset. I added a script, validate.py, which analyzes this overlap, and added a corresponding column to the results table.
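
To illustrate the autoresume idea, here is a minimal sketch. It is not the code from the fork: the JSON Lines results file, the `task_id` and `completion` field names, and the `generate` callback are all assumptions made for the example.

```python
import json
import os


def load_done(path):
    """Return the set of task ids already recorded in the results file."""
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["task_id"])
    return done


def run_eval(tasks, generate, path):
    """Generate one completion per task and append each result right away.

    `tasks` maps a task id to its prompt; `generate` is a hypothetical
    callable that turns a prompt into a completion string.
    """
    done = load_done(path)
    with open(path, "a") as f:
        for task_id, prompt in tasks.items():
            if task_id in done:
                continue  # already finished in an earlier run
            record = {"task_id": task_id, "completion": generate(prompt)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push the record out before starting the next task
    return load_done(path)
```

Calling `run_eval` again with the same `path` after an interruption skips every task that already has a record. A line left half-written by a hard kill is not handled in this sketch.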
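
For the overlap check, one possible approach is sketched below. This is not necessarily what validate.py does: the similarity measure (`difflib.SequenceMatcher`), the whitespace normalization, and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(code):
    """Strip each line and drop blank lines so formatting alone does not hide a match."""
    return "\n".join(line.strip() for line in code.splitlines() if line.strip())


def similarity(completion, canonical):
    """Character-level similarity in [0, 1] between two normalized code snippets."""
    return SequenceMatcher(None, normalize(completion), normalize(canonical)).ratio()


def overlap_rate(completions, canonicals, threshold=0.9):
    """Fraction of shared tasks whose completion is at least `threshold` similar
    to the canonical solution. Both arguments map a task id to source code."""
    shared = [task_id for task_id in completions if task_id in canonicals]
    if not shared:
        return 0.0
    hits = sum(
        similarity(completions[task_id], canonicals[task_id]) >= threshold
        for task_id in shared
    )
    return hits / len(shared)
```

A high rate does not prove contamination on its own, since short problems often have only one natural solution, but it flags models whose completions deserve a closer look.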

I am happy to contribute (a subset of) these features to this repository. Let me know.

kwikiel · 2023-07-14T12:04:11Z

If canonical solutions are already in the training dataset, the results should be invalidated. That seems like a big problem for reproducibility :(
