> Please use WildCode with caution. Unlike [EvalPlus](https://github.com/evalplus/evalplus), WildCode runs generated code in a much less constrained execution environment in order to support tasks with diverse library dependencies, which can pose security risks. We recommend running the evaluation inside a sandbox such as [Docker](https://docs.docker.com/get-docker/).
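
A minimal sandboxing sketch, assuming only that Docker and a stock Python base image are available (this is not the project's official evaluation image): mount the working directory into a throwaway container and run the generation and evaluation commands from inside it, so that executed task code cannot touch the rest of the host.

```shell
# Generic sandbox sketch -- `python:3.10` is an assumed base image, not an
# official WildCode/BigCodeBench image. Only the mounted working directory
# is visible to code executed inside the container.
docker run -it --rm \
  -v "$(pwd)":/workspace \
  -w /workspace \
  python:3.10 bash
```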
BigCodeBench is a rigorous benchmark for code generation with realistic constraints in the wild. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more fine-grained descriptions and diverse tool use.
### WildCode
To facilitate the evaluation of LLMs on BigCodeBench, we provide a Python package `bigcodebench` that includes the dataset, generation scripts, and evaluation scripts. The package is built on top of the [EvalPlus](https://github.com/evalplus/evalplus) framework, which is a flexible and extensible evaluation framework for code generation tasks.
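
As a minimal setup sketch, assuming the package is published on PyPI under the name shown above (otherwise install it from the repository source):

```shell
# Assumes `bigcodebench` is available on PyPI; adjust if installing from source.
pip install --upgrade bigcodebench
```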
### Why WildCode?
WildCode is a rigorous evaluation framework for LLM4Code, with:
* ✨ **Precise evaluation & ranking**: See [our leaderboard](https://bigcodebench.github.io/leaderboard.html) for the latest LLM rankings before and after rigorous evaluation.
* ✨ **Pre-generated samples**: WildCode accelerates code intelligence research by open-sourcing [LLM-generated samples](#-LLM-generated-code) for various models -- no need to re-run the expensive benchmarks!
### Main Differences from EvalPlus
We inherit the design of the EvalPlus framework, but WildCode differs from it in the following ways:
* Execution Environment: The execution environment in WildCode is less constrained than EvalPlus's, in order to support tasks with diverse library dependencies.
* Test Evaluation: WildCode relies on `unittest` for evaluating the generated code, which is more suitable for the test harness in BigCodeBench.

To generate code samples from a model, you can use the following command:
```shell
bigcodebench.generate \
--model [model_name] \
--dataset [bigcodebench] \
--nl2code [False|True] \
--greedy \
--bs [bs] \
... \
--tp [gpu_number]
```
The generated code samples will be stored in a file named `[model_name]--bigcodebench-[nl2c|c2c]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:

[...]

```shell
# ...
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
```
<details><summary>🔎 Checking the compilability of post-processed code<i>:: click to expand ::</i></summary>
<div>
To double-check the post-processing results, you can use `bigcodebench.syncheck` to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:
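
The exact flags are not shown in this excerpt; as a rough sketch, assuming `syncheck` mirrors the `--dataset` option of `bigcodebench.generate` and accepts a path to the (sanitized) samples, an invocation might look like the following. Consult `bigcodebench.syncheck --help` for the actual interface.

```shell
# Hypothetical invocation -- the flag names below are assumptions, not the
# documented interface; check `bigcodebench.syncheck --help`.
bigcodebench.syncheck \
  --samples /path/to/vicuna-[??]b_temp_[??]-sanitized \
  --dataset bigcodebench
```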