Commit 8fd29ee
init: rename the pack
1 parent b662c6b

35 files changed (+166 -189)

.dockerignore (+1 -1)

@@ -165,5 +165,5 @@ cython_debug/
 backup/
 passrate.p*
 min_cov_dir/
-wildcode/_version.py
+bigcodebench/_version.py
 inspect/

.github/ISSUE_TEMPLATE/buggy_contract.yml (+5 -5)

@@ -6,29 +6,29 @@ body:
 - type: input
 id: version
 attributes:
-label: "WildCodeBench version"
-description: What is the version of WildCodeBench? You can find it by running `pip show wildcode`.
+label: "BigCodeBench version"
+description: What is the version of BigCodeBench? You can find it by running `pip show bigcodebench`.
 placeholder: For example, 0.1.0
 validations:
 required: true
 - type: input
 id: cache
 attributes:
-label: "Output of running `ls ~/.cache/wildcode`"
+label: "Output of running `ls ~/.cache/bigcodebench`"
 validations:
 required: true
 - type: input
 id: task_id
 attributes:
 label: "Task ID of the programming task"
-placeholder: WildCodeBench/[??]
+placeholder: BigCodeBench/[??]
 validations:
 required: true
 - type: textarea
 id: original
 attributes:
 label: "The original wrong contract"
-description: You can run `python -c "from wildcode.data import get_wildcodebench print(get_wildcodebench['WildCodeBench/❓']['contract'])"`
+description: You can run `python -c "from bigcodebench.data import get_bigcodebench print(get_bigcodebench['BigCodeBench/❓']['contract'])"`
 render: python
 validations:
 required: true

.github/ISSUE_TEMPLATE/buggy_test.yml (+2 -2)

@@ -7,14 +7,14 @@ body:
 id: version
 attributes:
 label: "EvalPlus version"
-description: What is the version of EvalPlus? You can find it by running `pip show wildcode`.
+description: What is the version of EvalPlus? You can find it by running `pip show bigcodebench`.
 placeholder: For example, 0.1.0
 validations:
 required: true
 - type: input
 id: cache
 attributes:
-label: "Output of running `ls ~/.cache/wildcode`"
+label: "Output of running `ls ~/.cache/bigcodebench`"
 validations:
 required: true
 - type: input

.gitignore (+1 -1)

@@ -166,7 +166,7 @@ OpenPlus/
 backup/
 passrate.p*
 min_cov_dir/
-wildcode/_version.py
+bigcodebench/_version.py
 *.jsonl
 inspect/
 *.zip

CITATION.cff (+2 -2)

@@ -1,8 +1,8 @@
 cff-version: 1.2.0
 message: "If you use this work and love it, consider citing it as below \U0001F917"
-title: WildCodeBench
+title: BigCodeBench
 authors:
-- family-names: WildCodeBench Team
+- family-names: BigCodeBench Team
 url: https://github.com/bigcode-project/open-eval
 doi:
 date-released:

Docker/Evaluate.Dockerfile (+9 -9)

@@ -7,20 +7,20 @@ RUN apt-get update && apt-get install -y git g++ python3-tk zip unzip procps r-b
 # upgrade to latest pip
 RUN pip install --upgrade pip

-# Add a new user "wildcodeuser"
-RUN adduser --disabled-password --gecos "" wildcodeuser
+# Add a new user "bigcodebenchuser"
+RUN adduser --disabled-password --gecos "" bigcodebenchuser

 # Acquire benchmark code to local
-RUN git clone https://github.com/bigcode-project/code-eval.git /wildcode
+RUN git clone https://github.com/bigcode-project/code-eval.git /bigcodebench

-RUN cd /wildcode && pip install . && pip install -U -I -r https://raw.githubusercontent.com/bigcode-project/wildcodebench-annotation/main/requirements.txt
+RUN cd /bigcodebench && pip install . && pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench-annotation/main/requirements.txt

 # Pre-install the dataset
-RUN python3 -c "from wildcode.data import get_wildcodebench; get_wildcodebench()"
+RUN python3 -c "from bigcodebench.data import get_bigcodebench; get_bigcodebench()"

-RUN chown -R wildcodeuser:wildcodeuser /wildcode
-USER wildcodeuser
+RUN chown -R bigcodebenchuser:bigcodebenchuser /bigcodebench
+USER bigcodebenchuser

-WORKDIR /wildcode
+WORKDIR /bigcodebench

-ENTRYPOINT ["python3", "-m", "wildcode.evaluate"]
+ENTRYPOINT ["python3", "-m", "bigcodebench.evaluate"]
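The renamed "Pre-install the dataset" step bakes the benchmark download into the image so the container does not fetch it on first run. A minimal sketch of the same warm-up outside Docker, assuming only that `get_bigcodebench()` downloads and caches the dataset (under `~/.cache/bigcodebench`, the path the issue templates ask about) and returns a dict keyed by task ID:

```python
# Sketch: warm the BigCodeBench cache, mirroring the Dockerfile's
# "Pre-install the dataset" step. Assumes get_bigcodebench() downloads
# and caches the benchmark, returning a dict keyed by task ID
# (e.g. "BigCodeBench/0").
from bigcodebench.data import get_bigcodebench

dataset = get_bigcodebench()
print(f"Cached {len(dataset)} tasks; see ~/.cache/bigcodebench")
```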

Docker/Generate_Cuda11.Dockerfile (+5 -5)

@@ -119,20 +119,20 @@ RUN source /Miniforge/etc/profile.d/conda.sh \
 && pip install flash-attn==2.5.8 --no-build-isolation

 # Acquire benchmark code to local
-RUN git clone https://github.com/bigcode-project/code-eval.git /wildcode
+RUN git clone https://github.com/bigcode-project/code-eval.git /bigcodebench

 # Install Code-Eval and pre-load the dataset
 RUN source /Miniforge/etc/profile.d/conda.sh \
 && source /Miniforge/etc/profile.d/mamba.sh \
 && mamba activate Code-Eval \
-&& pip install wild-code --upgrade \
-&& python -c "from wildcode.data import get_wildcodebench; get_wildcodebench()"
+&& pip install bigcodebench --upgrade \
+&& python -c "from bigcodebench.data import get_bigcodebench; get_bigcodebench()"

-WORKDIR /wildcode
+WORKDIR /bigcodebench

 # Declare an argument for the huggingface token
 ARG HF_TOKEN
 RUN if [[ -n "$HF_TOKEN" ]] ; then /Miniforge/envs/Code-Eval/bin/huggingface-cli login --token $HF_TOKEN ; \
 else echo "No HuggingFace token specified. Access to gated or private models will be unavailable." ; fi

-ENTRYPOINT ["/Miniforge/envs/Code-Eval/bin/python", "-m", "wildcode.generate"]
+ENTRYPOINT ["/Miniforge/envs/Code-Eval/bin/python", "-m", "bigcodebench.generate"]

Docker/Generate_Cuda12.Dockerfile (+5 -5)

@@ -119,20 +119,20 @@ RUN source /Miniforge/etc/profile.d/conda.sh \
 && pip install flash-attn==2.5.8 --no-build-isolation

 # Acquire benchmark code to local
-RUN git clone https://github.com/bigcode-project/code-eval.git /wildcode
+RUN git clone https://github.com/bigcode-project/code-eval.git /bigcodebench

 # Install Code-Eval and pre-load the dataset
 RUN source /Miniforge/etc/profile.d/conda.sh \
 && source /Miniforge/etc/profile.d/mamba.sh \
 && mamba activate Code-Eval \
-&& pip install wild-code --upgrade \
-&& python -c "from wildcode.data import get_wildcodebench; get_wildcodebench()"
+&& pip install bigcodebench --upgrade \
+&& python -c "from bigcodebench.data import get_bigcodebench; get_bigcodebench()"

-WORKDIR /wildcode
+WORKDIR /bigcodebench

 # Declare an argument for the huggingface token
 ARG HF_TOKEN
 RUN if [[ -n "$HF_TOKEN" ]] ; then /Miniforge/envs/Code-Eval/bin/huggingface-cli login --token $HF_TOKEN ; \
 else echo "No HuggingFace token specified. Access to gated or private models will be unavailable." ; fi

-ENTRYPOINT ["/Miniforge/envs/Code-Eval/bin/python", "-m", "wildcode.generate"]
+ENTRYPOINT ["/Miniforge/envs/Code-Eval/bin/python", "-m", "bigcodebench.generate"]

MANIFEST.in (+1 -1)

@@ -1 +1 @@
-exclude wildcode/_experimental/**/*.py
+exclude bigcodebench/_experimental/**/*.py

README.md (+35 -35)

@@ -7,9 +7,9 @@
 > Please use WildCode with caution. Different from [EvalPlus](https://github.com/evalplus/evalplus), WildCode has a much less constrained execution environment to support tasks with diverse library dependencies. This may lead to security risks. We recommend using a sandbox such as [Docker](https://docs.docker.com/get-docker/) to run the evaluation.

 <p align="center">
-<a href="https://pypi.org/project/wild-code/"><img src="https://img.shields.io/pypi/v/wild-code?color=g"></a>
-<a href="https://hub.docker.com/r/terryzho/wildcode" title="Docker"><img src="https://img.shields.io/docker/image-size/terryzho/wildcode"></a>
-<a href="https://github.com/evalplus/evalplus/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/wild-code"></a>
+<a href="https://pypi.org/project/bigcodebench/"><img src="https://img.shields.io/pypi/v/bigcodebench?color=g"></a>
+<a href="https://hub.docker.com/r/terryzho/bigcodebench" title="Docker"><img src="https://img.shields.io/docker/image-size/terryzho/bigcodebench"></a>
+<a href="https://github.com/evalplus/evalplus/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/bigcodebench"></a>
 </p>

 <p align="center">

@@ -24,45 +24,45 @@

 ## About

-### WildCodeBench
+### BigCodeBench

-WildCodeBench is a rigorous benchmark for code generation with realistic constraints in the wild. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more fine-grained descriptions and diverse tool use.
+BigCodeBench is a rigorous benchmark for code generation with realistic constraints in the wild. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more fine-grained descriptions and diverse tool use.

 ### WildCode

-To facilitate the evaluation of LLMs on WildCodeBench, we provide a Python package `wild-code` that includes the dataset, generation scripts, and evaluation scripts. The package is built on top of the [EvalPlus](https://github.com/evalplus/evalplus) framework, which is a flexible and extensible evaluation framework for code generation tasks.
+To facilitate the evaluation of LLMs on BigCodeBench, we provide a Python package `bigcodebench` that includes the dataset, generation scripts, and evaluation scripts. The package is built on top of the [EvalPlus](https://github.com/evalplus/evalplus) framework, which is a flexible and extensible evaluation framework for code generation tasks.

 ### Why WildCode?

 WildCode is a rigorous evaluation framework for LLM4Code, with:

-* **Precise evaluation & ranking**: See [our leaderboard](https://wildcodebench.github.io/leaderboard.html) for latest LLM rankings before & after rigorous evaluation.
+* **Precise evaluation & ranking**: See [our leaderboard](https://bigcodebench.github.io/leaderboard.html) for latest LLM rankings before & after rigorous evaluation.
 * **Pre-generated samples**: WildCode accelerates code intelligence research by open-sourcing [LLM-generated samples](#-LLM-generated-code) for various models -- no need to re-run the expensive benchmarks!

 ### Main Differences from EvalPlus

 We inherit the design of the EvalPlus framework, which is a flexible and extensible evaluation framework for code generation tasks. However, WildCode has the following differences:
 * Execution Environment: The execution environment in WildCode is less bounded than EvalPlus to support tasks with diverse library dependencies.
-* Test Evaluation: WildCode relies on `unittest` for evaluating the generated code, which is more suitable for the test harness in WildCodeBench.
+* Test Evaluation: WildCode relies on `unittest` for evaluating the generated code, which is more suitable for the test harness in BigCodeBench.

 ## 🔥 Quick Start

 > [!Tip]
 >
 > WildCode ❤️ [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)!
-> WildCodeBench will be integrated to bigcode-evaluation-harness, and you can also run it there!
+> BigCodeBench will be integrated to bigcode-evaluation-harness, and you can also run it there!

 To get started, please first set up the environment:

 ```shell
-pip install wild-code --upgrade
+pip install bigcodebench --upgrade
 ```

 <details><summary>⏬ Install nightly version <i>:: click to expand ::</i></summary>
 <div>

 ```shell
-pip install "git+https://github.com/bigcode-project/wild-code.git" --upgrade
+pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade
 ```

 </div>

@@ -72,8 +72,8 @@ pip install "git+https://github.com/bigcode-project/wild-code.git" --upgrade
 <div>

 ```shell
-git clone https://github.com/bigcode-project/wild-code.git
-cd wild-code
+git clone https://github.com/bigcode-project/bigcodebench.git
+cd bigcodebench
 export PYTHONPATH=$PYTHONPATH:$(pwd)
 pip install -e .
 ```

@@ -91,9 +91,9 @@ pip install -U flash-attn
 To generate code samples from a model, you can use the following command:
 >
 ```shell
-wildcode.generate \
+bigcodebench.generate \
 --model [model_name] \
---dataset [wildcodebench] \
+--dataset [bigcodebench] \
 --nl2code [False|True] \
 --greedy \
 --bs [bs] \

@@ -104,11 +104,11 @@ wildcode.generate \
 --tp [gpu_number]
 ```
 >
-The generated code samples will be stored in a file named `[model_name]--wildcodebench-[nl2c|c2c]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:
+The generated code samples will be stored in a file named `[model_name]--bigcodebench-[nl2c|c2c]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:
 >
 ```shell
-docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/wildcode -t codeeval/code-eval-generate-cu11:25052024 --model [model_name] \
---dataset [wildcodebench] \
+docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/bigcodebench -t codeeval/code-eval-generate-cu11:25052024 --model [model_name] \
+--dataset [bigcodebench] \
 --nl2code [False|True] \
 --greedy \
 --bs [bs] \

@@ -144,38 +144,38 @@ Following which, you can run the built container as shown in above.

 > [!Note]
 >
-> **Expected Schema of `[model_name]--wildcodebench-[task]--[backend]-[temp]-[n_samples].jsonl`**
+> **Expected Schema of `[model_name]--bigcodebench-[task]--[backend]-[temp]-[n_samples].jsonl`**
 >
-> 1. `task_id`: Task ID, which are the keys of `get_wildcodebench()`
+> 1. `task_id`: Task ID, which are the keys of `get_bigcodebench()`
 > 2. `solution` (optional): Self-contained solution (usually including the prompt)
-> * Example: `{"task_id": "WildCodeBench/?", "solution": "def f():\n return 1"}`
+> * Example: `{"task_id": "BigCodeBench/?", "solution": "def f():\n return 1"}`

 ### Code Post-processing

 LLM-generated text may not be compilable code for including natural language lines or incomplete extra code.
-We provide a tool namely `wildcode.sanitize` to clean up the code:
+We provide a tool namely `bigcodebench.sanitize` to clean up the code:

 ```shell
 # 💡 If you are storing codes in jsonl:
-wildcode.sanitize --samples samples.jsonl
+bigcodebench.sanitize --samples samples.jsonl
 # Sanitized code will be produced to `samples-sanitized.jsonl`

 # 💡 If you are storing codes in directories:
-wildcode.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
+bigcodebench.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
 # Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
 ```

 <details><summary>🔎 Checking the compilability of post-processed code<i>:: click to expand ::</i></summary>
 <div>

-To double-check the post-processing results, you can use `wildcode.syncheck` to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:
+To double-check the post-processing results, you can use `bigcodebench.syncheck` to check the code validity before and after sanitization, which will print erroneous code snippets and why they are wrong:

 ```shell
 # 💡 If you are storing codes in jsonl:
-wildcode.syncheck --samples samples.jsonl --dataset [wildcodebench]
+bigcodebench.syncheck --samples samples.jsonl --dataset [bigcodebench]

 # 💡 If you are storing codes in directories:
-wildcode.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [wildcodebench]
+bigcodebench.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [bigcodebench]
 ```

 </div>

@@ -188,23 +188,23 @@ You are strongly recommended to use a sandbox such as [docker](https://docs.dock

 ```shell
 # mount the current directory to the container
-docker run -v $(pwd):/wildcode codeeval/code-eval-evaluate:latest --dataset wildcodebench --samples samples.jsonl
+docker run -v $(pwd):/bigcodebench codeeval/code-eval-evaluate:latest --dataset bigcodebench --samples samples.jsonl
 # ...Or locally ⚠️
-wildcode.evaluate --dataset wildcodebench --samples samples.jsonl
+bigcodebench.evaluate --dataset bigcodebench --samples samples.jsonl
 ```

 ...Or if you want to try it locally regardless of the risks ⚠️:

-First, install the dependencies for WildCodeBench:
+First, install the dependencies for BigCodeBench:

 ```shell
-pip install -r https://raw.githubusercontent.com/bigcode-project/wildcodebench-annotation/main/requirements.txt
+pip install -r https://raw.githubusercontent.com/bigcode-project/bigcodebench-annotation/main/requirements.txt
 ```

 Then, run the evaluation:

 ```shell
-wildcode.evaluate --dataset [wildcodebench] --samples samples.jsonl
+bigcodebench.evaluate --dataset [bigcodebench] --samples samples.jsonl
 ```

 > [!Tip]

@@ -234,7 +234,7 @@ Reading samples...
 1140it [00:00, 1901.64it/s]
 Evaluating samples...
 100%|██████████████████████████████████████████| 1140/1140 [19:53<00:00, 6.75it/s]
-wildcodebench
+bigcodebench
 {'pass@1': 0.568}
 ```

@@ -259,7 +259,7 @@ Here are some tips to speed up the evaluation:
 You can inspect the failed samples by using the following command:

 ```shell
-wildcode.inspect --dataset [wildcodebench] --eval-results sample-sanitized_eval_results.json --in-place
+bigcodebench.inspect --dataset [bigcodebench] --eval-results sample-sanitized_eval_results.json --in-place
 ```

 ## Full script

@@ -272,7 +272,7 @@ bash run.sh

 ## 💻 LLM-generated Code

-We will share pre-generated code samples from LLMs we have [evaluated](https://wildcodebench.github.io/leaderboard.html):
+We will share pre-generated code samples from LLMs we have [evaluated](https://bigcodebench.github.io/leaderboard.html):

 ## Known Issues
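The README's "Expected Schema" note maps directly onto a plain JSON-Lines file. A minimal sketch that writes a one-record `samples.jsonl` in that shape, using only the standard library (the task ID and solution below are placeholders, not real benchmark entries):

```python
# Sketch: build a samples.jsonl matching the schema in the README note:
# one JSON object per line with "task_id" and an optional "solution".
import json

samples = [
    # Placeholder record; real task IDs are the keys of get_bigcodebench().
    {"task_id": "BigCodeBench/0", "solution": "def f():\n    return 1"},
]

with open("samples.jsonl", "w") as fh:
    for record in samples:
        fh.write(json.dumps(record) + "\n")
```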
bigcodebench/__init__.py (+4)

@@ -0,0 +1,4 @@
+try:
+    from bigcodebench._version import __version__, __version_tuple__
+except ImportError:
+    __version__ = "local-dev"
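This new `__init__.py` falls back to a `"local-dev"` version string when the generated `bigcodebench/_version.py` (now listed in `.gitignore` and `.dockerignore`) is absent, e.g. in a raw checkout. A small sketch of the effect:

```python
# Sketch: the try/except above means the package always exposes a version,
# whether installed normally (real _version.py) or run from a raw checkout.
import bigcodebench

# Prints the packaged version, or "local-dev" when _version.py is missing.
print(bigcodebench.__version__)
```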

bigcodebench/data/__init__.py (+2)

@@ -0,0 +1,2 @@
+from bigcodebench.data.bigcodebench import get_bigcodebench, get_bigcodebench_hash
+from bigcodebench.data.utils import load_solutions, write_directory, write_jsonl
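These re-exports keep the import path used throughout the docs, `from bigcodebench.data import ...`, stable after the rename. A sketch of the lookup the buggy-contract issue template gestures at, with the call written out explicitly (the task ID and the `contract` field are taken from that template and are assumptions about the record layout):

```python
# Sketch: look up one task's contract, as buggy_contract.yml describes.
# Assumes get_bigcodebench() returns {task_id: record} and that each
# record carries a "contract" field, per the issue template.
from bigcodebench.data import get_bigcodebench

tasks = get_bigcodebench()
task = tasks["BigCodeBench/0"]  # placeholder task ID
print(task["contract"])
```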
