FIX: Three Bugs in async E2B code sandbox #493

Open
wants to merge 2 commits into
base: main

@rasdani rasdani commented Mar 9, 2025

Both the previous implementation and #484 of the async E2B sandbox contain three bugs.

First, in this line verification_info is actually a list (see the list comprehension right above it), so it cannot be indexed with "language".

rewards = run_async_from_sync(scripts, verification_info["language"])

Here is also a screenshot from pdb:
Screenshot 2025-03-09 at 00 21 23

Fixed by collecting the languages first and passing languages: list[str] around from then on.
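
A minimal sketch of this first fix, assuming each entry of verification_info is a dict with a "language" key. The sample values are made up for illustration and are not taken from the PR:

```python
# Hypothetical sample data: one dict per completion, as a list comprehension would build it.
verification_info = [
    {"language": "python"},
    {"language": "python"},
]

# Indexing the list with a string key is what the buggy line did; it raises TypeError.
try:
    verification_info["language"]
except TypeError as exc:
    print(f"Buggy access fails: {exc}")

# Collect one language per script instead, and pass this list around.
languages: list[str] = [info["language"] for info in verification_info]
print(languages)
```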

The second bug is that language was not passed to run_script.

tasks = [run_script(sbx, script) for script in scripts]

Even though run_script expects language:
async def run_script(sbx, script: str, language: str) -> float:

Third bug: see comment below.

I also added more specific error handling, as the exceptions raised by these three bugs were handled too generally.


rasdani commented Mar 9, 2025

Code rewards after the fix. Previously I got none.
image

@rasdani rasdani changed the title FIX: Async E2B code sandbox bug FIX: Bug in async E2B code sandbox Mar 9, 2025

rasdani commented Mar 9, 2025

Even though we set a sandbox timeout of 30s, sbx.run_code() still runs into its own default timeout of 300s.
Screenshot 2025-03-09 at 14 22 27

Here's how I timed it:

import time

import e2b


async def run_script(sbx, script: str, language: str) -> float:
    try:
        start_time = time.time()
        execution = await sbx.run_code(script, language=language)
        end_time = time.time()
        print(f"Script execution time: {end_time - start_time} seconds")
    except e2b.TimeoutException as e:
        # Report how long the call ran before the timeout was raised.
        end_time = time.time()
        print(f"Script execution time: {end_time - start_time} seconds")
        print(f"TimeoutException from E2B executor: {e}")
        # Check the sandbox state before and after pausing in the debugger.
        is_running = await sbx.is_running()
        print(f"{is_running=}")
        breakpoint()
        is_running = await sbx.is_running()
        print(f"{is_running=}")
        return 0.0
    except Exception as e:
        print(f"Error from E2B executor: {e}")
        return 0.0
    try:
        # The script is expected to evaluate to the reward as a number.
        return float(execution.text)
    except (TypeError, ValueError):
        return 0.0
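
As an illustration of bounding a single call independently of any SDK default, here is a sketch using asyncio.wait_for from the standard library. This is not necessarily how the PR fixes the issue. slow_run_code is a hypothetical stand-in for sbx.run_code(), and a client-side deadline like this only stops waiting; it does not by itself guarantee that the code inside a remote sandbox stops running.

```python
import asyncio


async def slow_run_code(script: str) -> str:
    """Hypothetical stand-in for sbx.run_code() that takes longer than we want to wait."""
    await asyncio.sleep(5)
    return "1.0"


async def run_script_with_deadline(script: str, deadline_s: float) -> float:
    try:
        # wait_for cancels the inner call and raises TimeoutError once the deadline passes.
        text = await asyncio.wait_for(slow_run_code(script), timeout=deadline_s)
    except asyncio.TimeoutError:
        return 0.0
    try:
        return float(text)
    except (TypeError, ValueError):
        return 0.0


reward = asyncio.run(run_script_with_deadline("while True: pass", deadline_s=0.1))
print(reward)
```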

@rasdani rasdani changed the title FIX: Bug in async E2B code sandbox FIX: Bugs in async E2B code sandbox Mar 9, 2025
@rasdani rasdani changed the title FIX: Bugs in async E2B code sandbox FIX: Three Bugs in async E2B code sandbox Mar 9, 2025

rasdani commented Mar 9, 2025

On a more general note:

I trained with open-r1/verifiable-coding-problems-python_decontaminated. It has a long tail of examples with a lot of test cases (see plots in my new dataset).

Even though evaluate_code() limits time for subprocesses to 5s, I still created a derived dataset limiting test cases to at most six.
rasdani/verifiable-coding-problems-python_decontaminated_fewer_test_cases
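
A rough sketch of how test cases could be capped per example. The field names ("verification_info", "test_cases") and the sample record are assumptions for illustration; how the derived dataset was actually built is not stated here.

```python
def limit_test_cases(example: dict, max_cases: int = 6) -> dict:
    """Return a copy of the example that keeps at most max_cases test cases."""
    info = dict(example["verification_info"])  # shallow copy, original stays untouched
    info["test_cases"] = info["test_cases"][:max_cases]
    return {**example, "verification_info": info}


# Hypothetical record with ten test cases.
example = {
    "problem": "Add two numbers.",
    "verification_info": {
        "language": "python",
        "test_cases": [{"input": f"{i} {i}", "output": str(2 * i)} for i in range(10)],
    },
}

trimmed = limit_test_cases(example)
print(len(trimmed["verification_info"]["test_cases"]))
```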

Since script execution is the bottleneck, one could consider checking code with ruff before executing it.
Unfortunately, Python does not have a compiler like Rust does; otherwise something like this would be feasible.
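
A cheap pre-check in the same spirit can be sketched with the standard library's ast.parse instead of ruff. It only catches code that does not parse; it says nothing about runtime errors or correctness. The function name is hypothetical.

```python
import ast


def is_syntactically_valid(code: str) -> bool:
    """Return True if the code parses as Python; nothing is executed."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes on some versions.
        return False
    return True


print(is_syntactically_valid("print(1 + 1)"))
print(is_syntactically_valid("def f(:\n    pass"))
```

Completions that fail this check could receive a reward of 0.0 without ever being sent to the sandbox.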

When I ran into execution timeouts, I also noticed this was due to long code slop. Maybe discarding unreasonably long completions could also help.
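
A length filter along those lines could be as small as the sketch below. The threshold is an arbitrary placeholder, not a value from this PR, and should_execute is a hypothetical helper.

```python
MAX_COMPLETION_CHARS = 8_000  # hypothetical threshold, tune per dataset


def should_execute(completion: str, max_chars: int = MAX_COMPLETION_CHARS) -> bool:
    """Return False for completions too long to be worth sending to the sandbox."""
    return len(completion) <= max_chars


print(should_execute("print('hello')"))
print(should_execute("x = 1\n" * 10_000))
```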


rasdani commented Mar 9, 2025

Also, the "problem_statement" key in open-r1/verifiable-coding-problems-python

does not match the "problem" key used in

prompt.append({"role": "user", "content": example["problem"]})

so recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml is currently broken.

open-r1/verifiable-coding-problems-python_decontaminated works just fine though.
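
One possible way to make the prompt construction tolerate both key names is sketched below. get_problem_text and make_prompt are hypothetical helpers, not functions from the repository; fixing the recipe or renaming the dataset column would be equally valid.

```python
def get_problem_text(example: dict) -> str:
    """Return the task description, accepting either of the two key names."""
    for key in ("problem", "problem_statement"):
        if key in example:
            return example[key]
    raise KeyError("example has neither 'problem' nor 'problem_statement'")


def make_prompt(example: dict) -> list[dict]:
    prompt = []
    prompt.append({"role": "user", "content": get_problem_text(example)})
    return prompt


print(make_prompt({"problem_statement": "Reverse a string."}))
```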
