[Bug]: Video upload to wandb broken since 2.4.0 #2055

Open
5 tasks done
OliverUrbann opened this issue Dec 13, 2024 · 9 comments
Labels
bug (Something isn't working), more information needed (Please fill the issue template completely)

Comments

@OliverUrbann

🐛 Bug

With stable_baselines3 2.3.2 on Python 3.11, the unit test below uploads videos to W&B successfully. With 2.4.0, the same test fails: the video is written locally but never uploaded.

To Reproduce

import unittest
import time
import os
import gymnasium as gym
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv
import wandb
from wandb import Api
from wandb.integration.sb3 import WandbCallback
from stable_baselines3 import PPO

class TestWandbVideoUpload(unittest.TestCase):
    def test_video_upload(self):
        env_id = "CartPole-v1"
        video_folder = "videos"
        video_length = 100

        vec_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])

        obs = vec_env.reset()

        run = wandb.init(
            project="test",
            sync_tensorboard=True,  # Automatically upload SB3's TensorBoard metrics
            monitor_gym=True,       # Automatically upload agent playing videos
            # save_code=True,       # Optional
        )

        # Record the video starting at the first step
        vec_env = VecVideoRecorder(
            vec_env,
            video_folder,
            record_video_trigger=lambda x: x == 0,
            video_length=video_length,
            name_prefix=f"agent-{env_id}"
        )

        vec_env.reset()

        model = PPO("MlpPolicy", vec_env, verbose=1, tensorboard_log=f"runs/{run.id}")
        model.learn(
            total_timesteps=5000,
            callback=WandbCallback(
                model_save_path=f"tmp/models/{run.id}",
                verbose=2,
            ),
        )
        run.finish()

        # Give some time for the upload (adjust depending on connection speed)
        time.sleep(30)

        # Use the wandb API to check the run
        api = Api()
        # If you're logged into a different W&B account or using an organization, adjust 'entity' accordingly
        run_path = f"{run.entity}/{run.project}/{run.id}"
        run_api = api.run(run_path)

        # Retrieve a list of all files in the run
        files = run_api.files()
        file_names = [f.name for f in files]

        # Check if a video file is present
        video_files = [name for name in file_names if name.endswith('.mp4')]

        self.assertTrue(len(video_files) > 0, "The video was not uploaded to wandb.")

        # Optional: Print the uploaded video files
        print("Uploaded video files:", video_files)

        # Clean up
        vec_env.close()
        wandb.finish()

if __name__ == '__main__':
    unittest.main()
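
The fixed `time.sleep(30)` in the script above makes the result depend on connection speed. A minimal sketch of a polling alternative follows; `wait_until` is a hypothetical helper name, not part of SB3 or wandb, and the timeout and interval defaults are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 2.0,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the predicate holds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the test, the predicate could be something like `lambda: any(f.name.endswith(".mp4") for f in api.run(run_path).files())`, with the assertion made on the return value of `wait_until` instead of after a fixed sleep.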

Relevant log output / Error message

No response

System Info

  • OS: Linux-5.15.0-124-generic-x86_64-with-glibc2.35 # 134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024
  • Python: 3.11.0rc1
  • Stable-Baselines3: 2.4.0
  • PyTorch: 2.5.1+cu124
  • GPU Enabled: False
  • Numpy: 1.26.4
  • Cloudpickle: 3.1.0
  • Gymnasium: 0.29.1

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
@OliverUrbann OliverUrbann added the bug Something isn't working label Dec 13, 2024
@araffin
Member

araffin commented Dec 13, 2024

Hello,
could you provide the error message too?

@araffin araffin added the more information needed Please fill the issue template completely label Dec 13, 2024
@OliverUrbann
Author

OliverUrbann commented Dec 13, 2024

Sure, here is the log downloaded from wandb produced by the provided script:

Using cpu device
MoviePy - Building video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4.
MoviePy - Writing video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
wandb: WARNING Found log directory outside of given root_logdir, dropping given root_logdir for event file in ../tmp/tests/runs/jnkaujln/PPO_1

MoviePy - Done !
MoviePy - video ready /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
Logging to ../tmp/tests/runs/jnkaujln/PPO_1
Saving video to /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
MoviePy - Building video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4.
MoviePy - Writing video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
                                                                        

MoviePy - Done !
MoviePy - video ready /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
-----------------------------
| time/              |      |
|    fps             | 1479 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1418        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009116981 |
|    clip_fraction        | 0.111       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | 0.00189     |
|    learning_rate        | 0.0003      |
|    loss                 | 8.96        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0165     |
|    value_loss           | 51.3        |
-----------------------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 1393         |
|    iterations           | 3            |
|    time_elapsed         | 4            |
|    total_timesteps      | 6144         |
| train/                  |              |
|    approx_kl            | 0.0094210915 |
|    clip_fraction        | 0.0634       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.667       |
|    explained_variance   | 0.0815       |
|    learning_rate        | 0.0003       |
|    loss                 | 13.8         |
|    n_updates            | 20           |
|    policy_gradient_loss | -0.0178      |
|    value_loss           | 33.9         |
------------------------------------------

So I don't actually see a relevant error. The log message saying the file is ready is correct: there is a working video file on disk, but it is not uploaded.

@araffin
Member

araffin commented Dec 13, 2024

Do you see any difference in the filenames/logs compared to SB3<2.4.0?

EDIT: what is your wandb version? The latest should be wandb==0.19.1

@OliverUrbann
Author

It is the latest.

Package                      Version
---------------------------- --------------
absl-py                      2.1.0
annotated-types              0.7.0
anyio                        4.7.0
argon2-cffi                  23.1.0
argon2-cffi-bindings         21.2.0
arrow                        1.3.0
asttokens                    3.0.0
astunparse                   1.6.3
async-lru                    2.0.4
attrs                        24.2.0
babel                        2.16.0
beautifulsoup4               4.12.3
bleach                       6.2.0
blinker                      1.4
cachetools                   5.5.0
certifi                      2024.8.30
cffi                         1.17.1
charset-normalizer           3.4.0
click                        8.1.7
cloudpickle                  3.1.0
coloredlogs                  15.0.1
comm                         0.2.2
contourpy                    1.3.1
cryptography                 3.4.8
cycler                       0.12.1
dbus-python                  1.2.18
debugpy                      1.8.10
decorator                    4.4.2
defusedxml                   0.7.1
distro                       1.7.0
distro-info                  1.1+ubuntu0.2
dm-tree                      0.1.8
docker-pycreds               0.4.0
executing                    2.1.0
Farama-Notifications         0.0.4
fastjsonschema               2.21.1
filelock                     3.16.1
flatbuffers                  24.3.25
fonttools                    4.55.3
fqdn                         1.5.1
fsspec                       2024.10.0
gast                         0.6.0
gitdb                        4.0.11
GitPython                    3.1.43
google-auth                  2.36.0
google-auth-oauthlib         1.2.1
google-pasta                 0.2.0
grpcio                       1.68.1
gymnasium                    0.29.1
h11                          0.14.0
h5py                         3.12.1
httpcore                     1.0.7
httplib2                     0.20.2
httpx                        0.28.1
humanfriendly                10.0
idna                         3.10
imageio                      2.36.1
imageio-ffmpeg               0.5.1
importlib-metadata           4.6.4
iniconfig                    2.0.0
ipykernel                    6.29.5
ipython                      8.30.0
ipywidgets                   8.1.5
isoduration                  20.11.0
jedi                         0.19.2
jeepney                      0.7.1
Jinja2                       3.1.4
json5                        0.10.0
jsonpointer                  3.0.0
jsonschema                   4.23.0
jsonschema-specifications    2024.10.1
jupyter                      1.1.1
jupyter_client               8.6.3
jupyter-console              6.6.3
jupyter_core                 5.7.2
jupyter-events               0.10.0
jupyter-lsp                  2.2.5
jupyter_server               2.14.2
jupyter_server_terminals     0.5.3
jupyterlab                   4.3.3
jupyterlab_pygments          0.3.0
jupyterlab_server            2.27.3
jupyterlab_widgets           3.0.13
keras                        2.15.0
keyring                      23.5.0
kiwisolver                   1.4.7
launchpadlib                 1.10.16
lazr.restfulclient           0.14.4
lazr.uri                     1.0.6
libclang                     18.1.1
Markdown                     3.7
MarkupSafe                   3.0.2
matplotlib                   3.9.3
matplotlib-inline            0.1.7
mistune                      3.0.2
ml-dtypes                    0.2.0
more-itertools               8.10.0
moviepy                      2.1.1
mpmath                       1.3.0
nbclient                     0.10.1
nbconvert                    7.16.4
nbformat                     5.10.4
nest-asyncio                 1.6.0
networkx                     3.4.2
notebook                     7.3.1
notebook_shim                0.2.4
numpy                        1.26.4
nvidia-cublas-cu12           12.4.5.8
nvidia-cuda-cupti-cu12       12.4.127
nvidia-cuda-nvrtc-cu12       12.4.127
nvidia-cuda-runtime-cu12     12.4.127
nvidia-cudnn-cu12            9.1.0.70
nvidia-cufft-cu12            11.2.1.3
nvidia-curand-cu12           10.3.5.147
nvidia-cusolver-cu12         11.6.1.9
nvidia-cusparse-cu12         12.3.1.170
nvidia-nccl-cu12             2.21.5
nvidia-nvjitlink-cu12        12.4.127
nvidia-nvtx-cu12             12.4.127
oauthlib                     3.2.0
onnx                         1.15.0
onnx-tf                      1.10.0
onnxruntime                  1.17.1
opt_einsum                   3.4.0
overrides                    7.7.0
packaging                    24.2
pandas                       2.2.3
pandocfilters                1.5.1
parso                        0.8.4
pexpect                      4.9.0
pillow                       10.4.0
pip                          22.0.2
platformdirs                 4.3.6
pluggy                       1.5.0
proglog                      0.1.10
prometheus_client            0.21.1
prompt_toolkit               3.0.48
protobuf                     4.25.5
psutil                       6.1.0
ptyprocess                   0.7.0
pure_eval                    0.2.3
pyasn1                       0.6.1
pyasn1_modules               0.4.1
pycparser                    2.22
pydantic                     2.10.3
pydantic_core                2.27.1
pygame                       2.6.1
Pygments                     2.18.0
PyGObject                    3.42.1
PyJWT                        2.3.0
pyparsing                    2.4.7
pytest                       8.3.4
python-apt                   2.4.0+ubuntu4
python-dateutil              2.9.0.post0
python-dotenv                1.0.1
python-json-logger           2.0.7
pytz                         2024.2
PyVirtualDisplay             3.0
PyYAML                       6.0.2
pyzbar                       0.1.9
pyzmq                        26.2.0
referencing                  0.35.1
requests                     2.32.3
requests-oauthlib            2.0.0
rfc3339-validator            0.1.4
rfc3986-validator            0.1.1
rpds-py                      0.22.3
rsa                          4.9
scipy                        1.14.1
SecretStorage                3.3.1
Send2Trash                   1.8.3
sentry-sdk                   2.19.2
setproctitle                 1.3.4
setuptools                   59.6.0
six                          1.16.0
smmap                        5.0.1
sniffio                      1.3.1
soupsieve                    2.6
stable_baselines3            2.3.2
stack-data                   0.6.3
sympy                        1.13.1
tensorboard                  2.15.2
tensorboard-data-server      0.7.2
tensorflow                   2.15.0
tensorflow-addons            0.23.0
tensorflow-estimator         2.15.0
tensorflow-io-gcs-filesystem 0.37.1
tensorflow-probability       0.23.0
termcolor                    2.5.0
terminado                    0.18.1
tinycss2                     1.4.0
torch                        2.5.1
tornado                      6.4.2
tqdm                         4.67.1
traitlets                    5.14.3
triton                       3.1.0
typeguard                    2.13.3
types-python-dateutil        2.9.0.20241206
typing_extensions            4.12.2
tzdata                       2024.2
unattended-upgrades          0.1
uri-template                 1.3.0
urllib3                      2.2.3
wadllib                      1.3.6
wandb                        0.19.1
wcwidth                      0.2.13
webcolors                    24.11.1
webencodings                 0.5.1
websocket-client             1.8.0
Werkzeug                     3.1.3
wheel                        0.37.1
widgetsnbextension           4.0.13
wrapt                        1.14.1
zipp                         1.0.0

And here is the output of a successful run:

Using cpu device
MoviePy - Building video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4.
MoviePy - Writing video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
wandb: WARNING Found log directory outside of given root_logdir, dropping given root_logdir for event file in ../tmp/tests/runs/khcb9wj0/PPO_1

MoviePy - Done !
MoviePy - video ready /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
Logging to ../tmp/tests/runs/khcb9wj0/PPO_1
Saving video to /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
MoviePy - Building video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4.
MoviePy - Writing video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
                                                                        

MoviePy - Done !
MoviePy - video ready /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
-----------------------------
| time/              |      |
|    fps             | 1414 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 1299         |
|    iterations           | 2            |
|    time_elapsed         | 3            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0090919435 |
|    clip_fraction        | 0.119        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.685       |
|    explained_variance   | 0.0122       |
|    learning_rate        | 0.0003       |
|    loss                 | 7.24         |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.0188      |
|    value_loss           | 49.9         |
------------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1270        |
|    iterations           | 3           |
|    time_elapsed         | 4           |
|    total_timesteps      | 6144        |
| train/                  |             |
|    approx_kl            | 0.009359399 |
|    clip_fraction        | 0.0527      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.666      |
|    explained_variance   | 0.112       |
|    learning_rate        | 0.0003      |
|    loss                 | 13.9        |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.0165     |
|    value_loss           | 33.7        |
-----------------------------------------

@curtiscjohnson

I'm also experiencing this issue after a recent upgrade to v2.4.0.

@araffin
Member

araffin commented Dec 18, 2024

This might be related to #2061.

Help is welcome to solve the issue =)

@araffin
Member

araffin commented Dec 20, 2024

@OliverUrbann could you try with #2063?
It might solve your issue.

@OliverUrbann
Author

Thanks! However, it still fails. Just to double-check what I installed:

pip install git+https://github.com/DLR-RM/stable-baselines3.git@fix/video-record
...
pip list | grep stable 
stable-baselines3            2.5.0a1

And here is the test output:

wandb: Currently logged in as:.... Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.19.1
wandb: Run data is saved locally in ...
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run swift-bush-15
wandb: ⭐️ View project at ...
wandb: 🚀 View run at ...
error: XDG_RUNTIME_DIR not set in the environment.
Using cpu device
MoviePy - Building video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4.
MoviePy - Writing video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4

MoviePy - Done !                                            
MoviePy - video ready /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
wandb: WARNING Found log directory outside of given root_logdir, dropping given root_logdir for event file in ../tmp/tests/runs/3fwsvh8f/PPO_1
Logging to ../tmp/tests/runs/3fwsvh8f/PPO_1
Saving video to /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
MoviePy - Building video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4.
MoviePy - Writing video /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4

MoviePy - Done !                                                        
MoviePy - video ready /home/devil/tmp/tests/videos/agent-CartPole-v1-step-0-to-step-100.mp4
-----------------------------
| time/              |      |
|    fps             | 1470 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1417        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008298077 |
|    clip_fraction        | 0.0771      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.687      |
|    explained_variance   | 0.000942    |
|    learning_rate        | 0.0003      |
|    loss                 | 7.24        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0115     |
|    value_loss           | 47.4        |
-----------------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 1403        |
|    iterations           | 3           |
|    time_elapsed         | 4           |
|    total_timesteps      | 6144        |
| train/                  |             |
|    approx_kl            | 0.010035685 |
|    clip_fraction        | 0.0683      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.668      |
|    explained_variance   | 0.0891      |
|    learning_rate        | 0.0003      |
|    loss                 | 14.5        |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.0176     |
|    value_loss           | 35.1        |
-----------------------------------------
wandb: updating run config
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                global_step ▁▅▅▅▅▅▅▅▅▅▅██████████
wandb:                   time/fps █▂▁
wandb:            train/approx_kl ▁█
wandb:        train/clip_fraction █▁
wandb:           train/clip_range ▁▁
wandb:         train/entropy_loss ▁█
wandb:   train/explained_variance ▁█
wandb:        train/learning_rate ▁▁
wandb:                 train/loss ▁█
wandb: train/policy_gradient_loss █▁
wandb:           train/value_loss █▁
wandb: 
wandb: Run summary:
wandb:                global_step 6144
wandb:                   time/fps 1403
wandb:            train/approx_kl 0.01004
wandb:        train/clip_fraction 0.06831
wandb:           train/clip_range 0.2
wandb:         train/entropy_loss -0.66834
wandb:   train/explained_variance 0.08907
wandb:        train/learning_rate 0.0003
wandb:                 train/loss 14.49088
wandb: train/policy_gradient_loss -0.01757
wandb:           train/value_loss 35.06071
wandb: 
wandb: 🚀 View run swift-bush-15 at: ...
wandb: ⭐️ View project at: ...
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ...
True
FAIL

======================================================================
FAIL: test_video_upload (test_video.TestWandbVideoUpload.test_video_upload)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/devil/MoToFlex/tests/test_video.py", line 65, in test_video_upload
    self.assertTrue(len(video_files) > 0, "The video was not uploaded to wandb.")
AssertionError: False is not true : The video was not uploaded to wandb.

----------------------------------------------------------------------
Ran 1 test in 45.509s

FAILED (failures=1)
Finished running tests!

Also checked 2.3.2 again, and it still works.

@araffin
Member

araffin commented Dec 20, 2024

Thanks for trying =)
I've dug further into the issue and I think I found the root cause.

The problem comes from the W&B client: https://github.com/wandb/wandb/blob/8dd25cab52da3603022e75322c847de4def21b1c/wandb/integration/gym/__init__.py#L68

With Gymnasium v1.0, the previous recorder was removed (see wandb/wandb#7047 and #1837). To stay compatible with both Gymnasium v0.29.1 and v1.0, SB3 no longer uses the gym recorder class, which is the class the W&B client monkey-patches to upload videos.
Long story short, the W&B client/callback has to be updated.
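
To illustrate why a monkey-patch stops firing once the patched class is no longer used, here is a toy sketch. `OldRecorder`, `NewRecorder`, and `patch_close` are hypothetical names invented for this example; they are not the actual Gymnasium, SB3, or W&B classes.

```python
uploaded = []  # stands in for the files an integration would send to W&B


class OldRecorder:
    """Hypothetical stand-in for the recorder class that was removed."""

    def close(self, path: str) -> str:
        return path


class NewRecorder:
    """Hypothetical stand-in for a recorder that writes the video itself."""

    def close(self, path: str) -> str:
        return path


def patch_close(cls) -> None:
    """Wrap `cls.close` so every finished video path is also recorded."""
    original = cls.close

    def close_and_upload(self, path: str) -> str:
        result = original(self, path)
        uploaded.append(result)
        return result

    cls.close = close_and_upload


patch_close(OldRecorder)        # what a monkey-patching integration does
OldRecorder().close("old.mp4")  # hooked: the path is recorded
NewRecorder().close("new.mp4")  # not hooked: nothing is recorded
print(uploaded)  # → ['old.mp4']
```

The video from `NewRecorder` is still written, but the upload hook never runs, which matches the symptom in this issue: a valid local file and "0 media file(s)" synced.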

EDIT: in the meantime you can manually call wandb.log(): https://github.com/wandb/wandb/blob/8dd25cab52da3603022e75322c847de4def21b1c/wandb/integration/gym/__init__.py#L80
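
A minimal sketch of that manual workaround, assuming the videos land as `.mp4` files in the `video_folder` passed to `VecVideoRecorder`. `log_new_videos` and its parameters are hypothetical names for this example; only `wandb.log` and `wandb.Video` are W&B calls, and the `"videos"` key mirrors the one used in the linked integration code.

```python
from pathlib import Path
from typing import Callable, List, Optional, Set


def log_new_videos(
    video_folder: str,
    already_logged: Set[str],
    log_video: Optional[Callable[[str], None]] = None,
) -> List[str]:
    """Log every .mp4 in `video_folder` that has not been logged yet.

    `already_logged` is updated in place so repeated calls skip old files.
    """
    if log_video is None:
        # Imported lazily so the helper can be exercised without W&B installed.
        import wandb

        def log_video(path: str) -> None:
            wandb.log({"videos": wandb.Video(path)})

    newly_logged = []
    for path in sorted(Path(video_folder).glob("*.mp4")):
        key = str(path)
        if key not in already_logged:
            log_video(key)
            already_logged.add(key)
            newly_logged.append(key)
    return newly_logged
```

In the reproduction script this could be called as `log_new_videos(video_folder, set())` after `model.learn(...)` and before `run.finish()`, so that the upload happens while the run is still active.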
