DQN mnist & mountain car performance #20

pluebcke · 2020-03-25T19:19:11Z

Hi,

while working on a PyTorch DQN agent for BSuite experiments, I noticed quite bad results on the mnist and mountain car experiments. I see that a similar question was addressed here, but the thread was closed.

To further investigate, I created a new conda environment, downloaded and installed a fresh copy of BSuite and ran the DQN agent from the baselines. The only settings I've changed were "bsuite_id" to "SWEEP" and the save path.

When you compare the results from both agents with the barplot on page 16 of the BSuite manuscript, you notice that both agents have worse performance on mnist and mountaincar and better performance on catch.

Were there any changes on the environments that I missed? The DQN agent from the manuscript did use the default parameters from the baseline directory, correct?

Thanks,
Peter

iosband · 2020-04-01T14:05:17Z

Hi Peter!

Thanks for raising this... I think we might have seen some slippage in agent performance.

I'm not sure if this has come from updates to:

Tensorflow 1 --> 2
Different parameter settings
Changes to random seed

My suspicion is there are some small details in this migration TF1->TF2 that changed some scores (the agents aren't exactly the same).
We will look into this and then re-run baselines with updated numbers.

Many thanks,
Ian

iosband · 2020-08-10T15:06:20Z

Hello again!

I have just run the agents checked in at HEAD and I did not see your observed scoring...

We may need to tool in some more continuous testing, but the scores on mnist in particular seem "off" for the DQN implementation.

Can you confirm this is still an issue for you?
Is this poor performance for your implementation of DQN, or the baseline implementation we provide?

mklissa · 2020-08-17T18:30:57Z

I also have a similar observation concerning MountainCar, however it is related to the AC algorithm. To me it seems like there is a major difference between the results reported in the paper (close to 1) versus the ones in this thread (close to 0). I have also tried running actor_critic_rnn on mountain_car and it does not seem to learn (on default hyper params).

iosband · 2020-08-20T09:02:28Z

Yes @mklissa - I see that difference above.

There have been several moving pieces:

Changes to the bsuite environment (at release MountainCar had gravity the wrong direction)
Changes to agent (migrate from TF1 -> TF2/JAX)

However, I think the best approach is to go from what is at HEAD and start a new issue to update paper/reference colabs to incorporate this bug fix.

pluebcke · 2020-08-30T18:28:57Z

Dear Ian,

Thanks for looking into this! Back in march, I observed poor performance on mnist with both the baseline implementation as well as my own implementation of DQN.

Given that mnist seems to work perfectly fine for you, I assume there must be some problem on my side. I will set up a system from scratch and run the baseline implementation of dqn again. It might take a while though until I find time to do that.

Best regards,

Peter

pluebcke · 2020-09-29T12:14:04Z

Hi,
I finally found some time to look into the issue. On my laptop, the performance of the baselines tensorflow DQN agent is still quite bad (with a score somewhere around 0.25 in the bar plots above).

I used a fresh install of pop!_os 20.04 (distribution based on Ubuntu) and then performed as few steps as possible to run the agent:

Installed Anaconda
Created a Conda environment with Python 3.7
Steps from the bsuite github page:

pip install --upgrade pip setuptools
pip install bsuite
pip install bsuite[baselines]

Opened bsuite/bsuite/baselines/tf/dqn/run.py, changed the bsuite_id to 'SWEEP' and set the 'verbose' flag to False
python run.py

Hope this helps.

Best regards,
Peter

iosband · 2020-09-29T12:55:00Z

Ah... OK well I think in order to get the claimed performance, you need to run the dqn.default_agent()

I can see that this is a bit confusing, but we wanted to expose the flags as an easy way for people to tinker!
If you instead fo to baselines/tf/run.py then you should be able to get the same behaviour... is that right?

BTW... do you think we should instead remove the flag options and avoid this kind of confusion?

pluebcke · 2020-10-10T12:33:54Z

I would keep the flag options but maybe have the same values as the default agent as default.

As you suggested, I replaced the agent (that uses the flags) with the dqn.default_agent() in run.py and ran the experiments again. Unfortunately no improvement on the mnist experiments.

@mklissa you said you observed something similar on MountainCar. Did the mnist experiments work for you? If I'm the only one experiencing this problem, then there might just be some issue on my side.

Best regards,
Peter

jbarsce · 2020-11-18T21:23:32Z

Hey there, I'm writing to report that I'm also experiencing the same problem as @pluebcke in MNIST. I couldn't replicate the good MNIST results as reported in the paper. I also noticed a bad performance (0.2-0.26 score at most) using PPO and DQN agents from an external library (stable-baselines), tried different hyper-parameters, number of layers/neurons, activation functions, with no effect. I also checked the MNIST env implementation offered here and seemed OK to me.

Today I created a new virtual env with the latest BSuite version with the baselines, runned the 20 seeds twice and the baseline DQN agent also scored 0.23. This also happens with noise and scale variants.

iosband · 2020-11-19T10:05:54Z

Hi @jbarsce - I'm not sure I understand the question.

So, are you saying that:
(a) The TF DQN checked into bsuite.baselines is not solving the bandit task for you?
(b) Another agent is unable to solve the MNIST task?

We have some tools for testing this internally within Google/DeepMind... and based on that I'm confident that the bsuite/baselines/jax/dqn and bsuite/baselines/tf/dqn do reproduce the performance.

However... we clearly need to work out a way to share these tests/reproducibility/installation instructions so that this confusion does not arise.

iosband · 2020-11-19T12:19:05Z

For reference, here is a record of the nightly runs for the TF baselines.
You can see that some of the experiments are a little noisy... but that the DQN TF is consistently reproducing the MNIST results abov.

jbarsce · 2020-11-19T20:32:14Z

Hi Ian, thanks for the quick reply! yes, I ran the BSuite experiments with another DQN agent and noticed that, while the other envs performed similar than in the accompanying paper, MNIST was the only that underperformed.

As this external agent had several variations, I tried to replicate the results with the DQN agent from this repo, trying tf and jax and isolating them in a new virtual environment. In case they are of any help, the following were the steps I followed (I took them from the jax repo and from here)

Created a conda environment with python==3.6
pip install --upgrade pip setuptools
pip install bsuite[baselines]

For tensorflow 2.1

I ran the experiments with

python bsuite/bsuite/baselines/tf/dqn/run.py --bsuite_id=MNIST

For jax

pip install git+https://github.com/deepmind/dm-haiku
pip install --upgrade jax jaxlib
pip install git+git://github.com/deepmind/optax.git
pip install git+git://github.com/deepmind/rlax.git
python bsuite/bsuite/baselines/jax/dqn/run.py --bsuite_id=MNIST

Environment: Ubuntu 18.04 bionic

Please let me know if you need any other information. Finally, thanks for this great repository
Juan

pluebcke · 2020-11-20T11:57:04Z

Just a wild guess, maybe something went wrong with the download of the input mnist dataset for Juan and me?

iosband · 2020-11-20T15:26:29Z

Yes interesting... something is getting lost between the version that is checked in to Google3 and the settings you are running.

@yotam @aslanides and I will have a look into this...

Going to keep this open for now and try to reproduce this...

braham-snyder · 2022-08-14T13:02:26Z

Repro in ~10 lines (excluding imports): https://colab.research.google.com/drive/1XtTv-p2bXfvMBT_77cWjWRHPXIvimWlO?usp=sharing

iliasdf · 2024-05-23T20:37:08Z

Hi,
Recently, I also started working with Bsuite, which I find a lovely environment to work with. I noticed as well that the results of the MNIST experiment are off. Apparently, you have discussed this issue a while back already. Did you find a solution in the meantime already?
Thanks,
Ilias

aslanides self-assigned this Apr 16, 2020

aslanides added the baselines Issues pertaining to baseline agents. label Apr 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DQN mnist & mountain car performance #20

DQN mnist & mountain car performance #20

pluebcke commented Mar 25, 2020

iosband commented Apr 1, 2020

iosband commented Aug 10, 2020

mklissa commented Aug 17, 2020

iosband commented Aug 20, 2020

pluebcke commented Aug 30, 2020

pluebcke commented Sep 29, 2020

iosband commented Sep 29, 2020

pluebcke commented Oct 10, 2020 •

edited

Loading

jbarsce commented Nov 18, 2020

iosband commented Nov 19, 2020

iosband commented Nov 19, 2020

jbarsce commented Nov 19, 2020 •

edited

Loading

pluebcke commented Nov 20, 2020

iosband commented Nov 20, 2020

braham-snyder commented Aug 14, 2022 •

edited

Loading

iliasdf commented May 23, 2024

DQN mnist & mountain car performance #20

DQN mnist & mountain car performance #20

Comments

pluebcke commented Mar 25, 2020

iosband commented Apr 1, 2020

iosband commented Aug 10, 2020

mklissa commented Aug 17, 2020

iosband commented Aug 20, 2020

pluebcke commented Aug 30, 2020

pluebcke commented Sep 29, 2020

iosband commented Sep 29, 2020

pluebcke commented Oct 10, 2020 • edited Loading

jbarsce commented Nov 18, 2020

iosband commented Nov 19, 2020

iosband commented Nov 19, 2020

jbarsce commented Nov 19, 2020 • edited Loading

pluebcke commented Nov 20, 2020

iosband commented Nov 20, 2020

braham-snyder commented Aug 14, 2022 • edited Loading

iliasdf commented May 23, 2024

pluebcke commented Oct 10, 2020 •

edited

Loading

jbarsce commented Nov 19, 2020 •

edited

Loading

braham-snyder commented Aug 14, 2022 •

edited

Loading