You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when goal is achieved), I don't understand how DQN with experience-replay converges, yet I know it does, because I have working code (basically your awesome code, that is) that proves it.
It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the openAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.
What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000. If the car finds the flag quickly the return might be -200. The reason this does not answer my question is because with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory has are tuples of the form (state, action, reward, next_state), and of course tuples are pulled from memory at random, not episode-by-episode.
If reaching the flag yielded a reward of +1 (or 100) etc.... things would make more sense to me...
So, I don't see anything in the memory that indicates that the episode was performed well.
And thus, I have no idea why this DQN code is working for MountainCar.
PS: I asked this question on your blog too (as a comment). Apologies for duplication -- I'm not sure where you look and don't look :)
The text was updated successfully, but these errors were encountered:
Given that the OpenAI Gym environment MountainCar-v0 ALWAYS returns -1.0 as a reward (even when goal is achieved), I don't understand how DQN with experience-replay converges, yet I know it does, because I have working code (basically your awesome code, that is) that proves it.
It is my understanding that ultimately there needs to be a "sparse reward" that is found. Yet as far as I can see from the openAI Gym code, there is never any reward other than -1. It feels more like a "no reward" environment.
What almost answers my question, but in fact does not: when the task is completed quickly, the return (sum of rewards) of the episode is larger. So if the car never finds the flag, the return is -1000. If the car finds the flag quickly the return might be -200. The reason this does not answer my question is because with DQN and experience replay, those returns (-1000, -200) are never present in the experience replay memory. All the memory has are tuples of the form (state, action, reward, next_state), and of course tuples are pulled from memory at random, not episode-by-episode.
If reaching the flag yielded a reward of +1 (or 100) etc.... things would make more sense to me...
So, I don't see anything in the memory that indicates that the episode was performed well.
And thus, I have no idea why this DQN code is working for MountainCar.
PS: I asked this question on your blog too (as a comment). Apologies for duplication -- I'm not sure where you look and don't look :)
The text was updated successfully, but these errors were encountered: