
Commit

Proofread recaps and week 1 assignments
mmamedli authored and dniku committed Jan 12, 2021
1 parent b9e0b93 commit 1e936b0
Showing 5 changed files with 187 additions and 153 deletions.
30 changes: 15 additions & 15 deletions week1_intro/crossentropy_method.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Crossentropy method\n",
"\n",
"This notebook will teach you to solve reinforcement learning problems with crossentropy method. We'll follow-up by scaling everything up and using neural network policy."
"This notebook will teach you to solve reinforcement learning problems with crossentropy method. After that we'll scale everything up using neural network policy."
]
},
{
@@ -24,8 +24,8 @@
"\n",
" !touch .setup_complete\n",
"\n",
"# This code creates a virtual display to draw game images on.\n",
"# It will have no effect if your machine has a monitor.\n",
"# This code creates a virtual display for drawing game images on.\n",
"# It won't have any effect if your machine has a monitor.\n",
"if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
" !bash ../xvfb start\n",
" os.environ['DISPLAY'] = ':1'"
@@ -69,7 +69,7 @@
"\n",
"Since we still use integer state and action representations, you can use a 2-dimensional array to represent the policy.\n",
"\n",
"Please initialize the policy __uniformly__, that is, probabililities of all actions should be equal."
"Please initialize the policy __uniformly__, that is, the probabililities of all actions should be equal."
]
},
{
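For reference, a minimal sketch of what the uniform initialization might look like. The `n_states` and `n_actions` values below are illustrative assumptions (Taxi-v3 has 500 states and 6 actions); the notebook itself takes them from `env`.

```python
import numpy as np

n_states, n_actions = 500, 6  # assumed Taxi-v3 sizes; the notebook derives these from env

# uniform policy: in every state, every action is equally likely
policy = np.ones((n_states, n_actions)) / n_actions

assert np.allclose(policy.sum(axis=1), 1.0)  # each row is a valid probability distribution
```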
@@ -114,9 +114,9 @@
"source": [
"def generate_session(env, policy, t_max=10**4):\n",
" \"\"\"\n",
" Play game until end or for t_max ticks.\n",
" :param policy: an array of shape [n_states,n_actions] with action probabilities\n",
" :returns: list of states, list of actions and sum of rewards\n",
" Play the game until the end or for t_max ticks.\n",
" :param policy: an array of shape [n_states,n_actions] with the action probabilities\n",
" :returns: list of states, list of actions and the sum of rewards\n",
" \"\"\"\n",
" states, actions = [], []\n",
" total_reward = 0.\n",
@@ -198,7 +198,7 @@
" [i.e. sorted by session number and timestep within session]\n",
"\n",
" If you are confused, see examples below. Please don't assume that states are integers\n",
" (they will become different later).\n",
" (their type will change later).\n",
" \"\"\"\n",
"\n",
" reward_threshold = <YOUR CODE: compute minimum reward for elite sessions. Hint: use np.percentile()>\n",
@@ -267,7 +267,7 @@
" policy[s_i,a_i] ~ #[occurrences of s_i and a_i in elite states/actions]\n",
"\n",
" Don't forget to normalize the policy to get valid probabilities and handle the 0/0 case.\n",
" For states that you never visited, use a uniform distribution (1/n_actions for all states).\n",
" For states, that you never visited, use a uniform distribution (1/n_actions for all states).\n",
"\n",
" :param elite_states: 1D list of states from elite sessions\n",
" :param elite_actions: 1D list of actions from elite sessions\n",
@@ -387,21 +387,21 @@
"\n",
" policy = learning_rate * new_policy + (1 - learning_rate) * policy\n",
"\n",
" # display results on chart\n",
" # display results on the chart\n",
" show_progress(rewards_batch, log, percentile)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reflecting on results\n",
"### Reflecting on the results\n",
"\n",
"You may have noticed that the taxi problem quickly converges from less than -1000 to a near-optimal score and then descends back into -50/-100. This is in part because the environment has some innate randomness. Namely, the starting points of passenger/driver change from episode to episode.\n",
"You may have noticed that the taxi problem quickly converges from less than -1000 to a near-optimal score and then descends back to -50/-100. This is in part because the environment has some innate randomness. Namely, the starting points of passenger/driver change from episode to episode.\n",
"\n",
"In case CEM failed to learn how to win from one distinct starting point, it will simply discard it because no sessions from that starting point will make it into the \"elites\".\n",
"In case CEM failed to learn, how to win from one distinct starting point, it will simply discard it because no sessions from that starting point will make it into the \"elites\".\n",
"\n",
"To mitigate that problem, you can either reduce the threshold for elite sessions (duct tape way) or change the way you evaluate strategy (theoretically correct way). For each starting state, you can sample an action randomly, and then evaluate this action by running _several_ games starting from it and averaging the total reward. Choosing elite sessions with this kind of sampling (where each session's reward is counted as the average of the rewards of all sessions with the same starting state and action) should improve the performance of your policy."
"To mitigate that problem, you can either reduce the threshold for elite sessions (duct tape way) or change the way you evaluate the strategy (theoretically correct way). For each starting state, you can sample an action randomly, and then evaluate this action by running _several_ games starting from it and averaging the total reward. Choosing elite sessions with this kind of sampling (where each session reward is counted as the average of the rewards of all sessions with the same starting state and action) should improve the performance of your policy."
]
},
{
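The smoothed update shown earlier (`policy = learning_rate * new_policy + (1 - learning_rate) * policy`) keeps each row a valid distribution while retaining some probability mass on actions the elites never took, which preserves exploration. A tiny numeric illustration:

```python
import numpy as np

learning_rate = 0.5
old_row = np.array([0.25, 0.25, 0.25, 0.25])   # previous policy for some state
new_row = np.array([0.0, 1.0, 0.0, 0.0])       # elite counts said: always take action 1

blended = learning_rate * new_row + (1 - learning_rate) * old_row
print(blended)                          # [0.125 0.625 0.125 0.125]
assert np.isclose(blended.sum(), 1.0)   # still a valid probability distribution
```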
@@ -429,5 +429,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 0
}
36 changes: 19 additions & 17 deletions week1_intro/deep_crossentropy_method.ipynb
@@ -27,8 +27,8 @@
"\n",
" !touch .setup_complete\n",
"\n",
"# This code creates a virtual display to draw game images on.\n",
"# It will have no effect if your machine has a monitor.\n",
"# This code creates a virtual display for drawing game images on.\n",
"# It won't have any effect if your machine has a monitor.\n",
"if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
" !bash ../xvfb start\n",
" os.environ['DISPLAY'] = ':1'"
@@ -65,8 +65,8 @@
"\n",
"For this assignment we'll utilize the simplified neural network implementation from __[Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)__. Here's what you'll need:\n",
"\n",
"* `agent.partial_fit(states, actions)` - make a single training pass over the data. Maximize the probabilitity of :actions: from :states:\n",
"* `agent.predict_proba(states)` - predict probabilities of all actions, a matrix of shape __[len(states), n_actions]__\n"
"* `agent.partial_fit(states, actions)` - makes a single training pass over the data. Maximize the probabilitity of :actions: from :states:\n",
"* `agent.predict_proba(states)` - predicts probabilities of all actions, a matrix of shape __[len(states), n_actions]__\n"
]
},
{
@@ -82,7 +82,7 @@
" activation='tanh',\n",
")\n",
"\n",
"# initialize agent to the dimension of state space and number of actions\n",
"# initialize agent to the dimension of state space and a number of actions\n",
"agent.partial_fit([env.reset()] * n_actions, range(n_actions), range(n_actions))"
]
},
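A self-contained sketch of how this agent is typically used, assuming CartPole-v0's 4-dimensional observations and 2 actions (the dimensions and hidden layer sizes here are illustrative assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

n_actions, state_dim = 2, 4   # assumed CartPole-v0 sizes
agent = MLPClassifier(hidden_layer_sizes=(20, 20), activation='tanh')

# the first partial_fit call must list every class (action) the agent will ever predict
agent.partial_fit(np.zeros((n_actions, state_dim)), range(n_actions), classes=range(n_actions))

probs = agent.predict_proba(np.zeros((1, state_dim)))[0]   # shape: (n_actions,)
action = np.random.choice(n_actions, p=probs)              # sample, don't argmax
```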
@@ -107,7 +107,7 @@
" # use agent to predict a vector of action probabilities for state :s:\n",
" probs = <YOUR CODE>\n",
"\n",
" assert probs.shape == (env.action_space.n,), \"make sure probabilities are a vector (hint: np.reshape)\"\n",
" assert probs.shape == (env.action_space.n,), \"make sure that the probabilities are a vector (hint: np.reshape)\"\n",
" \n",
" # use the probabilities you predicted to pick an action\n",
" # sample proportionally to the probabilities, don't just take the most likely action\n",
@@ -168,7 +168,7 @@
" [i.e. sorted by session number and timestep within session]\n",
"\n",
" If you are confused, see examples below. Please don't assume that states are integers\n",
" (they will become different later).\n",
" (their type will change later).\n",
" \"\"\"\n",
"\n",
" <YOUR CODE: copy-paste your implementation from the previous notebook>\n",
@@ -274,7 +274,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Show video. This may not work in some setups. If it doesn't\n",
"# Show video. In some setups this may not work. If it doesn't\n",
"# work for you, you can download the videos and view them locally.\n",
"\n",
"from pathlib import Path\n",
@@ -297,23 +297,23 @@
"\n",
"By this moment you should have got enough score on [CartPole-v0](https://gym.openai.com/envs/CartPole-v0) to consider it solved (see the link). It's time to try something harder.\n",
"\n",
"_if you have any trouble with CartPole-v0 and feel stuck, take a look at the forums_\n",
"_if you have any trouble with CartPole-v0 and feel stuck, take a look on forums_\n",
"\n",
"Your assignment is to obtain average reward of __at least -150__ on `MountainCar-v0`.\n",
"Your assignment is to obtain an average reward of __at least -150__ on `MountainCar-v0`.\n",
"\n",
"See the tips section below, it's kinda important.\n",
" \n",
"* Bonus quest: Devise a way to speed up training against the default version\n",
" * Obvious improvement: use [joblib](https://www.google.com/search?client=ubuntu&channel=fs&q=joblib&ie=utf-8&oe=utf-8)\n",
" * Try re-using samples from 3-5 last iterations when computing threshold and training\n",
" * Experiment with amount of training iterations and learning rate of the neural network (see params)\n",
" * Try re-using samples from 3-5 last iterations when computing threshold and during training\n",
" * Experiment with an amount of training iterations and the learning rate of the neural network (see params)\n",
" \n",
" \n",
"### Tips\n",
"* Gym page: [MountainCar](https://gym.openai.com/envs/MountainCar-v0)\n",
"* Sessions for MountainCar may last for 10k+ ticks. Make sure ```t_max``` param is at least 10k.\n",
" * Also it may be a good idea to cut rewards via \">\" and not \">=\". If 90% of your sessions get reward of -10k and 10% are better, than if you use percentile 20% as threshold, R >= threshold __fails cut off bad sessions__ whule R > threshold works alright.\n",
"* _issue with gym_: Some versions of gym limit game time by 200 ticks. This will prevent cem training in most cases. Make sure your agent is able to play for the specified __t_max__, and if it isn't, try `env = gym.make(\"MountainCar-v0\").env` or otherwise get rid of TimeLimit wrapper.\n",
" * Also it may be a good idea to cut rewards via \">\" and not \">=\". If 90% of your sessions get reward of -10k and 10% are better, than if you use percentile 20% as the threshold, R >= threshold __fails cut off bad sessions__ whule R > threshold works alright.\n",
"* _issue with gym_: Some versions of gym limit game time by 200 ticks. This will prevent the training in most cases. Make sure your agent is able to play for the specified __t_max__, and if it isn't, try `env = gym.make(\"MountainCar-v0\").env` or otherwise get rid of TimeLimit wrapper.\n",
"* If it won't train it's a good idea to plot reward distribution and record sessions: they may give you some clue. If they don't, call course staff :)\n",
"* 20-neuron network is probably not enough, feel free to experiment.\n",
"\n",
@@ -332,7 +332,9 @@
"<Figure size 700x700 with 1 Axes>"
]
},
"metadata": {},
"metadata": {
"tags": []
},
"output_type": "display_data"
}
],
@@ -359,7 +361,7 @@
" ax.set_xlabel('position (x)')\n",
" ax.set_ylabel('velocity (v)')\n",
" \n",
" # Sample a trajectory and draw it\n",
" # Sample the trajectory and draw it\n",
" states, actions, _ = generate_session(env, agent)\n",
" states = np.array(states)\n",
" ax.plot(states[:, 0], states[:, 1], color='white')\n",
@@ -416,5 +418,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 0
}
38 changes: 19 additions & 19 deletions week1_intro/gym_interface.ipynb
@@ -15,8 +15,8 @@
"\n",
" !touch .setup_complete\n",
"\n",
"# This code creates a virtual display to draw game images on.\n",
"# It will have no effect if your machine has a monitor.\n",
"# This code creates a virtual display for drawing game images.\n",
"# It has no effect if your machine has a monitor.\n",
"if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n",
" !bash ../xvfb start\n",
" os.environ['DISPLAY'] = ':1'"
@@ -39,9 +39,9 @@
"source": [
"### OpenAI Gym\n",
"\n",
"We're gonna spend several next weeks learning algorithms that solve decision processes. We are then in need of some interesting decision problems to test our algorithms.\n",
"We're gonna spend several next weeks learning algorithms that solve decision processes. So we need a few interesting decision problems to test our algorithms.\n",
"\n",
"That's where OpenAI Gym comes into play. It's a Python library that wraps many classical decision problems including robot control, videogames and board games.\n",
"That's where OpenAI Gym comes into play. It's a Python library that wraps many classical decision problems, including robot control, videogames and board games.\n",
"\n",
"So here's how it works:"
]
@@ -66,7 +66,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: if you're running this on your local machine, you'll see a window pop up with the image above. Don't close it, just alt-tab away."
"Note: if you're running this on your local machine, you'll see a window popping up with the image above. Don't close it, just alt-tab away."
]
},
{
@@ -75,13 +75,13 @@
"source": [
"### Gym interface\n",
"\n",
"The three main methods of an environment are\n",
"* `reset()`: reset environment to the initial state, _return first observation_\n",
"* `render()`: show current environment state (a more colorful version :) )\n",
"* `step(a)`: commit action `a` and return `(new_observation, reward, is_done, info)`\n",
"The three main methods of this environment are:\n",
"* `reset()`: resets an environment to the initial state, _return first observation_\n",
"* `render()`: shows the current environment state (a more colorful version :) )\n",
"* `step(a)`: commits an action `a` and returns `(new_observation, reward, is_done, info)`\n",
" * `new_observation`: an observation right after committing the action `a`\n",
" * `reward`: a number representing your reward for committing action `a`\n",
" * `is_done`: True if the MDP has just finished, False if still in progress\n",
" * `reward`: a number which represents your reward for committing action `a`\n",
" * `is_done`: True if the MDP has just finished, False if it is still in progress\n",
" * `info`: some auxiliary stuff about what just happened. For now, ignore it."
]
},
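To make the list above concrete, here is a minimal round-trip through the interface (MountainCar-v0, classic gym API with the four-tuple `step` return); the printed values are indicative only.

```python
import gym

env = gym.make("MountainCar-v0")

obs = env.reset()                             # initial observation: [position, velocity]
new_obs, reward, is_done, info = env.step(2)  # in MountainCar, action 2 means "push right"

print(new_obs)   # e.g. [-0.45  0.001] -- position and velocity after one tick
print(reward)    # -1.0 on every tick in MountainCar
print(is_done)   # False until the flag is reached or the time limit hits
```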
@@ -94,7 +94,7 @@
"obs0 = env.reset()\n",
"print(\"initial observation code:\", obs0)\n",
"\n",
"# Note: in MountainCar, observation is just two numbers: car position and velocity"
"# Note: in MountainCar, an observation is just two numbers: car position and velocity"
]
},
{
@@ -110,7 +110,7 @@
"print(\"reward:\", reward)\n",
"print(\"is game over?:\", is_done)\n",
"\n",
"# Note: as you can see, the car has moved to the right slightly (around 0.0005)"
"# Note: as you can see, the car has moved slightly to the right (around 0.0005)"
]
},
{
@@ -119,7 +119,7 @@
"source": [
"### Play with it\n",
"\n",
"Below is the code that drives the car to the right. However, if you simply use the default policy, the car will not reach the flag at the far right due to gravity.\n",
"Below is the code that drives the car to the right. However, if you simply use the default policy, the car won't reach the flag at the far right due to the gravity.\n",
"\n",
"__Your task__ is to fix it. Find a strategy that reaches the flag. \n",
"\n",
@@ -151,14 +151,14 @@
"source": [
"def policy(obs, t):\n",
" # Write the code for your policy here. You can use the observation\n",
" # (a tuple of position and velocity), the current time step, or both,\n",
" # (a tuple of the position and the velocity), the current time step, or both,\n",
" # if you want.\n",
" position, velocity = obs\n",
" \n",
" # This is an example policy. You can try running it, but it will not work.\n",
" # This is an example policy. You can try running it, but it won't work.\n",
" # Your goal is to fix that. You don't need anything sophisticated here,\n",
" # and you can hard-code any policy that seems to work.\n",
" # Hint: think how you would make a swing go farther and faster.\n",
" # Hint: think how you would make a swing go faster and faster.\n",
" return actions['right']"
]
},
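If you want a concrete instance of the swing analogy, one well-known heuristic (offered as an illustration, not as the intended answer) is to push in the direction the car is already moving, assuming the `actions` dictionary defined earlier in the notebook maps 'left' and 'right' to the corresponding MountainCar actions:

```python
def policy(obs, t):
    # Pump like a swing: always push in the direction of the current velocity,
    # so each pass up the slope adds energy until the car clears the hill.
    position, velocity = obs
    return actions['right'] if velocity >= 0 else actions['left']
```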
@@ -178,7 +178,7 @@
" action = policy(obs, t) # Call your policy\n",
" obs, reward, done, _ = env.step(action) # Pass the action chosen by the policy to the environment\n",
" \n",
" # We don't do anything with reward here because MountainCar is a very simple environment,\n",
" # We won't do anything with reward here because MountainCar is a very simple environment,\n",
" # and reward is a constant -1. Therefore, your goal is to end the episode as quickly as possible.\n",
"\n",
" # Draw game image on display.\n",
@@ -214,5 +214,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 0
}