
MahmoudYahiaAhmed/Reinforcement-Learning


Overview

This project aims to teach how to work with Reinforcement Learning, following this reference. In addition, the following tasks are required:

  1. Turn this code into a module of functions that can work with multiple environments (a possible layout is sketched after this list).

  2. Tune alpha, gamma, and/or epsilon using a decay over episodes.

  3. Implement a grid search to discover the best hyperparameters.
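One possible layout for such a module is sketched below; train and grid match the calls shown later in this README, while make_env and evaluate are hypothetical names introduced only for illustration:

import gym

def make_env(name="Taxi-v3"):
    # Hypothetical factory: any discrete gym environment works with the same Q-table code.
    return gym.make(name)

def train(env, episodes, alpha, gamma, epsilon):
    # Q-learning loop; returns the Q-table and frames (see Training below).
    ...

def evaluate(env, q_table, episodes=100):
    # Greedy rollouts; returns average penalty and timesteps (see Evaluation below).
    ...

def grid(env, alphas, gammas, epsilons):
    # Exhaustive hyperparameter search (see Grid Search below).
    ...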

Setup

First, install the following libraries when working on Windows:

!pip install cmake "gym[atari]" scipy

Then import the required libraries:

import gym
import random
import numpy as np
import pandas as pd
from time import sleep
from IPython.display import clear_output

Training

Training starts by passing the environment to the train function, which returns a Q-table containing the learned knowledge.
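A minimal sketch of what such a train function could look like, assuming the classic gym API (env.step returns four values) and using the imports above; the frames return value is only a guess at what the second output f holds:

def train(env, episodes, alpha, gamma, epsilon):
    # Q-table: one row per state, one column per action, initialized to zero.
    q_table = np.zeros([env.observation_space.n, env.action_space.n])
    frames = []  # assumed: rendered frames for later playback

    for episode in range(1, episodes + 1):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if random.uniform(0, 1) < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(q_table[state])

            next_state, reward, done, info = env.step(action)

            # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
            old_value = q_table[state, action]
            next_max = np.max(q_table[next_state])
            q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
            state = next_state

        if episode % 100 == 0:
            clear_output(wait=True)
            print(f"Episode: {episode}")

    print("Training finished.")
    return q_table, frames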

Evaluation

We evaluate the model by passing its environment and the Q-table generated by training, and we obtain the penalty score and the number of timesteps.
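A sketch of the corresponding evaluation loop, under the same assumptions; the name evaluate and its exact return values are illustrative:

def evaluate(env, q_table, episodes=100):
    total_timesteps, total_penalties = 0, 0

    for _ in range(episodes):
        state = env.reset()
        timesteps, penalties, done = 0, 0, False
        while not done:
            action = np.argmax(q_table[state])   # always act greedily
            state, reward, done, info = env.step(action)
            if reward == -10:                    # illegal pick-up/drop-off
                penalties += 1
            timesteps += 1
        total_timesteps += timesteps
        total_penalties += penalties

    print(f"Results after {episodes} episodes:")
    print(f"Average timesteps per episode: {total_timesteps / episodes}")
    print(f"Average penalties per episode: {total_penalties / episodes}")
    return total_penalties / episodes, total_timesteps / episodes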

# "Taxi-v3" Environment
Q,f=train(env,10000,0.8,0.9,0.8)

##########################################
#	Episode: 10000									  
#	Training finished.										
#																 
#	Results after 100 episodes:					 
#	Average timesteps per episode: 13.24	
#	Average penalties per episode: 0.0		  

Tuning using decay over episodes

The hyperparameters need to change while training, so we decayed them every 5000 episodes:

alpha = alpha * (1 - 0.01)
gamma = gamma * (1 - 0.01)
eps = eps * (1 - 0.6)
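Inside the training loop, this decay schedule could look like the following sketch (the 5000-episode interval and decay factors are taken from the text above):

for episode in range(1, episodes + 1):
    # ... run one Q-learning episode as in the train sketch above ...
    if episode % 5000 == 0:
        alpha = alpha * (1 - 0.01)     # learning-rate decay
        gamma = gamma * (1 - 0.01)     # discount-factor decay
        epsilon = epsilon * (1 - 0.6)  # much faster exploration decay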

Implementing Grid Search

Grid Search must be implemented to find the combination of hyperparameter values that gives the minimum penalty and the minimum number of timesteps.
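A brute-force sketch of this grid search, assuming the train and evaluate sketches above and a fixed 10000-episode budget per combination:

def grid(env, alphas, gammas, epsilons):
    all_params = []
    for alpha in alphas:
        for gamma in gammas:
            for epsilon in epsilons:
                q_table, _ = train(env, 10000, alpha, gamma, epsilon)
                penalty, timesteps = evaluate(env, q_table)
                all_params.append({'alpha': alpha, 'gamma': gamma,
                                   'epsilon': epsilon, 'penalty': penalty,
                                   'time step': timesteps})
    # Sort by penalty first, then by average timesteps.
    all_params.sort(key=lambda p: (p['penalty'], p['time step']))
    print(f"Best parameters are: {all_params[0]}")
    return all_params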

grid(env, alphas, gammas, epsilons)

After getting all parameter combinations and appending them to a list of dictionaries, we sort that list. The best combination found was alpha = 0.6, gamma = 0.6, epsilon = 0.7:

# Best parameters are: {'alpha': 0.6, 'gamma': 0.6, 'epsilon': 0.7, 'penalty': 0.0, 'time step': 12.89}

##########################################
# Episode: 100000
# Training finished.
#
# Results after 1000 episodes:
# Average timesteps per episode: 13.04
# Average penalties per episode: 0.0
##########################################

Reinforcement Learning Project

Table of Contents

Overview:

Available Environments

1. Taxi-v3
2. FrozenLake-v1
3. CliffWalking-v0

You can choose any of them; the default one is Taxi.

Problem definition: There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop them off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.

Introduction: In this project we will implement the Q-learning algorithm and see how the decay of hyperparameters such as the learning rate, the discount factor, and epsilon affects the results, and we will implement a grid search to select the best parameters.
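Under the gym API used in this project, switching environments is a one-line change; for example:

env = gym.make("Taxi-v3")           # default environment
# env = gym.make("FrozenLake-v1")   # alternative
# env = gym.make("CliffWalking-v0") # alternative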

Requirements:

The following libraries need to be installed to run the project:

!pip install gym
!pip install numpy

Default environment info

Taxi-env: The job is to pick up the passenger at one location and drop them off at another. Here are a few things that we'd love our taxi to take care of:

  • Drop off the passenger at the right location.
  • Save the passenger's time by taking the minimum time possible for the drop-off.
  • Take care of the passenger's safety and traffic rules.

Training:

Random Actions:

Trying random actions to see how the agent moves.

[GIF: Random Actions]
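A minimal sketch of such a random baseline, again assuming the classic gym step API used earlier:

state = env.reset()
done, timesteps = False, 0
while not done:
    action = env.action_space.sample()   # uniformly random action
    state, reward, done, info = env.step(action)
    timesteps += 1
print(f"Random agent finished after {timesteps} timesteps")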

Q-Learning

After the agent has been trained:

[GIF: Q-learning]

We can notice the difference in how the agent behaves after it has been trained.

Decaying hyperparameters while training

[GIF: Decay]

Evaluation

[Figure: Eval_100]

Grid search

We used a brute-force algorithm to find the best hyperparameters.

[Figure: Grid search values]

Best hyperparameters:

alpha = 0.9, gamma = 0.9, epsilon = 0.9
