OpenAI Gym and Python for Q-learning - Reinforcement Learning Code Project
OpenAI Gym and Python setup for Q-learning
What's up, guys? Over the next couple of posts, we're going to be building and playing our very first game with reinforcement learning!
We're going to use the knowledge we gained last time about Q-learning to teach an agent how to play a game called Frozen Lake. We'll be using Python and Gymnasium (previously known as OpenAI Gym) to develop our algorithm. So let's get to it!
Gymnasium
As mentioned, we'll be using Python and Gymnasium to develop our reinforcement learning algorithm. The Gym library is a collection of environments that we can use with the reinforcement learning algorithms we develop.
Gym has a ton of environments ranging from simple text-based games to Atari games like Breakout and Space Invaders. The library is intuitive to use and simple to install. Just run pip install gymnasium, and you're good to go!
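If you want to double-check that the install worked before moving on, a quick import is enough. This is just an optional sanity check, and the version number you see will depend on when you install:
import gymnasium as gym
# If the import succeeds and a version prints, the library is installed.
print(gym.__version__)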
We'll be making use of Gym to provide us with an environment for a simple game called Frozen Lake. We'll then train an agent to play the game using Q-learning, and we'll get a playback of how the agent does after being trained.
So, let's jump into the details for Frozen Lake!
Frozen Lake
I've grabbed the description of the game directly from Gym's website. Let's read through it together.
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:
SFFF
FHFH
FFFH
HFFG
This grid is our environment, where S is the agent's starting point, and it's safe. F represents the frozen surface and is also safe. H represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that's not good. Finally, G represents the goal, which is the space on the grid where the prized frisbee is located.
The agent can navigate left, right, up, and down, and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of one if it reaches the goal, and zero otherwise.
State | Description | Reward
---|---|---
S | Agent's starting point - safe | 0
F | Frozen surface - safe | 0
H | Hole - game over | 0
G | Goal - game over | 1
Alright, so you got it? Our agent has to navigate the grid by staying on the frozen surface without falling into any holes until it reaches the frisbee. If it reaches the frisbee, it wins with a reward of plus one. If it falls in a hole, it loses and receives no points for the entire episode.
Cool! Let's jump into the code!
Setting up Frozen Lake in code
The code we'll be working with largely follows Thomas Simonini's Frozen Lake Q-learning implementation with some slight modifications.
Libraries
First, we're importing all the libraries we'll be using. Not many, really... just numpy, gymnasium, random, time, and clear_output from IPython's display module.
import numpy as np
import gymnasium as gym
import random
import time
from IPython.display import clear_output
Creating the environment
Next, to create our environment, we just call gym.make() and pass a string with the name of the environment we want to set up. We'll be using the environment FrozenLake-v1.
env = gym.make('FrozenLake-v1', render_mode='ansi')
With this env object, we're able to query for information about the environment, sample states and actions, retrieve rewards, and have our agent navigate the frozen lake. That's all made available to us conveniently with Gym. Note that we also pass render_mode='ansi' when creating the environment so that env.render() returns a text representation of the grid that we can print.
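Just as a quick illustration (this isn't part of the script we're building), here's the kind of thing the env object lets us do. The action encoding in the comments assumes the standard Frozen Lake setup, where each action is an integer:
# Reset the environment to its starting state.
# env.reset() returns the initial state and an info dictionary.
state, info = env.reset()

# With render_mode='ansi', env.render() returns the grid as a string we can print.
print(env.render())

# Actions are integers: 0 = left, 1 = down, 2 = right, 3 = up.
# Take one step to the right and look at what comes back.
new_state, reward, terminated, truncated, info = env.step(2)
print(new_state, reward, terminated, truncated)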
Creating the Q-table
We're now going to construct our Q-table, and initialize all the Q-values to zero for each state-action pair.
Remember, the number of rows in the table is equivalent to the size of the state space in the environment, and the number of columns is equivalent to the size of the action space. We can get this information using env.observation_space.n and env.action_space.n, as shown below. We can then use this information to build the Q-table and fill it with zeros.
action_space_size = env.action_space.n
state_space_size = env.observation_space.n
q_table = np.zeros((state_space_size, action_space_size))
If you're foggy about Q-tables at all, be sure to check out the earlier post where we covered all the details you need for Q-tables.
Alright, here's our Q-table!
print(q_table)
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
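As a quick sanity check, the shape of the table lines up with the environment's spaces. For the default 4x4 map, that's 16 states and 4 actions (this snippet is optional and not part of the main script):
# The Q-table has one row per state and one column per action.
# For the default 4x4 Frozen Lake map, that's a 16 x 4 table.
print(q_table.shape)                       # (16, 4)
print(state_space_size, action_space_size) # 16 4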
Initializing Q-learning parameters
Now, we're going to create and initialize all the parameters needed to implement the Q-learning algorithm.
num_episodes = 10000
max_steps_per_episode = 100
learning_rate = 0.1
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01
Let's step through each of these.
First, with num_episodes, we define the total number of episodes we want the agent to play during training. Then, with max_steps_per_episode, we define the maximum number of steps that our agent is allowed to take within a single episode. So, if by the one-hundredth step the agent hasn't reached the frisbee or fallen through a hole, then the episode will terminate with the agent receiving zero points.
Next, we set our learning_rate, which was mathematically shown using the symbol \(\alpha\) in the previous post. Then, we also set our discount_rate, which was represented with the symbol \(\gamma\) previously.
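As a refresher from the previous post, both of these parameters appear in the Q-learning update rule, where the new Q-value for a state-action pair is a weighted sum of the old value and the learned value:
\[
q^{new}(s,a) = (1 - \alpha)\, q(s,a) + \alpha \Big( R_{t+1} + \gamma \max_{a'} q(s',a') \Big)
\]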
Now, the last four parameters all relate to the exploration-exploitation trade-off we talked about last time in regard to the epsilon-greedy policy. We're initializing our exploration_rate to 1, setting the max_exploration_rate to 1, and setting the min_exploration_rate to 0.01. The max and min are just bounds on how large or small our exploration rate can be. Remember, the exploration rate was represented with the symbol \(\epsilon\) when we discussed it previously.
Lastly, we set the exploration_decay_rate to 0.01 to determine the rate at which the exploration_rate will decay.
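To get a feel for what that decay looks like, here's a small sketch using an exponential decay schedule, a common choice for shrinking the exploration rate after each episode. The exact update we'll use appears in the next post's training loop; this just illustrates the shape of the decay using the parameters defined above:
# Sketch of an exponential decay schedule for the exploration rate.
# With each episode, epsilon shrinks from max_exploration_rate toward min_exploration_rate.
for episode in [0, 100, 500, 1000, 5000]:
    rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    print(episode, round(rate, 4))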
Now, all of these parameters can change. These are the parameters you'll want to play with and tune yourself to see how they influence the performance of the algorithm.
Wrapping up
Speaking of which, in the next post, we're going to jump right into the code that we'll write to implement the actual Q-learning algorithm for playing Frozen Lake.
For now, go ahead and make sure your environment is set up with Python and Gym and that you've got the initial code written that we went through so far.
Let me know in the comments if you were able to get everything up and running, and I'll see ya in the next post where we'll implement our first reinforcement learning algorithm!