
Applied Reinforcement Learning with Python
With OpenAI Gym, Tensorflow and Keras

Taweh Beysolow II

ISBN-13 (pbk): 978-1-4842-5126-3
ISBN-13 (electronic): 978-1-4842-5127-0
https://doi.org/10.1007/978-1-4842-5127-0

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/978-1-4842-5126-3. For more detailed information, please visit http://www.apress.com/source-code.

Taweh Beysolow II
San Francisco, CA, USA

Applied Reinforcement Learning with Python: With OpenAI Gym, Tensorflow and Keras
© 2019 by Taweh Beysolow II

Contents

Introduction

Chapter 1: Introduction to Reinforcement Learning
    History of Reinforcement Learning
    MDPs and their Relation to Reinforcement Learning
    Reinforcement Learning Algorithms and RL Frameworks
    Q Learning
        Actor-Critic Models
    Applications of Reinforcement Learning
        Classic Control Problems
        Super Mario Bros.
        Doom
        Reinforcement-Based Marketing Making
    Sonic the Hedgehog
    Conclusion

Chapter 2: Reinforcement Learning Algorithms
    OpenAI Gym
    Policy-Based Learning
    Policy Gradients Explained Mathematically
    Gradient Ascent Applied to Policy Optimization
    Using Vanilla Policy Gradients on the Cart Pole Problem
    What Are Discounted Rewards and Why Do We Use Them?
    Drawbacks to Policy Gradients
    Proximal Policy Optimization (PPO) and Actor-Critic Models
    Implementing PPO and Solving Super Mario Bros.
        Overview of Super Mario Bros.
        Installing Environment Package
        Structure of the Code in Repository
        Model Architecture
    Working with a More Difficult Reinforcement Learning Challenge
    Dockerizing Reinforcement Learning Experiments
    Results of the Experiment
    Conclusion

Chapter 3: Reinforcement Learning Algorithms: Q Learning and Its Variants
    Q Learning
    Temporal Difference (TD) Learning
    Epsilon-Greedy Algorithm
    Frozen Lake Solved with Q Learning
    Deep Q Learning
    Playing Doom with Deep Q Learning
        Simple Doom Level
    Training and Performance
    Limitations of Deep Q Learning
    Double Q Learning and Double Deep Q Networks
    Conclusion

Chapter 4: Market Making via Reinforcement Learning
    What Is Market Making?
    Trading Gym
    Why Reinforcement Learning for This Problem?
    Synthesizing Order Book Data with Trading Gym
    Generating Order Book Data with Trading Gym
    Experimental Design
        RL Approach 1: Policy Gradients
        RL Approach 2: Deep Q Network
    Results and Discussion
    Conclusion

Chapter 5: Custom OpenAI Reinforcement Learning Environments
    Overview of Sonic the Hedgehog
    Downloading the Game
    Writing the Code for the Environment
    A3C Actor-Critic
    Conclusion

Appendix A: Source Code
    Market Making Model Utilities
    Policy Gradient Utilities
    Models
    Chapter 1
        OpenAI Example
    Chapter 2
        Cart Pole Example
    Super Mario Example
    Chapter 3
        Frozen Lake Example
    Doom Example
    Chapter 4
        Market Making Example
    Chapter 5
        Sonic Example

Index

Introduction

It is a pleasure to return for a third title with Apress! This text will be the most complex of those I have written, but it will be a worthwhile addition to every data scientist’s and engineer’s library. The field of reinforcement learning has undergone significant change in the past couple of years, and it is worthwhile for everyone excited about artificial intelligence to immerse themselves in it. Because reinforcement learning sits at the frontier of artificial intelligence research, this text will be an excellent starting point to familiarize yourself with the status of the field as well as the most commonly used techniques. From this point, it is my hope that you will feel empowered to continue your own research and innovate in your own respective fields.

CHAPTER 1

Introduction to Reinforcement Learning

To those returning from my previous books, Introduction to Deep Learning Using R¹ and Applied Natural Language Processing with Python,² it is a pleasure to have you as readers again. To those who are new, welcome! Over the past year, Deep Learning packages and techniques have continued to proliferate and develop, revolutionizing various industries. One of the most exciting portions of this field, without a doubt, is Reinforcement Learning (RL). RL is often what underlies many generalized AI applications, such as software that learns to play video games or chess. The benefit of reinforcement learning is that the agent can familiarize itself with a large range of tasks, provided that the problems can be modeled within a framework containing actions, an environment, and one or more agents. Under that assumption, the problems can range from solving simple games, to more complex 3D games, to teaching self-driving cars how to pick up and drop off passengers in a variety of different places, as well as teaching a robotic arm how to grasp objects and place them on top of a kitchen counter.

¹ New York: Apress, 2018.
² New York: Apress, 2017.

The implications of well-trained and deployed RL algorithms are huge, as they seek to drive artificial intelligence beyond some of the narrow AI applications discussed in prior texts I have written. No longer is an algorithm simply predicting a target or label; instead, it is manipulating an agent in an environment, and that agent has a set of actions it can choose from to achieve a goal or reward. Examples of firms and organizations that devote much time to researching Reinforcement Learning are DeepMind and OpenAI, whose breakthroughs in the field are among the leading solutions. First, however, let us give a brief overview of the history of the field itself.

History of Reinforcement Learning

Reinforcement Learning is in some sense a rebranding of optimal control, a concept extending from control theory. Optimal control has its origins in the 1950s and 1960s, when it was used to describe problems in which one tries to satisfy a certain “optimal” criterion and asks what “control” law is needed to achieve that end. Typically, we define an optimal control as a set of differential equations. These equations then define a path toward values that minimize the value of the error function. The core of optimal control is the culmination of Richard Bellman’s work, specifically that of dynamic programming. Developed in the 1950s, dynamic programming is an optimization method that emphasizes solving a large individual problem by breaking it down into smaller and easier-to-solve components. It is also considered the only feasible method of solving stochastic optimal control problems, and one can moreover consider, in general, all of optimal control to be reinforcement learning.

Bellman’s most notable contribution to optimal control is the Hamilton-Jacobi-Bellman (HJB) equation:

$$\dot{V}(x, t) + \min_{u} \left\{ \nabla V(x, t) \cdot F(x, u) + C(x, u) \right\} = 0, \quad \text{s.t. } V(x, T) = D(x)$$

where $\dot{V}(x, t)$ = the partial derivative of $V$ w.r.t. the time variable $t$, $a \cdot b$ = the dot product of the vectors $a$ and $b$, $V(x, t)$ = the Bellman value function (unknown scalar), or the cost incurred from starting in state $x$ at time $t$ and controlling the system optimally until time $T$, $F$ = the function governing how the system state evolves, $C$ = the scalar cost rate function, $D$ = the final utility state function, $x(t)$ = the system state vector, $x(0)$ = an assumed given, and $u(t)$ for $0 \le t \le T$ = the control vector we are trying to find.

The solution yielded from this equation is the value function, or the minimum cost for a given dynamic system. The HJB equation is the standard method by which one solves an optimal control problem. Furthermore, dynamic programming is generally the only feasible method for solving stochastic optimal control problems. One class of problems that dynamic programming was developed to help solve is that of Markov decision processes (MDPs).

MDPs and their Relation to Reinforcement Learning

We describe an MDP as a discrete-time stochastic control process. Specifically, we define a discrete-time stochastic process as a random process in which the index variable takes a set of discrete, or specific, values (in contrast to continuous values). MDPs are particularly useful for situations in which outcomes are partially affected by participants in the process but the process also exhibits some degree of randomness. MDPs and dynamic programming thus become the basis of reinforcement learning theory.
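The dynamic programming idea described above can be illustrated in discrete time, where the analogue of the HJB equation is the Bellman recursion $V_t(x) = \min_u \{ C(x, u) + V_{t+1}(F(x, u)) \}$ with $V_T(x) = D(x)$. The following is a minimal sketch under stated assumptions, not code from this book: the state grid, the controls, and the particular `F`, `C`, and `D` functions are hypothetical choices made only so that the recursion can be run end to end.

```python
def backward_induction(states, controls, F, C, D, T):
    """Solve a finite-horizon, deterministic control problem by dynamic programming.

    V[t][x] is the minimum cost-to-go from state x at time t;
    policy[t][x] is a control that achieves it.
    """
    V = [dict() for _ in range(T + 1)]
    policy = [dict() for _ in range(T)]
    # Boundary condition: V(x, T) = D(x)
    for x in states:
        V[T][x] = D(x)
    # Work backward in time, reusing the already-solved later subproblems
    for t in range(T - 1, -1, -1):
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls:
                cost = C(x, u) + V[t + 1][F(x, u)]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            V[t][x] = best_cost
            policy[t][x] = best_u
    return V, policy


# Hypothetical toy problem: drive an integer state toward zero.
STATES = list(range(-3, 4))
CONTROLS = (-1, 0, 1)


def F(x, u):
    # System dynamics: move by u, clamped to the state grid
    return min(max(x + u, -3), 3)


def C(x, u):
    # Cost rate: penalize distance from zero and control effort
    return x * x + u * u


def D(x):
    # Final cost: penalize ending away from zero
    return x * x


V, policy = backward_induction(STATES, CONTROLS, F, C, D, T=3)
print(V[0][3], policy[0][3])
```

Each pass of the outer loop solves a smaller subproblem (one time step) and reuses the answers for later times, which is the sense in which dynamic programming breaks a large problem into easier components.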

Plainly stated, we assume based on the Markov property that the future is independent of the past given the present. In addition, a state is considered sufficient if it gives us the same description of the future as we would have with the entirety of the historical information. In essence, this means that the current state is the only piece of information that is relevant and that all historical information is no longer necessary. Mathematically, a state is said to have the Markov property iff

$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$$

Markov processes themselves are considered to be memoryless, in that they are random transitions from state to state. Furthermore, we consider them to be a tuple (S, P) on a state space S, where states change via a transition function P, defined as the following:

$$P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$$

where $S_t$ = the current Markov state and $S_{t+1}$ = the next state. This transition function describes a probability distribution over all of the possible states that the agent can transition to. Finally, we have a reward that we receive from moving from one state to another, which we define mathematically as the following:

$$R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$$

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{k-1} R_{t+k}$$

where $\gamma$ = the discount factor, $\gamma \in [0, 1]$, $G_t$ = the total discounted rewards, and $R$ = the reward function. We therefore define a Markov reward process (MRP) tuple as (S, P, R, γ). With all of these formulae now described, Figure 1-1 shows an example of a Markov decision process visualized.
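To make the transition function and the discounted return concrete, the sketch below samples an episode from a small Markov reward process and computes $G_t$ for it. The three states, their transition probabilities, and their rewards are hypothetical values chosen for illustration; only the formulas come from the definitions above.

```python
import random


def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


def sample_episode(P, R, start, terminal, rng, max_steps=100):
    """Follow the transition function P from `start` and collect rewards.

    P[s] maps each possible next state to its probability, and R[s] is the
    reward received on leaving state s.
    """
    state, rewards = start, []
    for _ in range(max_steps):
        if state == terminal:
            break
        rewards.append(R[state])
        next_states = list(P[state].keys())
        weights = list(P[state].values())
        # The next state depends only on the current state (Markov property)
        state = rng.choices(next_states, weights=weights)[0]
    return rewards


# Hypothetical MRP with two ordinary states and one terminal state
P = {
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 0.2, "end": 0.8},
}
R = {"A": 1.0, "B": -1.0}

rng = random.Random(0)
rewards = sample_episode(P, R, start="A", terminal="end", rng=rng)
print(rewards, discounted_return(rewards, gamma=0.9))
```

With a discount factor below 1, rewards received later in the episode contribute less to the total than rewards received immediately.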

Figure 1-1. Markov Decision Process

Figure 1-1 shows how an agent can, with varying probability, move from one state to another, receiving a reward. Ideally, we would learn to choose the sequence of actions that accumulates the most reward in a given episode before we fail, given the parameters of the environment. This, in essence, is a very basic explanation of reinforcement learning.

Another important component in the development of Reinforcement Learning was trial and error learning, which was one method of studying animal behavior. In particular, it proved useful for understanding the basic reward and punishment mechanisms that “reinforce” different behaviors. The words “Reinforcement Learning,” however, would not appear until the 1960s. During this period, the idea of the “credit-assignment problem” (CAP) was introduced, specifically by Marvin Minsky. Minsky was a cognitive scientist who devoted much of his lifetime to artificial intelligence, with works such as his book Perceptrons (1969) and the paper in which he describes the credit assignment problem, “Steps Toward Artificial Intelligence” (1961). The CAP asks how one distributes “credit” for success across all the decisions that were made in achieving that success. Many reinforcement learning algorithms are directly devoted to solving this precise problem. Nevertheless, trial and error learning largely became less popular, as neural network methods (and supervised learning in general), such as the innovations put forward by Bernard Widrow and Ted Hoff, took up most of the interest within the field of AI. A resurgence of interest in the field came in the 1980s, when temporal difference (TD) learning truly took off and Q learning was developed.

TD learning specifically was influenced by, ironically, another aspect of animal psychology that Minsky pointed out as being important. It comes from the idea of two stimuli: a primary reinforcer that becomes paired with a secondary reinforcer and subsequently influences behavior. TD learning itself, however, was largely developed by Richard S. Sutton. He is considered to be one of the most influential figures in the field of RL, as his doctoral thesis introduced the idea of temporal credit assignment. This refers to how rewards, particularly in very granular state-action spaces, can be delayed. For example, winning a game of chess requires many actions before one has achieved the “reward” of winning the game. Consequently, reward signals do not have a significant effect on temporally distant states. Temporal credit assignment therefore addresses how to reward these granular actions in a way that meaningfully affects temporally distant states. Q learning, named for the “Q” function that yields the expected reward for taking an action in a given state, builds on some of these innovations and focuses on finite Markov decision processes.

Q learning brings us to the present day, where further improvements to reinforcement learning are continually being made and represent the bleeding edge of AI. With this overview complete, let us more specifically discuss what readers can expect to learn.
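The delayed-reward problem behind temporal credit assignment can be seen in a small numerical sketch. The code below applies the standard one-step TD update, $V(s) \leftarrow V(s) + \alpha\,[r + \gamma V(s') - V(s)]$, to a hypothetical chain of states in which the only reward arrives at the very end. The chain, the step size, and the function name `td0_chain` are assumptions made for illustration and are not taken from this book's repository.

```python
def td0_chain(num_states, episodes, alpha, gamma):
    """Run TD(0) on a deterministic chain 0 -> 1 -> ... -> num_states - 1.

    The agent always moves right and receives a reward of 1.0 only on
    entering the final (terminal) state.
    """
    values = [0.0] * num_states
    terminal = num_states - 1
    for _ in range(episodes):
        state = 0
        while state != terminal:
            next_state = state + 1
            reward = 1.0 if next_state == terminal else 0.0
            # The terminal state has value 0, so it contributes nothing further
            target = reward + gamma * values[next_state]
            values[state] += alpha * (target - values[state])
            state = next_state
    return values


for n in (1, 2, 3):
    print(n, td0_chain(num_states=5, episodes=n, alpha=0.5, gamma=1.0))
```

After one episode only the state next to the goal has a nonzero value; with each further episode the estimate moves one state further back, which is how credit for the final reward reaches temporally distant states.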

Reinforcement Learning Algorithms and RL Frameworks

Reinforcement learning is in many ways analogous to supervised learning within traditional machine learning, although there are key differences. In supervised learning, there is an objective answer that we are training the model to predict correctly, whether that is a class label or a particular value, based on the input features from a given observation or observations. Features are analogous to the vectors within the given state of an environment, which we typically feed to the reinforcement learning algorithm either as a series of states or individually from one state to the next. The main difference, however, is that there is not necessarily one “answer” to a particular problem, in that there may be multiple ways by which a reinforcement learning algorithm could successfully solve it. In this instance, we want to choose the answer that we can arrive at quickest and that simultaneously solves the problem as efficiently as possible. This is precisely where our choice of model becomes critical.

In the prior overview of the history of RL, we introduced several ideas that we will walk through in detail in the following chapters. However, because this is an applied text, theory must be accompanied by examples. As such, we will spend a significant amount of time in this text discussing the RL framework OpenAI Gym and how it interfaces with different Deep Learning frameworks. OpenAI Gym is a framework that allows us to easily deploy, compare, and test Reinforcement Learning algorithms. It also has a great degree of flexibility, in that we can utilize Deep Learning methods alongside OpenAI Gym, which we will do in our various proofs of concept. The following shows some simple example code that utilizes the package, and Figure 1-2 shows a frame of the video it renders.

    import gym


    def cartpole():
        # Create the Cart Pole environment and put it in its initial state
        environment = gym.make('CartPole-v1')
        environment.reset()
        for step in range(50):
            environment.render()
            # Sample a random action; no learning takes place here
            action = environment.action_space.sample()
            # Assumes the Gym API prior to version 0.26, where step() returns four values
            observation, reward, done, info = environment.step(action)
            print("Step {}:".format(step))
            print("action: {}".format(action))
            print("observation: {}".format(observation))
            print("reward: {}".format(reward))
            print("done: {}".format(done))
            print("info: {}".format(info))
            if done:
                # The episode has ended, so start a new one
                environment.reset()
        environment.close()


    if __name__ == '__main__':
        cartpole()

Figure 1-2. Cart Pole Video Game

When reviewing the code, we notice that when working with gym, we must initialize an environment in which our algorithms sit. Although it is common to work with environments provided by the package, we can also create our own environments for custom tasks (like video games not provided by gym). Moving forward, however, let us discuss the other variables worth noting, as shown in the terminal output that follows.

    action: 1
    observation: [-0.02488139 0.00808876 0.0432061 0.02440099]
    reward: 1.0
    done: False
    info: {}

The variables can be broken down as follows:

• Action – Refers to action taken by the agent within an environment that subsequently yields a reward
• Reward – Yielded to the agent. Indicates the quality of action with respect to accomplishing some goal
• Observation – Yielded by the action: Refers to the state of the environment after an action has been performed
• Done – Boolean that indicates whether the environment needs to be reset
• Info – Dictionary with miscellaneous information for debugging

The process flow that describes the actions is shown in Figure 1-3.

Figure 1-3. Process Flow of RL Algorithm and Environment
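The loop between agent and environment described by these variables can be sketched without installing any package. The `LineWorld` class below is a hypothetical toy environment, not part of OpenAI Gym; it only imitates the `reset()` and four-value `step()` convention used in the Cart Pole code so that the roles of action, observation, reward, done, and info are easy to trace.

```python
import random


class LineWorld:
    """Hypothetical toy environment: walk along a line until the far end is reached."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Put the environment back in its initial state and return the observation
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right; any other action moves left
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        info = {}
        return self.position, reward, done, info


def run_episode(env, choose_action, max_steps=100):
    """Run one episode and return the total reward and the number of steps taken."""
    observation = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = choose_action(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
print(run_episode(LineWorld(), lambda obs: 1))
print(run_episode(LineWorld(), lambda obs: rng.choice((0, 1))))
```

The agent here is just a function from observation to action. Replacing that function with something that learns from the rewards is the subject of the algorithms introduced next.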

To provide more context, Figure 1-2 shows a cart and pole video game, where the objective is to balance the pole on the cart such that the pole never falls over. As such, a reasonable objective would be to train some DL or ML algorithm to do this. We will tackle this particular problem later in the book, however. The purpose of this section is just to briefly introduce OpenAI Gym.

Q Learning

We briefly discussed Q learning in the introduction; however, it is worthwhile to highlight the significant portion of this text we will devote to this topic. Q learning is characterized by the fact that there is some policy, which informs an agent of the actions to take in different scenarios. While it does not require a model, we can use one, and it is often applied to finite Markov decision processes. Specifically, the variants we will tackle in this text are Q learning, Deep Q Learning (DQL), and Double Q Learning (Figure 1-4).
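As a preview of the idea, the sketch below runs tabular Q learning with epsilon-greedy exploration on a hypothetical five-state chain. The environment, hyperparameters, and helper names are assumptions made for illustration; the update itself is the standard rule $Q(s, a) \leftarrow Q(s, a) + \alpha\,[r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$, and no model of the environment is used by the learner.

```python
import random

N_STATES = 5       # states 0..4; state 4 is the terminal goal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical chain environment: reward 1.0 only on reaching the goal."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q_row, rng):
    # Break ties at random so that an all-zero table does not favor one action
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(Q[state], rng)
            next_state, reward, done = step(state, action)
            # Bootstrap from the best action in the next state unless it is terminal
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = q_learning()
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The learned policy is read off the table by taking the highest-valued action in each state, which is the sense in which the Q function informs the agent of what to do in different scenarios.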

Figure 1-4. Q Learning Flow Chart

We will discuss this in more depth in the chapters that specifically reference these techniques; however, Q learning and Deep Q Learning each have respective advantages given the complexity of the problem, while both often suffer from similar shortcomings.

Actor-Critic Models

The most advanced of the models we will be tackling in this book are the Actor-Critic Models, which comprise A2C and A3C. These stand for Advantage Actor-Critic and Asynchronous Advantage Actor-Critic, respectively. While the two are virtually the same, the difference is that the latter has multiple models that work alongside each other and update the parameters independently, while the former updates its parameters for all of the models simultaneously. These models update on a more granular basis (action to action) rather than in an episodic manner, as many of the other Reinforcement Learning algorithms do. Figure 1-5 shows an example of the Actor-Critic Models visualized.
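The action-to-action updating described above can be sketched with a single tabular actor and critic. This is a deliberately simplified, single-worker illustration and is neither A2C nor A3C as implemented later in the book: there are no neural networks and no parallel workers, and the chain environment, learning rates, and function names are all hypothetical. The critic estimates state values, and the actor's softmax preferences are nudged after every step by the one-step advantage estimate $r + \gamma V(s') - V(s)$.

```python
import math
import random

N_STATES = 5   # states 0..4; state 4 is the terminal goal


def softmax(preferences):
    # Subtract the maximum for numerical stability before exponentiating
    highest = max(preferences)
    exps = [math.exp(p - highest) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Hypothetical chain environment: action 1 moves right, action 0 moves left."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def actor_critic(episodes=1000, alpha_actor=0.2, alpha_critic=0.2, gamma=0.9, seed=0):
    rng = random.Random(seed)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: action preferences
    values = [0.0] * N_STATES                        # critic: state values
    for _ in range(episodes):
        state = 0
        for _ in range(200):
            probs = softmax(prefs[state])
            action = 0 if rng.random() < probs[0] else 1
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * values[next_state]
            advantage = target - values[state]
            # Critic update: move the value estimate toward the one-step target
            values[state] += alpha_critic * advantage
            # Actor update: gradient of log softmax, scaled by the advantage
            for a in (0, 1):
                indicator = 1.0 if a == action else 0.0
                prefs[state][a] += alpha_actor * advantage * (indicator - probs[a])
            state = next_state
            if done:
                break
    return prefs, values


prefs, values = actor_critic()
print([round(softmax(p)[1], 3) for p in prefs[:-1]])
```

Both tables change after every single transition rather than at the end of an episode. In A2C and A3C the same kind of update is computed by several workers, combined either simultaneously or asynchronously.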

Figure 1-5. Actor-Critic Models Visualized

Applications of Reinforcement Learning

After the reader has been thoroughly introduced to the concepts of reinforcement learning, we will tackle multiple problems, with a focus on showing the reader how to train and deploy solutions on cloud environments.

Classic Control Problems

Because the field of optimal control has been around for roughly the past 60 years, there are a handful of problems that readers will often see referenced in other reinforcement learning literature, and we will begin by tackling these. One of them is the cart pole problem, which is referenced in Figure 1-2. This is a game in which the player must try to balance a pole on a cart using the optimal set of actions. Another, shown in Figure 1-6, is called Frozen Lake, in which the agent learns how to cross a frozen lake without stepping on ice that would cause the agent to fall through.

Figure 1-6. Frozen Lake Visualized

Super Mario Bros.

One of the most beloved video games of all time turns out to be one of the best ways to display how reinforcement learning can be applied to virtual environments. With the help of the py_nes library, we are able to emulate Super Mario Bros. (Figure 1-7) and then utilize the data from the game to train a model to play the level. We will focus on one level exclusively and will utilize AWS resources for this application, giving readers an opportunity to gain experience with them.
