Reinforcement Learning for Finance: A Python-Based Introduction
Yves Hilpisch
ISBN: 978-1-098-16914-5    US $69.99    CAN $87.99
DATA SCIENCE / MACHINE LEARNING

Dr. Yves Hilpisch is founder and CEO of The Python Quants, a group that focuses on the use of open source technologies for financial data science, AI, asset management, algorithmic trading, and computational finance. He is also director of the Certificate in Python for Finance (CPF) Program.

Reinforcement learning (RL) has led to several breakthroughs in AI. The use of the deep Q-learning (DQL) algorithm alone has helped people develop agents that play arcade games and board games at a superhuman level. More recently, RL, DQL, and similar methods have gained popularity in publications related to financial research.

This book is among the first to explore the use of reinforcement learning methods in finance. Author Yves Hilpisch, founder and CEO of The Python Quants, provides the background you need in concise fashion. ML practitioners, financial traders, portfolio managers, strategists, and analysts will focus on the implementation of these algorithms in the form of self-contained Python code and the application to important financial problems.

This book covers:

• Reinforcement learning
• Deep Q-learning
• Actor-critic algorithm
• Python implementations of these algorithms
• How to apply the algorithms to financial problems such as algorithmic trading, dynamic hedging, and dynamic asset allocation

This book is the ideal reference on this topic. You’ll read it once, change the examples according to your needs or ideas, and refer to it whenever you work with RL for finance.

“Reinforcement Learning for Finance is an indispensable resource for anyone eager to learn and apply RL in real-world finance. The book expertly bridges the gap between theory and practice, offering clear explanations alongside detailed Python code.
It’s a must-read for students, academics, and practitioners looking to deepen and enhance their technical expertise in this cutting-edge field.”
—Ivilina Popova, Professor of Finance, Texas State University
Reinforcement Learning for Finance
by Yves Hilpisch

Copyright © 2025 Yves Hilpisch. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Corbin Collins
Production Editor: Beth Kelly
Copyeditor: Doug McNair
Proofreader: Heather Walley
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

October 2024: First Edition

Revision History for the First Edition
2024-10-14: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098169145 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Reinforcement Learning for Finance, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-16914-5
Table of Contents

Preface  vii

Part I. The Basics

1. Learning Through Interaction  3
   Bayesian Learning  3
   Tossing a Biased Coin  4
   Rolling a Biased Die  7
   Bayesian Updating  9
   Reinforcement Learning  11
   Major Breakthroughs  12
   Major Building Blocks  14
   Deep Q-Learning  16
   Conclusions  17
   References  17

2. Deep Q-Learning  19
   Decision Problems  20
   Dynamic Programming  21
   Q-Learning  24
   CartPole as an Example  26
   The Game Environment  26
   A Random Agent  28
   The DQL Agent  29
   Q-Learning Versus Supervised Learning  34
   Conclusions  34
   References  35

3. Financial Q-Learning  37
   Finance Environment  37
   DQL Agent  43
   Where the Analogy Fails  45
   Limited Data  45
   No Impact  46
   Conclusions  48
   References  48

Part II. Data Augmentation

4. Simulated Data  51
   Noisy Time Series Data  52
   Simulated Time Series Data  56
   Conclusions  62
   References  63
   DQLAgent Python Class  64

5. Generated Data  67
   Simple Example  68
   Financial Example  73
   Kolmogorov-Smirnov Test  78
   Conclusions  80
   References  81

Part III. Financial Applications

6. Algorithmic Trading  85
   Prediction Game Revisited  86
   Trading Environment  89
   Trading Agent  94
   Conclusions  97
   References  98
   Finance Environment  98
   DQLAgent Class  100
   Simulation Environment  102

7. Dynamic Hedging  105
   Delta Hedging  106
   Hedging Environment  115
   Hedging Agent  121
   Conclusions  126
   References  127
   BSM (1973) Formula  127

8. Dynamic Asset Allocation  129
   Two-Fund Separation  130
   Two-Asset Case  146
   Three-Asset Case  154
   Equally Weighted Portfolio  160
   Conclusions  161
   References  161
   Three-Asset Code  162

9. Optimal Execution  167
   The Model  168
   Model Implementation  170
   Execution Environment  176
   Random Agent  179
   Execution Agent  181
   Conclusions  187
   References  188

10. Concluding Remarks  189
   References  191

Index  193
Preface

Tell me and I forget. Teach me and I remember. Involve me and I learn.
—Benjamin Franklin

Reinforcement learning (RL) has enabled a number of breakthroughs in AI. One of the key algorithms in RL is deep Q-learning (DQL), which can be applied to a large number of dynamic decision problems. Popular examples are arcade games and board games, such as Go, in which RL and DQL algorithms have achieved superhuman performance in many instances. This has often happened despite the belief of experts that such feats would be impossible for decades to come.

Finance is a discipline with a strong connection between theory and practice. Theoretical advancements often find their way quickly into the applied domain. Many problems in finance are dynamic decision problems, such as the optimal allocation of assets over time. Therefore it is, on the one hand, theoretically interesting to apply DQL to financial problems. On the other hand, it is also in general quite easy and straightforward to apply such algorithms—usually after some thorough testing—in the financial markets.

In recent years, financial research has seen strong growth in publications related to RL, DQL, and related methods applied to finance. However, there is hardly any resource in book form—beyond the purely theoretical ones—for those who are looking for an applied introduction to this exciting field. This book closes that gap in that it provides the required background in concise fashion and otherwise focuses on the implementation of the algorithms in the form of self-contained Python code and their application to important financial problems.

Target Audience

This book is intended as a concise, Python-based introduction to the major ideas and elements of RL and DQL as applied to finance. It should be useful to both students and academics as well as to practitioners in search of alternatives to existing financial
theories and algorithms. The book expects basic knowledge of the Python programming language, object-oriented programming, and the major Python packages used in data science and machine learning, such as NumPy, pandas, matplotlib, scikit-learn, and TensorFlow.

Overview of the Book

The book consists of the following chapters:

Chapter 1
The first chapter focuses on learning through interaction with four major examples: probability matching, Bayesian updating, RL, and DQL.

Chapter 2
The second chapter introduces concepts from dynamic programming (DP) and discusses DQL as an approach to approximate solutions to DP problems. The major theme is the derivation of optimal policies to maximize a given objective function through taking a sequence of actions and updating the optimal policy iteratively. DQL is illustrated on the basis of a DQL agent that learns to play the CartPole game from the Gymnasium Python package.

Chapter 3
The third chapter develops a first Finance environment that allows the DQL agent from Chapter 2 to learn a financial prediction game. Although the environment formally replicates the API of the CartPole game, it misses some important characteristics that are needed to apply RL successfully.

Chapter 4
The fourth chapter is about data augmentation based on Monte Carlo simulation (MCS) approaches, and it discusses the addition of noise to historical data and the simulation of stochastic processes.

Chapter 5
The fifth chapter introduces generative adversarial networks (GANs) to synthetically generate time series data that has statistical characteristics similar to those of the historical time series data on which a GAN was trained.

Chapter 6
Building on the example from Chapter 3, this chapter applies DQL to the problem of algorithmic trading based on the prediction of the next price movement’s direction.
Chapter 7
The seventh chapter is about learning optimal dynamic hedging strategies for an option with European exercise in the Black-Scholes-Merton (1973) model. In other words, delta hedging or dynamic replication of the option is the goal.

Chapter 8
This chapter applies DQL to three canonical examples in asset management: one risky asset and one risk-free asset, two risky assets, and three risky assets. The problem is to dynamically allocate funds to the available assets to maximize a profit target or a risk-adjusted return (Sharpe ratio).

Chapter 9
The ninth chapter is about the optimal liquidation of a large position in a stock. Given a certain risk aversion, the total execution costs are to be minimized. This use case differs from the others in that all actions are tightly connected with each other through an additional constraint. The chapter also introduces an additional RL algorithm in the form of an actor-critic implementation.

Chapter 10
The final chapter of the book provides some concluding remarks and sketches out how the examples presented in the book can be improved upon.

About the Code in This Book

The code in this book is primarily developed using TensorFlow 2.13. Readers can run the code directly on The Python Quants’ Quant Platform with no additional installations required—only a free registration. This platform allows readers to effortlessly execute the code and reproduce the results as presented in the book. The code is also available for download to run locally. Future updates, such as support for newer TensorFlow versions, are planned. Additionally, the Quant Platform offers access to a user forum where readers can ask questions and receive support on all topics related to the book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or with values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://rl4f.pqp.io.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example, this book would be attributed as “Reinforcement Learning for Finance by Yves Hilpisch (O’Reilly). Copyright 2025 Yves Hilpisch, 978-1-098-16914-5.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-827-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/RL-for-finance.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media

Watch us on YouTube: https://youtube.com/oreillymedia
Acknowledgments

The contents of this book evolved through a series of online webinars, classes within the CPF Program, and workshops at conferences across Europe and the USA. I extend my sincere thanks to all participants whose valuable feedback helped shape the final version of this work.

A special thank you goes to Dr. Ivilina Popova for her insightful feedback on the financial sections and the book as a whole. Her contributions were instrumental in refining the content. I am also grateful to the entire O’Reilly team for their professionalism and ongoing support. Their constructive input and thoughtful suggestions led to significant improvements throughout the manuscript.

This book is dedicated to Sandra and Henry. To Sandra, for her unwavering love and support throughout this journey. To Henry, with the hope that this work will inspire him in his studies of data science and artificial intelligence, and fuel his passion for learning.
PART I
The Basics

The first part of the book covers the basics of reinforcement learning and provides background information. It consists of three chapters:

• Chapter 1 focuses on learning through interaction with four major examples: probability matching, Bayesian updating, reinforcement learning (RL), and deep Q-learning (DQL).

• Chapter 2 introduces concepts from dynamic programming (DP) and discusses DQL as an approach to approximate solutions to DP problems. The major theme is the derivation of optimal policies to maximize a given objective function through taking a sequence of actions and updating the optimal policy iteratively. DQL is illustrated based on the CartPole game from the Gymnasium Python package.

• Chapter 3 develops a first Finance environment that allows the DQL agent from Chapter 2 to learn a financial prediction game. Although the environment formally replicates the API of the CartPole game, it misses some important characteristics that are needed to apply RL successfully.
CHAPTER 1
Learning Through Interaction

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning.
—Sutton and Barto (2018)

For human beings and animals alike, learning is almost as fundamental as breathing. It is something that happens continuously and most often unconsciously. There are different forms of learning. The one most important to the topics covered in this book is based on interacting with an environment. Interaction with an environment provides the learner—or agent henceforth—with feedback that can be used to update their knowledge or to refine a skill. In this book, we are mostly interested in learning quantifiable facts about an environment, such as the odds of winning a bet or the reward that an action yields.

The next section discusses Bayesian learning as an example of learning through interaction. “Reinforcement Learning” on page 11 presents breakthroughs in AI that were made possible through RL. It also describes the major building blocks of RL. “Deep Q-Learning” on page 16 explains the two major characteristics of DQL, which is the most important algorithm in the remainder of the book.

Bayesian Learning

Two examples illustrate learning by interacting with an environment: tossing a biased coin and rolling a biased die. The examples are based on the idea that an agent betting repeatedly on the outcome of a biased gamble (and remembering all outcomes) can learn bet-by-bet about the gamble’s bias and thereby about the optimal policy for betting. The idea, in that sense, makes use of Bayesian updating. Bayes’ theorem and Bayesian updating date back to the 18th century (Bayes and Price 1763). A modern and Python-based discussion of Bayesian statistics is found in Downey (2021).
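As a small preview of the Bayesian updating idea, the following sketch estimates the probability of heads of a biased coin from observed tosses using a conjugate Beta-Bernoulli model. This is my own illustration under stated assumptions, not code from the book; the helper name update_beta is made up for this example.

```python
import numpy as np

def update_beta(a, b, outcome):
    # With a Beta(a, b) prior on P(heads), a single observed toss updates
    # the posterior in closed form: heads increments a, tails increments b.
    return a + outcome, b + (1 - outcome)

rng = np.random.default_rng(seed=100)
a, b = 1, 1  # flat prior Beta(1, 1)
for _ in range(1000):
    s = rng.choice([1, 1, 1, 1, 0])  # coin biased 80% toward heads
    a, b = update_beta(a, b, s)

posterior_mean = a / (a + b)  # point estimate of P(heads)
print(round(posterior_mean, 2))
```

After 1,000 tosses, the posterior mean lies close to the true bias of 0.8, which is exactly the kind of bet-by-bet learning about a gamble's bias that the coin and die examples below illustrate via simulation.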
Tossing a Biased Coin

Assume the simple game of betting on the outcome of tossing a biased coin. As a benchmark, consider the special case of an unbiased coin first. Agents are allowed to bet for free on the outcome of the coin tosses. An agent might, for example, bet randomly on either heads or tails. The reward is 1 USD if the agent wins and nothing if the agent loses. The agent’s goal is to maximize the total reward. The following Python code simulates several sequences of 100 bets each:

In [1]: import numpy as np
        from numpy.random import default_rng
        rng = default_rng(seed=100)

In [2]: ssp = [1, 0]

In [3]: asp = [1, 0]

In [4]: def epoch():
            tr = 0
            for _ in range(100):
                a = rng.choice(asp)
                s = rng.choice(ssp)
                if a == s:
                    tr += 1
            return tr

In [5]: rl = np.array([epoch() for _ in range(250)])
        rl[:10]
Out[5]: array([56, 47, 48, 55, 55, 51, 54, 43, 55, 40])

In [6]: rl.mean()
Out[6]: 49.968

The state space, 1 for heads and 0 for tails
The action space, 1 for a bet on heads and 0 for one on tails
The random bet
The random coin toss
The reward for a winning bet
The simulation of multiple sequences of bets
The average total reward

The average total reward in this benchmark case is close to 50. The same result might be achieved by solely betting on either heads or tails.
Assume now that the coin is biased so that heads prevails in 80% of the coin tosses. Betting solely on heads would yield an average total reward of about $80 for 100 bets. Betting solely on tails would yield an average total reward of about $20. But what about the random betting strategy? The following Python code simulates this case:

In [7]: ssp = [1, 1, 1, 1, 0]

In [8]: asp = [1, 0]

In [9]: def epoch():
            tr = 0
            for _ in range(100):
                a = rng.choice(asp)
                s = rng.choice(ssp)
                if a == s:
                    tr += 1
            return tr

In [10]: rl = np.array([epoch() for _ in range(250)])
         rl[:10]
Out[10]: array([53, 56, 40, 55, 53, 49, 43, 45, 50, 51])

In [11]: rl.mean()
Out[11]: 49.924

The biased state space
The same action space as before

Although the coin is now highly biased, the average total reward of the random betting strategy is about the same as in the benchmark case. This might sound counterintuitive. However, the expected win rate is given by 0.8 · 0.5 + 0.2 · 0.5 = 0.5. In words, when betting on heads, the win rate is 80%, and when betting on tails, it is 20%. Together, the total reward is as before, on average. As a consequence, without learning, the agent is not able to capitalize on the bias.

A learning agent, on the other hand, can gain an edge by basing the betting strategy on the previous outcomes they observe. To this end, it is already enough to record all observed outcomes and to choose randomly from the set of all previous outcomes. In this case, the bias is reflected in the number of times the agent randomly bets on heads as compared to tails. The Python code that follows illustrates this simple learning strategy:

In [12]: ssp = [1, 1, 1, 1, 0]

In [13]: def epoch(n):
             tr = 0
             asp = [0, 1]
             for _ in range(n):
                 a = rng.choice(asp)
                 s = rng.choice(ssp)
                 if a == s:
                     tr += 1
                 asp.append(s)
             return tr

In [14]: rl = np.array([epoch(100) for _ in range(250)])
         rl[:10]
Out[14]: array([71, 65, 67, 69, 68, 72, 68, 68, 77, 73])

In [15]: rl.mean()
Out[15]: 66.78

The initial action space
The update of the action space with the observed outcome

With remembering and learning, the agent achieves an average total reward of about $66.80—a significant improvement over the random strategy without learning. This is close to the expected value of (0.8² + 0.2²) · 100 = 68. This strategy, while not optimal, is regularly observed in experiments involving human beings—and, maybe somewhat surprisingly, in animals as well. It is called probability matching.

On the other hand, the agent can do better by simply betting on the most likely outcome as derived from past results. The following Python code implements this strategy:

In [16]: from collections import Counter

In [17]: ssp = [1, 1, 1, 1, 0]

In [18]: def epoch(n):
             tr = 0
             asp = [0, 1]
             for _ in range(n):
                 c = Counter(asp)
                 a = c.most_common()[0][0]
                 s = rng.choice(ssp)
                 if a == s:
                     tr += 1
                 asp.append(s)
             return tr

In [19]: rl = np.array([epoch(100) for _ in range(250)])
         rl[:10]
Out[19]: array([81, 70, 74, 77, 82, 74, 81, 80, 77, 78])

In [20]: rl.mean()
Out[20]: 78.828
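The expected rewards of the three betting strategies can also be checked analytically. The following sketch (my own illustration, not code from the book) computes the expected totals for 100 bets on a coin with an 80% bias toward heads:

```python
# Expected total rewards for n bets on a coin with P(heads) = p:
# random betting wins with probability p/2 + (1 - p)/2, probability
# matching with p^2 + (1 - p)^2, and majority betting with max(p, 1 - p).
p, n = 0.8, 100

random_bet = (p * 0.5 + (1 - p) * 0.5) * n   # uninformed random betting
prob_matching = (p**2 + (1 - p)**2) * n      # bet heads with frequency p
majority_bet = max(p, 1 - p) * n             # always bet the most likely side

print(round(random_bet), round(prob_matching), round(majority_bet))
```

The three values, 50, 68, and 80, line up with the simulated averages of roughly 49.9, 66.8, and 78.8 reported above.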