Multi-Agent Reinforcement Learning
Foundations and Modern Approaches
Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer
ISBN 978-0-262-04864-4
US $90.00 / CAN $119.00
Praise for Multi-Agent Reinforcement Learning

“The authors meticulously bring reinforcement learning together with game theory to provide a foundation for research and application of multi-agent reinforcement learning. This book is the perfect starting point for a grounding in the field.”
– Andrew Barto, Professor Emeritus, University of Massachusetts Amherst

“Multi-agent reinforcement learning is well positioned to be the next white-hot area of artificial intelligence and this book provides the essential background, concepts, and insights for understanding this exciting and important area of research.”
– Michael L. Littman, Professor of Computer Science, Brown University

“This book is the first complete reference for the growing area of multi-agent reinforcement learning. It provides both an essential resource for newcomers to the field and a valuable perspective for established researchers.”
– Peter Stone, Professor of Computer Science, The University of Texas at Austin

“A landmark textbook on multiagent reinforcement learning, combining game-theoretic foundations with state-of-the-art deep learning. This essential textbook delivers fundamental insights for newcomers, experts and practitioners, featuring real-world applications and advanced algorithms.”
– Karl Tuyls, Professor of Computer Science, University of Liverpool

“This will become the standard text of the emerging field of multiagent reinforcement learning. It builds from foundational ideas, incorporating recent breakthroughs in deep learning. This book will help accelerate theoretical and practical progress.”
– Mykel J. Kochenderfer, Professor of Aeronautics and Astronautics, Stanford University
Multi-Agent Reinforcement Learning
Foundations and Modern Approaches

Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer

The MIT Press
Cambridge, Massachusetts
London, England
© 2024 Massachusetts Institute of Technology

This work is subject to a Creative Commons CC-BY-NC-ND license. This license applies only to the work in full and not to any components included with permission. Subject to such license, all rights are reserved. No part of this book may be used to train artificial intelligence systems without permission in writing from the MIT Press.

The MIT Press would like to thank the anonymous peer reviewers who provided comments on drafts of this book. The generous work of academic experts is essential for establishing the authority and quality of our publications. We acknowledge with gratitude the contributions of these otherwise uncredited readers.

This book was set in Times New Roman by Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Names: Albrecht, Stefano V., author. | Christianos, Filippos, author. | Schäfer, Lukas, author.
Title: Multi-agent reinforcement learning : foundations and modern approaches / Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer, The University of Edinburgh, United Kingdom.
Description: Cambridge, Massachusetts : The MIT Press, [2024] | Includes bibliographical references and index.
Identifiers: LCCN 2024002669 (print) | LCCN 2024002670 (ebook) | ISBN 9780262049375 (hardcover) | ISBN 9780262380508 (epub) | ISBN 9780262380515 (pdf)
Subjects: LCSH: Reinforcement learning. | Intelligent agents (Computer software)
Classification: LCC Q325.6 .A43 2024 (print) | LCC Q325.6 (ebook) | DDC 006.3/1–dc23/eng/20240412
LC record available at https://lccn.loc.gov/2024002669
LC ebook record available at https://lccn.loc.gov/2024002670
For A.H.A. and R.J.A. — dream big, pursue your goals, lead the way.
For A.E.A. — this book would not be here without you.
–S.V.A.

To Daphne — for her inexhaustible love and support in the writing of a book she will probably never read.
–F.C.

For Jonas, Daniel, Annette, and Stefan — for your love, support, and the curiosity you sparked in me.
–L.S.
Contents

Summary of Notation
List of Figures
Preface

1  Introduction
   1.1  Multi-Agent Systems
   1.2  Multi-Agent Reinforcement Learning
   1.3  Application Examples
        1.3.1  Multi-Robot Warehouse Management
        1.3.2  Competitive Play in Board Games and Video Games
        1.3.3  Autonomous Driving
        1.3.4  Automated Trading in Electronic Markets
   1.4  Challenges of MARL
   1.5  Agendas of MARL
   1.6  Book Contents and Structure

I  FOUNDATIONS OF MULTI-AGENT REINFORCEMENT LEARNING

2  Reinforcement Learning
   2.1  General Definition
   2.2  Markov Decision Processes
   2.3  Expected Discounted Returns and Optimal Policies
   2.4  Value Functions and Bellman Equation
   2.5  Dynamic Programming
   2.6  Temporal-Difference Learning
   2.7  Evaluation with Learning Curves
   2.8  Equivalence of R(s, a, s′) and R(s, a)
   2.9  Summary

3  Games: Models of Multi-Agent Interaction
   3.1  Normal-Form Games
   3.2  Repeated Normal-Form Games
   3.3  Stochastic Games
   3.4  Partially Observable Stochastic Games
        3.4.1  Belief States and Filtering
   3.5  Modeling Communication
   3.6  Knowledge Assumptions in Games
   3.7  Dictionary: Reinforcement Learning ↔ Game Theory
   3.8  Summary

4  Solution Concepts for Games
   4.1  Joint Policy and Expected Return
   4.2  Best Response
   4.3  Minimax
        4.3.1  Minimax Solution via Linear Programming
   4.4  Nash Equilibrium
   4.5  ϵ-Nash Equilibrium
   4.6  (Coarse) Correlated Equilibrium
        4.6.1  Correlated Equilibrium via Linear Programming
   4.7  Conceptual Limitations of Equilibrium Solutions
   4.8  Pareto Optimality
   4.9  Social Welfare and Fairness
   4.10 No-Regret
   4.11 The Complexity of Computing Equilibria
        4.11.1 PPAD Complexity Class
        4.11.2 Computing ϵ-Nash Equilibrium Is PPAD-Complete
   4.12 Summary

5  Multi-Agent Reinforcement Learning in Games: First Steps and Challenges
   5.1  General Learning Process
   5.2  Convergence Types
   5.3  Single-Agent RL Reductions
        5.3.1  Central Learning
        5.3.2  Independent Learning
        5.3.3  Example: Level-Based Foraging
   5.4  Challenges of MARL
        5.4.1  Non-Stationarity
        5.4.2  Equilibrium Selection
        5.4.3  Multi-Agent Credit Assignment
        5.4.4  Scaling to Many Agents
   5.5  What Algorithms Do Agents Use?
        5.5.1  Self-Play
        5.5.2  Mixed-Play
   5.6  Summary

6  Multi-Agent Reinforcement Learning: Foundational Algorithms
   6.1  Dynamic Programming for Games: Value Iteration
   6.2  Temporal-Difference Learning for Games: Joint-Action Learning
        6.2.1  Minimax Q-Learning
        6.2.2  Nash Q-Learning
        6.2.3  Correlated Q-Learning
        6.2.4  Limitations of Joint-Action Learning
   6.3  Agent Modeling
        6.3.1  Fictitious Play
        6.3.2  Joint-Action Learning with Agent Modeling
        6.3.3  Bayesian Learning and Value of Information
   6.4  Policy-Based Learning
        6.4.1  Gradient Ascent in Expected Reward
        6.4.2  Learning Dynamics of Infinitesimal Gradient Ascent
        6.4.3  Win or Learn Fast
        6.4.4  Win or Learn Fast with Policy Hill Climbing
        6.4.5  Generalized Infinitesimal Gradient Ascent
   6.5  No-Regret Learning
        6.5.1  Unconditional and Conditional Regret Matching
        6.5.2  Convergence of Regret Matching
   6.6  Summary

II  MULTI-AGENT DEEP REINFORCEMENT LEARNING: ALGORITHMS AND PRACTICE

7  Deep Learning
   7.1  Function Approximation for Reinforcement Learning
   7.2  Linear Function Approximation
   7.3  Feedforward Neural Networks
        7.3.1  Neural Unit
        7.3.2  Activation Functions
        7.3.3  Composing a Network from Layers and Units
   7.4  Gradient-Based Optimization
        7.4.1  Loss Function
        7.4.2  Gradient Descent
        7.4.3  Backpropagation
   7.5  Convolutional and Recurrent Neural Networks
        7.5.1  Learning from Images—Exploiting Spatial Relationships in Data
        7.5.2  Learning from Sequences with Memory
   7.6  Summary

8  Deep Reinforcement Learning
   8.1  Deep Value Function Approximation
        8.1.1  Deep Q-Learning—What Can Go Wrong?
        8.1.2  Moving Target Problem
        8.1.3  Breaking Correlations
        8.1.4  Putting It All Together: Deep Q-Networks
        8.1.5  Beyond Deep Q-Networks
   8.2  Policy Gradient Algorithms
        8.2.1  Advantages of Learning a Policy
        8.2.2  Policy Gradient Theorem
        8.2.3  REINFORCE: Monte Carlo Policy Gradient
        8.2.4  Actor-Critic Algorithms
        8.2.5  A2C: Advantage Actor-Critic
        8.2.6  PPO: Proximal Policy Optimization
        8.2.7  Policy Gradient Algorithms in Practice
        8.2.8  Concurrent Training of Policies
   8.3  Observations, States, and Histories in Practice
   8.4  Summary

9  Multi-Agent Deep Reinforcement Learning
   9.1  Training and Execution Modes
        9.1.1  Centralized Training and Execution
        9.1.2  Decentralized Training and Execution
        9.1.3  Centralized Training with Decentralized Execution
   9.2  Notation for Multi-Agent Deep Reinforcement Learning
   9.3  Independent Learning
        9.3.1  Independent Value-Based Learning
        9.3.2  Independent Policy Gradient Methods
        9.3.3  Example: Deep Independent Learning in a Large Task
   9.4  Multi-Agent Policy Gradient Algorithms
        9.4.1  Multi-Agent Policy Gradient Theorem
        9.4.2  Centralized Critics
        9.4.3  Centralized Action-Value Critics
        9.4.4  Counterfactual Action-Value Estimation
        9.4.5  Equilibrium Selection with Centralized Action-Value Critics
   9.5  Value Decomposition in Common-Reward Games
        9.5.1  Individual-Global-Max Property
        9.5.2  Linear Value Decomposition
        9.5.3  Monotonic Value Decomposition
        9.5.4  Value Decomposition in Practice
        9.5.5  Beyond Monotonic Value Decomposition
   9.6  Agent Modeling with Neural Networks
        9.6.1  Joint-Action Learning with Deep Agent Models
        9.6.2  Learning Representations of Agent Policies
   9.7  Environments with Homogeneous Agents
        9.7.1  Parameter Sharing
        9.7.2  Experience Sharing
   9.8  Policy Self-Play in Zero-Sum Games
        9.8.1  Monte Carlo Tree Search
        9.8.2  Self-Play MCTS
        9.8.3  Self-Play MCTS with Deep Neural Networks: AlphaZero
   9.9  Population-Based Training
        9.9.1  Policy Space Response Oracles
        9.9.2  Convergence of PSRO
        9.9.3  Grandmaster Level in StarCraft II: AlphaStar
   9.10 Summary

10  Multi-Agent Deep Reinforcement Learning in Practice
    10.1  The Agent-Environment Interface
    10.2  MARL Neural Networks in PyTorch
          10.2.1  Seamless Parameter Sharing Implementation
          10.2.2  Defining the Models: An Example with IDQN
    10.3  Centralized Value Functions
    10.4  Value Decomposition
    10.5  Practical Tips for MARL Algorithms
          10.5.1  Stacking Time Steps vs. Recurrent Network vs. Neither
          10.5.2  Standardizing Rewards
          10.5.3  Centralized Optimization
    10.6  Presentation of Experimental Results
          10.6.1  Learning Curves
          10.6.2  Hyperparameter Search

11  Multi-Agent Environments
    11.1  Criteria for Choosing Environments
    11.2  Structurally Distinct 2 × 2 Matrix Games
          11.2.1  No-Conflict Games
          11.2.2  Conflict Games
    11.3  Complex Environments
          11.3.1  Level-Based Foraging
          11.3.2  Multi-Agent Particle Environment
          11.3.3  StarCraft Multi-Agent Challenge
          11.3.4  Multi-Robot Warehouse
          11.3.5  Google Research Football
          11.3.6  Hanabi
          11.3.7  Overcooked
    11.4  Environment Collections
          11.4.1  Melting Pot
          11.4.2  OpenSpiel
          11.4.3  Petting Zoo

A  Surveys on Multi-Agent Reinforcement Learning

References

Index
Summary of Notation

Sets are denoted with capital letters. Elements of sets are denoted with lower-case letters. The time index t (or τ) is shown in superscript (e.g., s^t denotes the state at time t). The agent index is shown in subscript (e.g., a_i denotes the action of agent i). The most common symbols used in the book are listed below. Specific sections may use additional notation. A short example combining these conventions is given after the list.

General
  R                 set of real numbers
  ∝                 proportional to
  x⊤                transpose of a vector x
  X⊤                transpose of a matrix X
  Pr                probability
  Pr(x | y)         conditional probability of x given y
  E_p[x]            expectation of x under probability distribution p
  x ∼ p             x sampled according to probability distribution p
  x ← y             assign value y to variable x
  D                 training data set
  ∂f/∂x             derivative of function f with respect to x
  ∇                 gradient
  ⟨a, b, c, ...⟩     concatenation of inputs a, b, c, ... into tuple (a, b, c, ...)
  [x]1              indicator function: returns 1 if x is true, otherwise returns 0

Game Model
  I                 set of agents
  i, j              agent subscripts
  −i                subscript denoting the tuple ⟨all agents except agent i⟩
  S, S̄              state space, set of terminal states
  s                 state
  O, O_i            (joint) observation space, observation space of agent i
  o, o_i            (joint) observation, observation of agent i
  A, A_i            (joint) action space, action space of agent i
  a, a_i            (joint) action, action of agent i
  r, r_i            (joint) reward, reward of agent i
  µ                 initial state distribution
  T                 state transition function
  T̂                 simulation/sampling model of state transitions
  O, O_i            observation function (of agent i)
  R, R_i            reward function (of agent i)
  Γ_s               normal-form game for state s

Policies, Returns, Values
  Π, Π_i            (joint) policy space, policy space of agent i
  π, π_i            (joint) policy, policy of agent i
  π*                optimal policy, or equilibrium joint policy
  H, Ĥ              set of histories, set of full histories
  h, h_i            joint-observation history, observation history of agent i
  ĥ                 full history containing states, joint observations, joint actions
  σ(ĥ)              function returning the joint-observation history from full history ĥ
  γ                 discount factor
  u, u_i            discounted return (for agent i)
  U, U_i            expected discounted return (for agent i)

(Multi-Agent) Reinforcement Learning
  L                 learning algorithm
  α                 learning rate
  ϵ                 exploration rate
  π̄_i               empirical action distribution, or averaged policy, of agent i
  π̂_j               agent model for agent j
  BR_i              set of best-response actions or policies for agent i
  V^π, V_i^π        state-value function (of agent i) under policy π
  Q^π, Q_i^π        action-value function (of agent i) under policy π
  V*, Q*            optimal state-value/action-value function
  Value_i           returns an equilibrium value for agent i in a normal-form game

Deep Learning
  θ                 network parameters
  f(x; θ)           function f for input x with parameters θ
  L(θ)              loss function over parameters θ
  B                 batch of data
  B                 batch size, i.e., the number of samples in a batch

(Multi-Agent) Deep Reinforcement Learning
  θ, θ_i            value function parameters (of agent i)
  ϕ, ϕ_i            policy parameters (of agent i)
  θ̄                 target network parameters
  D, D_i            experience replay buffer (of agent i)
  H                 entropy
  z                 centralized information, e.g., the state of the environment
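As a brief illustrative sketch of how these conventions combine (the formal definitions appear in Sections 2.3 and 4.1), the discounted return of agent i and its expectation under a joint policy π can be written, for the infinite-horizon discounted case, as:

\[
  % Discounted return of agent i: time index t in superscript, agent index i in subscript.
  u_i = \sum_{t=0}^{\infty} \gamma^{t} \, r_i^{t},
  \qquad
  % Expected discounted return of agent i under the joint policy \pi.
  U_i(\pi) = \mathbb{E}_{\pi}\!\left[\, u_i \,\right]
\]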
List of Figures

1.1  Schematic of a multi-agent system.
1.2  A level-based foraging task.
1.3  Schematic of multi-agent reinforcement learning.
1.4  Dimensions in multi-agent reinforcement learning.
2.1  Definition of a reinforcement learning problem.
2.2  Basic reinforcement learning loop for a single-agent system.
2.3  Mars Rover Markov decision process (MDP).
2.4  Sarsa and Q-learning algorithms in the Mars Rover problem.
3.1  Hierarchy of game models.
3.2  Three normal-form games with two agents (i.e., matrix games).
3.3  Three game models as directed cyclic graphs.
3.4  Level-based foraging environment with partial observability.
3.5  Synonymous terms in reinforcement learning and game theory.
4.1  Definition of a multi-agent reinforcement learning problem.
4.2  Matrix game in which ϵ-Nash equilibrium can be far from Nash equilibrium.
4.3  Chicken matrix game.
4.4  Feasible joint rewards and Pareto frontier in Chicken game.
4.5  Feasible joint rewards and fairness-optimal outcomes in Battle of the Sexes game.
4.6  Ten episodes between two agents in the non-repeated Prisoner’s Dilemma matrix game.
4.7  Instances of END-OF-LINE problem.
5.1  General learning process in multi-agent reinforcement learning.