Uploader: 高宏飞
Shared on: 2025-12-07

Author: Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood

Whether you're part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to run and establish ML reliably, effectively, and accountably within your organization. You'll gain insight into everything from how to do model monitoring in production to how to run a well-tuned model development team in a product organization.

By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you'll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.

You'll examine:

• What ML is: how it functions and what it relies on
• Conceptual frameworks for understanding how ML "loops" work
• How effective productionization can make your ML systems easily monitorable, deployable, and operable
• Why ML systems make production troubleshooting more difficult, and how to compensate accordingly
• How ML, product, and production teams can communicate effectively

Tags: No tags
ISBN: 1098106229
Publisher: O'Reilly Media
Publish Year: 2022
Language: English
Pages: 411
File Format: PDF
File Size: 8.9 MB
Text Preview (First 20 pages)

Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley & Todd Underwood
Foreword by Sam Charrington

Reliable Machine Learning: Applying SRE Principles to ML in Production
MACHINE LEARNING

Reliable Machine Learning
US $69.99 / CAN $87.99
ISBN: 978-1-098-10622-5
Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

Whether you’re part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to run and establish ML reliably, effectively, and accountably within your organization. You’ll gain insight into everything from how to do model monitoring in production to how to run a well-tuned model development team in a product organization.

By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you’ll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.

You’ll examine:

• What ML is: how it functions and what it relies on
• Conceptual frameworks for understanding how ML “loops” work
• How effective productionization can make your ML systems easily monitorable, deployable, and operable
• Why ML systems make production troubleshooting more difficult, and how to compensate accordingly
• How ML, product, and production teams can communicate effectively

“Before you ever put a real system based on machine learning into deployment, you will benefit from reading this book—you can rest assured that it comes from combined decades of hard-won experience.”
—Andrew Moore, VP and General Manager, Google Cloud AI

Cathy Chen has served as a technical program manager, product manager, and engineering manager at Google. Niall Richard Murphy is a CEO of a startup in the ML & SRE space, and has worked for Amazon, Google, and Microsoft. Kranti Parisa is vice president and head of product engineering at Dialpad. D. Sculley is CEO of Kaggle and GM of third party ML ecosystems at Google. Todd Underwood is a senior director and founder of machine learning SRE at Google.
Praise for Reliable Machine Learning

I don’t care how much data science work you’ve done in the past, or how expert you are on the statistical foundations of machine learning. I don’t care if you have read every line of the Tensorflow source code, or implemented your own distributed ML training from scratch. Before you ever put a real system based on machine learning into deployment, you will benefit from reading this book. This is what is needed for the thousands of upcoming ML deployments where their usefulness is a double-edged sword. The more useful, the higher the stakes around safety, security, paying customers who are counting on you, fairness, or policy decisions that will be made on the basis of your system. This book thoroughly surveys the operations you need to be running if you have this level of responsibility, and you can rest assured that it comes from combined decades of hard-won experience.
—Andrew Moore, VP and General Manager, Google Cloud AI

MLOps wouldn’t be nearly as painful if we, the people who do machine learning, applied software engineering best practices. This is a well-written and comprehensive book on these engineering best practices from some of the world’s top experts.
—Chip Huyen, author of Designing Machine Learning Systems

Reliable Machine Learning is a must-read for people building real-world machine learning systems. It provides a blueprint for thinking about the complex and nuanced issues of developing machine learning enabled products.
—Brian Spiering, Data Science Instructor
Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood

Reliable Machine Learning: Applying SRE Principles to ML in Production

Beijing · Boston · Farnham · Sebastopol · Tokyo
Reliable Machine Learning
by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood

Copyright © 2022 Capriole Consulting Inc., Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd Underwood, and O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: John Devins
Development Editor: Angela Rufino
Production Editor: Ashley Stussy
Copyeditor: Sharon Wilkey
Proofreader: Charles Roumeliotis
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2022: First Edition

Revision History for the First Edition
2022-09-19: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098106225 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Reliable Machine Learning, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10622-5
[LSI]
Table of Contents

Foreword  xv
Preface  xix

1. Introduction  1
    The ML Lifecycle  1
    Data Collection and Analysis  3
    ML Training Pipelines  3
    Build and Validate Applications  5
    Quality and Performance Evaluation  6
    Defining and Measuring SLOs  7
    Launch  8
    Monitoring and Feedback Loops  11
    Lessons from the Loop  12

2. Data Management Principles  15
    Data as Liability  16
    The Data Sensitivity of ML Pipelines  21
    Phases of Data  22
    Creation  23
    Ingestion  26
    Processing  26
    Storage  30
    Management  32
    Analysis and Visualization  32
    Data Reliability  33
    Durability  33
    Consistency  34
    Version Control  35
    Performance  36
    Availability  36
    Data Integrity  36
    Security  37
    Privacy  37
    Policy and Compliance  40
    Conclusion  41

3. Basic Introduction to Models  43
    What Is a Model?  43
    A Basic Model Creation Workflow  44
    Model Architecture Versus Model Definition Versus Trained Model  47
    Where Are the Vulnerabilities?  47
    Training Data  48
    Labels  50
    Training Methods  51
    Infrastructure and Pipelines  54
    Platforms  55
    Feature Generation  55
    Upgrades and Fixes  56
    A Set of Useful Questions to Ask About Any Model  57
    An Example ML System  59
    Yarn Product Click-Prediction Model  59
    Features  59
    Labels for Features  61
    Model Updating  61
    Model Serving  62
    Common Failures  63
    Conclusion  64

4. Feature and Training Data  65
    Features  66
    Feature Selection and Engineering  68
    Lifecycle of a Feature  69
    Feature Systems  71
    Labels  77
    Human-Generated Labels  77
    Annotation Workforces  78
    Measuring Human Annotation Quality  79
    An Annotation Platform  80
    Active Learning and AI-Assisted Labeling  81
    Documentation and Training for Labelers  81
    Metadata  82
    Metadata Systems Overview  82
    Dataset Metadata  83
    Feature Metadata  84
    Label Metadata  85
    Pipeline Metadata  85
    Data Privacy and Fairness  86
    Privacy  86
    Fairness  87
    Conclusion  87

5. Evaluating Model Validity and Quality  89
    Evaluating Model Validity  90
    Evaluating Model Quality  93
    Offline Evaluations  93
    Evaluation Distributions  94
    A Few Useful Metrics  97
    Operationalizing Verification and Evaluation  105
    Conclusion  106

6. Fairness, Privacy, and Ethical ML Systems  107
    Fairness (a.k.a. Fighting Bias)  108
    Definitions of Fairness  112
    Reaching Fairness  117
    Fairness as a Process Rather than an Endpoint  120
    A Quick Legal Note  121
    Privacy  121
    Methods to Preserve Privacy  124
    A Quick Legal Note  126
    Responsible AI  127
    Explanation  128
    Effectiveness  130
    Social and Cultural Appropriateness  131
    Responsible AI Along the ML Pipeline  132
    Use Case Brainstorming  132
    Data Collection and Cleaning  132
    Model Creation and Training  133
    Model Validation and Quality Assessment  133
    Model Deployment  134
    Products for the Market  134
    Conclusion  135

7. Training Systems  137
    Requirements  138
    Basic Training System Implementation  140
    Features  141
    Feature Store  141
    Model Management System  142
    Orchestration  143
    Quality Evaluation  144
    Monitoring  144
    General Reliability Principles  145
    Most Failures Will Not Be ML Failures  145
    Models Will Be Retrained  146
    Models Will Have Multiple Versions (at the Same Time!)  146
    Good Models Will Become Bad  147
    Data Will Be Unavailable  148
    Models Should Be Improvable  149
    Features Will Be Added and Changed  149
    Models Can Train Too Fast  150
    Resource Utilization Matters  151
    Utilization != Efficiency  152
    Outages Include Recovery  154
    Common Training Reliability Problems  154
    Data Sensitivity  154
    Example Data Problem at YarnIt  155
    Reproducibility  155
    Example Reproducibility Problem at YarnIt  157
    Compute Resource Capacity  159
    Example Capacity Problem at YarnIt  159
    Structural Reliability  160
    Organizational Challenges  160
    Ethics and Fairness Considerations  161
    Conclusion  162

8. Serving  163
    Key Questions for Model Serving  164
    What Will Be the Load to Our Model?  164
    What Are the Prediction Latency Needs of Our Model?  165
    Where Does the Model Need to Live?  166
    What Are the Hardware Needs for Our Model?  167
    How Will the Serving Model Be Stored, Loaded, Versioned, and Updated?  169
    What Will Our Feature Pipeline for Serving Look Like?  170
    Model Serving Architectures  171
    Offline Serving (Batch Inference)  171
    Online Serving (Online Inference)  174
    Model as a Service  176
    Serving at the Edge  178
    Choosing an Architecture  181
    Model API Design  181
    Testing  183
    Serving for Accuracy or Resilience?  183
    Scaling  185
    Autoscaling  185
    Caching  185
    Disaster Recovery  186
    Ethics and Fairness Considerations  187
    Conclusion  188

9. Monitoring and Observability for Models  189
    What Is Production Monitoring and Why Do It?  189
    What Does It Look Like?  190
    The Concerns That ML Brings to Monitoring  192
    Reasons for Continual ML Observability—in Production  193
    Problems with ML Production Monitoring  193
    Difficulties of Development Versus Serving  194
    A Mindset Change Is Required  196
    Best Practices for ML Model Monitoring  196
    Generic Pre-serving Model Recommendations  197
    Training and Retraining  198
    Model Validation (Before Rollout)  202
    Serving  205
    Other Things to Consider  216
    High-Level Recommendations for Monitoring Strategy  221
    Conclusion  223

10. Continuous ML  225
    Anatomy of a Continuous ML System  226
    Training Examples  227
    Training Labels  227
    Filtering Out Bad Data  227
    Feature Stores and Data Management  228
    Updating the Model  228
    Pushing Updated Models to Serving  229
    Observations About Continuous ML Systems  230
    External World Events May Influence Our Systems  230
    Models Can Influence Their Own Training Data  232
    Temporal Effects Can Arise at Several Timescales  234
    Emergency Response Must Be Done in Real Time  235
    New Launches Require Staged Ramp-ups and Stable Baselines  239
    Models Must Be Managed Rather Than Shipped  241
    Continuous Organizations  242
    Rethinking Noncontinuous ML Systems  245
    Conclusion  246

11. Incident Response  247
    Incident Management Basics  248
    Life of an Incident  249
    Incident Response Roles  250
    Anatomy of an ML-Centric Outage  251
    Terminology Reminder: Model  252
    Story Time  252
    Story 1: Searching but Not Finding  252
    Story 2: Suddenly Useless Partners  257
    Story 3: Recommend You Find New Suppliers  264
    ML Incident Management Principles  274
    Guiding Principles  274
    Model Developer or Data Scientist  275
    Software Engineer  277
    ML SRE or Production Engineer  278
    Product Manager or Business Leader  281
    Special Topics  282
    Production Engineers and ML Engineering Versus Modeling  282
    The Ethical On-Call Engineer Manifesto  284
    Conclusion  286

12. How Product and ML Interact  289
    Different Types of Products  289
    Agile ML?  290
    ML Product Development Phases  290
    Discovery and Definition  291
    Business Goal Setting  292
    MVP Construction and Validation  295
    Model and Product Development  296
    Deployment  296
    Support and Maintenance  297
    Build Versus Buy  298
    Models  298
    Data Processing Infrastructure  299
    End-to-End Platforms  300
    Scoring Approach for Making the Decision  301
    Making the Decision  301
    Sample YarnIt Store Features Powered by ML  302
    Showcasing Popular Yarns by Total Sales  302
    Recommendations Based on Browsing History  303
    Cross-selling and Upselling  303
    Content-Based Filtering  303
    Collaborative Filtering  304
    Conclusion  305

13. Integrating ML into Your Organization  307
    Chapter Assumptions  308
    Leader-Based Viewpoint  308
    Detail Matters  308
    ML Needs to Know About the Business  309
    The Most Important Assumption You Make  310
    The Value of ML  311
    Significant Organizational Risks  312
    ML Is Not Magic  312
    Mental (Way of Thinking) Model Inertia  312
    Surfacing Risk Correctly in Different Cultures  313
    Siloed Teams Don’t Solve All Problems  314
    Implementation Models  314
    Remembering the Goal  315
    Greenfield Versus Brownfield  316
    ML Roles and Responsibilities  316
    How to Hire ML Folks  317
    Organizational Design and Incentives  318
    Strategy  319
    Structure  320
    Processes  321
    Rewards  321
    People  322
    A Note on Sequencing  322
    Conclusion  323

14. Practical ML Org Implementation Examples  325
    Scenario 1: A New Centralized ML Team  325
    Background and Organizational Description  325
    Process  326
    Rewards  327
    People  328
    Default Implementation  329
    Scenario 2: Decentralized ML Infrastructure and Expertise  329
    Background and Organizational Description  329
    Process  330
    Rewards  331
    People  332
    Default Implementation  332
    Scenario 3: Hybrid with Centralized Infrastructure/Decentralized Modeling  332
    Background and Organizational Description  333
    Process  333
    Rewards  334
    People  334
    Default Implementation  334
    Conclusion  335

15. Case Studies: MLOps in Practice  337
    1. Accommodating Privacy and Data Retention Policies in ML Pipelines  337
    Background  337
    Problem and Resolution  338
    Takeaways  340
    2. Continuous ML Model Impacting Traffic  341
    Background  341
    Problem and Resolution  341
    Takeaways  342
    3. Steel Inspection  343
    Background  343
    Problem and Resolution  344
    Takeaways  348
    4. NLP MLOps: Profiling and Staging Load Test  348
    Background  348
    Problem and Resolution  349
    Takeaways  353
    5. Ad Click Prediction: Databases Versus Reality  353
    Background  353
    Problem and Resolution  354
    Takeaways  355
    6. Testing and Measuring Dependencies in ML Workflow  356
    Background  356
    Problem and Resolution  357
    Takeaways  361

Index  363
Foreword

Machine learning (ML) is at the heart of a tremendous wave of technological innovation that has only just begun. Picking up where the “data-driven” wave of the 2000s left off, ML enables a new era of model-driven decision making that promises to improve organizational performance and enhance customer experiences by allowing machines to make near-instantaneous, high-fidelity decisions, at the point of interaction, based on the most current information available.

To support the productive use of ML models, the practice of machine learning has had to evolve rapidly from a primarily academic pursuit to a fully fledged engineering discipline. What was once the sole domain of researchers, research scientists, and data scientists is now, at least equally, the responsibility of ML engineers, MLOps engineers, software engineers, data engineers, and more.

Part of what we see in the evolution of machine learning roles is a healthy shift in focus from simply trying to get models to work to ensuring that they work in a way that meets the needs of the organization. This means building systems that allow the organization to produce and deliver them efficiently, hardening them against failure, enabling recovery from any failures that do happen, and most importantly doing all this in the context of a learning loop that helps the organization improve from one project to the next.

Fortunately, the machine learning community hasn’t had to bootstrap the knowledge required to accomplish all this from scratch. Practitioners of what has come to be called MLOps have had the benefit of a vast array of knowledge that was developed through the practice of DevOps for traditional software projects.

The first wave of MLOps focused on the application of technology and process discipline to the development and deployment of models, resulting in a greater ability for organizations to move models from “the lab” to “the factory,” as well as an explosion of tools and platforms for supporting those stages of the ML lifecycle.
But what about the ops in MLOps? Here again we stand to benefit from progress made operating traditional software systems. A significant contributor to maturing the operational side of DevOps was that community’s broader awareness and application of site reliability engineering (SRE), a set of principles and practices developed at Google and many other organizations that sought to apply engineering discipline to the challenges of operating large-scale, mission-critical software systems.

The application of methodologies from software engineering to machine learning is not a simple lift and shift, however. While one has much to learn from the other, the concerns, challenges, and solutions can differ quite significantly in practice.

That is where this book comes in. Rather than leaving it to each individual or team to identify how to apply SRE principles to their machine learning workflow, the authors of this book aim to give you a head start by sharing what has worked for them at Google, Apple, Microsoft, and other organizations.

To say that the authors are well qualified for their task is an understatement. My work has been deeply informed and influenced by several of them over the years. In the fall of 2019, I organized the first TWIMLcon: AI Platforms conference to provide a venue for the then-nascent MLOps community to share experiences and advance the practice of building processes, tooling, and platforms for supporting the end-to-end machine learning workflow. Among us insiders it became a bit of a running joke just how many of the presentations at the event included a rendition of the “real-world ML systems” diagram from D. Sculley’s seminal paper, “Hidden Technical Debt in Machine Learning Systems.”[1] At our second conference, in 2021, Todd Underwood joined us to present “When Good Models Go Bad: The Damage Caused by Wayward Models and How to Prevent It.”[2] The talk shared the results of a hand analysis of approximately 100 incidents tracked over 10 years in which bad ML models made it, or nearly made it, into production. I’ve since had the pleasure of interviewing D. for The TWIML AI Podcast for an episode titled “Data Debt in Machine Learning.”[3] The depth of experience D. and Todd shared in these interactions comes through clearly in this book.

[1] D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” Advances in Neural Information Processing Systems (January 2015): 2494-2502. https://oreil.ly/lK0WR.
[2] Todd Underwood, “When Good Models Go Bad: The Damage Caused by Wayward Models and How to Prevent It,” TWIMLcon, 2021, https://oreil.ly/7pspJ.
[3] D. Sculley, “Data Debt in Machine Learning,” interview by Sam Charrington, The TWIML AI Podcast, May 19, 2022, https://oreil.ly/887p4.
And, if you’re coming from the SRE perspective, Niall needs no introduction. His books Site Reliability Engineering and The Site Reliability Workbook helped popularize SRE among DevOps practitioners in 2016 and beyond. (Though I’ve not previously come across Cathy and Kranti’s work, it is clear that their experience structuring SRE organizations and driving large-scale consumer-facing applications of ML informs many aspects of the book, particularly the chapters on implementing ML organizations and integrating ML into products.)

This book provides a valuable lens into the authors’ experiences building, operating, and scaling some of the largest machine learning systems around. The authors avoid falling into the trap of attempting to document a static set of architectures, tools, or recommendations, and in so doing succeed at offering so much more: a survey of the vast complexity and myriad considerations that teams must navigate to build and operate—and to build operable—machine learning systems, along with the principles and best practices the authors have collected through their own extensive navigation of the terrain. Their goal is stated early on in the text: to “enumerate enough of the complexity to dissuade any readers from simply thinking… ‘this stuff is easy.’”

If we’ve learned anything as a community over the past several years it’s that the ability to create, deliver, and operate ML models in an efficient, repeatable, and scalable manner is far from easy. We’ve also learned, though, that because of its willingness to openly share experiences and build on the learnings of others, the machine learning community is able to advance rapidly, and what’s hard today becomes easier tomorrow.

I’m grateful to Cathy, Niall, Kranti, D., and Todd for allowing us all to benefit from their hard-won lessons and for helping to advance the state of machine learning in production in the process.

— Sam Charrington
Founder of TWIML, host of The TWIML AI Podcast