Uploader: 高宏飞 (shared on 2025-11-21)

Author: Valerii Babushkin, Arseny Kravchenko

Get the big picture and the important details with this end-to-end guide for designing highly effective, reliable machine learning systems. From information gathering to release and maintenance, Machine Learning System Design guides you step-by-step through every stage of the machine learning process. Inside, you’ll find a reliable framework for building, maintaining, and improving machine learning systems at any scale or complexity.

In Machine Learning System Design: With end-to-end examples you will learn:
• The big picture of machine learning system design
• Analyzing a problem space to identify the optimal ML solution
• Acing ML system design interviews
• Selecting appropriate metrics and evaluation criteria
• Prioritizing tasks at different stages of ML system design
• Solving dataset-related problems with data gathering, error analysis, and feature engineering
• Recognizing common pitfalls in ML system development
• Designing ML systems to be lean, maintainable, and extensible over time

Authors Valerii Babushkin and Arseny Kravchenko have filled this unique handbook with campfire stories and personal tips from their own extensive careers. You’ll learn directly from their experience as you consider every facet of a machine learning system, from requirements gathering and data sourcing to deployment and management of the finished system.

What’s inside
• Metrics and evaluation criteria
• Solving common dataset problems
• Common pitfalls in ML system development
• ML system design interview tips

About the reader
For readers who know the basics of software engineering and machine learning. Examples in Python.

Tags: none
ISBN: 1633438759
Publisher: Manning Publications
Publish Year: 2025
Language: English
Pages: 375
File Format: PDF
File Size: 4.8 MB
Text Preview (First 20 pages)

Machine Learning System Design: With End-to-End Examples
Valerii Babushkin and Arseny Kravchenko
Manning Publications, Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2025 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Doug Rudder
Technical editor: Ioannis Atsonios
Review editor: Kishor Rit
Production editor: Andy Marinkovich
Copy editor: Kari Lucke
Proofreader: Tiffany Taylor
Typesetter and cover designer: Marija Tudor

ISBN 9781633438750
Printed in the United States of America
brief contents

PART 1 PREPARATIONS ........................................ 1
1 ■ Essentials of machine learning system design 3
2 ■ Is there a problem? 17
3 ■ Preliminary research 31
4 ■ Design document 48

PART 2 EARLY STAGE ........................................ 63
5 ■ Loss functions and metrics 65
6 ■ Gathering datasets 91
7 ■ Validation schemas 114
8 ■ Baseline solution 136

PART 3 INTERMEDIATE STEPS ........................................ 153
9 ■ Error analysis 155
10 ■ Training pipelines 185
11 ■ Features and feature engineering 203
12 ■ Measuring and reporting results 234

PART 4 INTEGRATION AND GROWTH ........................................ 261
13 ■ Integration 263
14 ■ Monitoring and reliability 282
15 ■ Serving and inference optimization 311
16 ■ Ownership and maintenance 330
contents

preface xiv
acknowledgments xv
about this book xvii
about the authors xx
about the cover illustration xxi

PART 1 PREPARATIONS ........................................ 1

1 Essentials of machine learning system design 3
1.1 ML system design: What are you? 4
  Why ML system design is so important 8 ■ Roots of ML system design 8
1.2 How this book is structured 10
1.3 When principles of ML system design can be helpful 12

2 Is there a problem? 17
2.1 Problem space vs. solution space 18
2.2 Finding the problem 21
  How we can approximate a solution through an ML system 24
2.3 Risks, limitations, and possible consequences 26
2.4 Costs of a mistake 28

3 Preliminary research 31
3.1 What problems can inspire you? 32
3.2 Build or buy: Open source-based or proprietary tech 34
  Build or buy 34 ■ Open source-based or proprietary tech 36
3.3 Problem decompositioning 36
3.4 Choosing the right degree of innovation 41
  What solutions can be useful? 42 ■ Working on the solution space: Practical example 44

4 Design document 48
4.1 Common myths surrounding the design document 49
  Myth #1. Design documents work only for big companies but not startups 49 ■ Myth #2. Design documents are efficient only for complex projects 50 ■ Myth #3. Every design document should be based on a template 50 ■ Myth #4. Every design document should lead to a deployed system 50
4.2 Goals and antigoals 51
4.3 Design document structure 54
4.4 Reviewing a design document 57
  Design document review example 59
4.5 A design doc is a living thing 60

PART 2 EARLY STAGE ........................................ 63

5 Loss functions and metrics 65
5.1 Losses 66
  Loss tricks for deep learning models 69
5.2 Metrics 70
  Consistency metrics 79 ■ Offline and online metrics, proxy metrics, and hierarchy of metrics 81
5.3 Design document: Adding losses and metrics 84
  Metrics and loss functions for Supermegaretail 84 ■ Metrics and loss functions for PhotoStock Inc. 88 ■ Wrap up 90

6 Gathering datasets 91
6.1 Data sources 92
6.2 Cooking the dataset 94
  ETL 94 ■ Filtering 95 ■ Feature engineering 96 ■ Labeling 96
6.3 Data and metadata 100
6.4 How much data is enough? 101
6.5 Chicken-or-egg problem 104
6.6 Properties of a healthy data pipeline 106
6.7 Design document: Dataset 108
  Dataset for Supermegaretail 108 ■ Dataset for PhotoStock Inc. 111

7 Validation schemas 114
7.1 Reliable evaluation 115
7.2 Standard schemas 116
  Holdout sets 116 ■ Cross-validation 117 ■ The choice of K 118 ■ Time-series validation 119
7.3 Nontrivial schemas 121
  Nested validation 122 ■ Adversarial validation 123 ■ Quantifying dataset leakage exploitation 123
7.4 Split updating procedure 124
7.5 Design document: Choosing validation schemas 131
  Validation schemas for Supermegaretail 131 ■ Validation schemas for PhotoStock Inc. 134

8 Baseline solution 136
8.1 Baseline: What are you? 137
8.2 Constant baselines 139
  Why do we need constant baselines? 141
8.3 Model baselines and feature baselines 142
8.4 Variety of deep learning baselines 144
8.5 Baseline comparison 146
8.6 Design document: Baselines 148
  Baselines for Supermegaretail 148 ■ Baselines for PhotoStock Inc. 150

PART 3 INTERMEDIATE STEPS ........................................ 153

9 Error analysis 155
9.1 Learning curve analysis 156
  Overfitting and underfitting 157 ■ Loss curve 158 ■ Interpreting loss curves 159 ■ Model-wise learning curve 162 ■ Sample-wise learning curve 163 ■ Double descent 163
9.2 Residual analysis 164
  Goals of residual analysis 166 ■ Model assumptions 167 ■ Residual distribution 170 ■ Fairness of residuals 172 ■ Underprediction and overprediction 173 ■ Elasticity curves 174
9.3 Finding commonalities in residuals 175
  Worst/best-case analysis 176 ■ Adversarial validation 177 ■ Variety of group analysis 177 ■ Corner-case analysis 178
9.4 Design document: Error analysis 179
  Error analysis for Supermegaretail 179 ■ Error analysis for PhotoStock Inc. 183

10 Training pipelines 185
10.1 Training pipeline: What are you? 185
  Training pipeline vs. inference pipeline 186
10.2 Tools and platforms 190
10.3 Scalability 191
10.4 Configurability 193
10.5 Testing 196
  Property-based testing 197
10.6 Design document: Training pipelines 198
  Training pipeline for Supermegaretail 199 ■ Training pipeline for PhotoStock Inc. 200

11 Features and feature engineering 203
11.1 Feature engineering: What are you? 204
  Criteria of good and bad features 205 ■ Feature generation 101 206 ■ Model predictions as a feature 208
11.2 Feature importance analysis 209
  Classification of methods 211 ■ Accuracy–interpretability tradeoff 213 ■ Feature importance in deep learning 213
11.3 Feature selection 216
  Feature generation vs. feature selection 216 ■ Goals and possible drawbacks 216 ■ Feature selection method overview 218
11.4 Feature store 221
  Feature store: Pros and cons 223 ■ Desired properties of a feature store 225 ■ Feature catalog 229
11.5 Design document: Feature engineering 229
  Features for Supermegaretail 229 ■ Features for PhotoStock Inc. 231

12 Measuring and reporting results 234
12.1 Measuring results 235
  Model performance 235 ■ Transition to business metrics 236 ■ Simulated environment 237 ■ Human evaluation 241
12.2 A/B testing 241
  Experiment design 242 ■ Splitting strategy 244 ■ Selecting metrics 245 ■ Statistical criteria 246 ■ Simulated experiments 247 ■ When A/B testing is not possible 248
12.3 Reporting results 248
  Control and auxiliary metrics 249 ■ Uplift monitoring 249 ■ When to finish the experiment 250 ■ What to report 251 ■ Debrief document 251
12.4 Design document: Measuring and reporting 252
  Measuring and reporting for Supermegaretail 252 ■ Measuring and reporting for PhotoStock Inc. 254

PART 4 INTEGRATION AND GROWTH ........................................ 261

13 Integration 263
13.1 API design 264
  API practices 268
13.2 Release cycle 269
13.3 Operating the system 273
  Tech-related connections 273 ■ Non-tech-related connections 274
13.4 Overrides and fallbacks 274
13.5 Design document: Integration 276
  Integration for Supermegaretail 276 ■ Integration for PhotoStock Inc. 279

14 Monitoring and reliability 282
14.1 Why monitoring is important 283
  Incoming data 284 ■ Model 284 ■ Model output 285 ■ Postprocessing/decision-making 286
14.2 Software system health 287
14.3 Data quality and integrity 288
  Processing problems 288 ■ Data source corruption 289 ■ Cascade/upstream models 290 ■ Schema change 291 ■ Training-serving skew 291 ■ How to monitor and react 292
14.4 Model quality and relevance 295
  Data drift 297 ■ Concept drift 298 ■ How to monitor 299 ■ How to react 301
14.5 Design document: Monitoring 306
  Monitoring for Supermegaretail 306 ■ Monitoring for PhotoStock Inc. 308

15 Serving and inference optimization 311
15.1 Serving and inference: Challenges 312
15.2 Tradeoffs and patterns 314
  Tradeoffs 314 ■ Patterns 317
15.3 Tools and frameworks 318
  Choosing a framework 319 ■ Serverless inference 321
15.4 Optimizing inference pipelines 323
  Starting with profiling 323 ■ The best optimizing is minimum optimizing 325
15.5 Design document: Serving and inference 325
  Serving and inference for Supermegaretail 326 ■ Serving and inference for PhotoStock Inc. 327

16 Ownership and maintenance 330
16.1 Accountability 331
16.2 Bus factor 336
  Why is being too efficient not beneficial? 336 ■ Why is being too redundant not beneficial? 337 ■ When and how to use the bus factor 337
16.3 Documentation 338
16.4 Complexity 340
16.5 Maintenance and ownership: Supermegaretail and PhotoStock Inc. 343

index 345
preface

Neither Arseny’s (online marketing) nor Valerii’s (chemometrics) early careers had much to do with machine learning (ML). However, it was the mathematical tools of our trades, like regression models and principal component analysis, that sparked our obsession with extracting maximum value from data. Each of us started our journey in the early 2010s, with Valerii ultimately taking on corporate leadership roles in data science at companies such as Facebook, Alibaba, Blockchain.com, and BP, and Arseny honing his engineering skills within deep-tech startups at various stages of growth.

Before joining efforts in writing this book, the only shared piece of our careers was related to ML competitions, where we sharpened our skills on Kaggle. Valerii has reached Grandmaster status and was formerly ranked in the Top 30 globally. Arseny is a Kaggle Master with vast expertise in competitive ML, while both authors strive to share their knowledge and experience as public speakers on ML-related topics.

Our collaboration blends Arseny’s hands-on experience with building and optimizing ML systems with Valerii’s strategic vision and leadership in data-driven enterprises. From real-time video processing and retail optimization to financial transactions analysis, we have tried to distill our combined expertise into the essential principles for building functional ML systems.

Despite the difference in our career paths, we both discovered a gap between knowing ML algorithms and understanding how to effectively apply them in real-world scenarios. We saw brilliant minds struggling to connect the dots and combine fragmented knowledge into a coherent picture. It was this challenge that inspired us to write this book.
acknowledgments

This book would never have been brought to life without the active support of our friends and colleagues.

We would like to express our gratitude to Bogdan Pechenkin for his direct contribution to preparing draft chapters and to Igor Kotenkov, Rustem Feyzkhanov, Evgenii Makarov, and Adam Eldarov for conducting the review during the work-in-progress stages.

We would like to thank Simon Kozlov and Sam Weiss, whose approach to problem-solving inspired many pieces of this book; Sergey Foris for his contribution as an editor; and Tatyana Putilova for enlivening the book with beautiful yet informative illustrations.

Thank you to our technical editor, Ioannis Atsonios, who has worked in academia, consulting, and industry in various positions, including data product ideation, crafting proofs of concept, and actual productization, in addition to carrying out extensive research and development in machine learning, particularly in personalization systems.

We would like to give our kudos to all the reviewers: Aleksei Agarkov, Antonios Tsaltas, Arijit Dasgupta, Craig Henderson, Dinesh Ghanta, Flayol Frédéric, George Onofrei, Konstantin Kliakhandler, Lakshminarayanan A.S., Laura Uzcategui, Lucian M. Sasu, Maxim Volgin, Mikael Dautrey, Mike Wright, Mirerfan Gheibi, Nikos Kanakaris, Ninoslav Cerkez, Odysseas Pentakalso, Oliver Korten, Prashant Kowshik, Robert Diana, Sriram Macharla, Stephen John Warnett, Stipe Cuvalo, Vatsal Desai, Vishnu Ram Venkataraman, and William Jamir Silva; we thank you for your critiques and invaluable insights, which helped fine-tune numerous individual blocks of text and gave the book additional sharpness and focus.

Thanks also to the early access book readers and the students of our courses on ML system design, who challenged us with great questions and offered valuable suggestions.

We extend our heartfelt gratitude to our families and friends, whose patience and understanding allowed us the time and focus needed to bring this book to life. Your unwavering support made this journey possible, and for that, we are deeply thankful.

Finally, a huge thank you to the Manning team for providing us with the opportunity to publish this book and for their guidance in every step of the process.
about this book

Machine Learning System Design is a comprehensive step-by-step guide designed to help you work on your ML system at every stage of its creation—from gathering information and taking preliminary steps to implementation, release, and ongoing maintenance.

As the title suggests, the book is dedicated to ML system design, not focusing on a particular technology but rather providing a high-level framework on how to approach problems related to building, maintaining, and improving ML systems of various scales and levels of complexity.

As ML and AI are getting bigger and bigger these days, there are many books and courses on algorithms, domains, and other specific aspects. However, they don’t provide an entire vision. This leads to the problem Arseny and Valerii have seen in multiple companies, where solid engineers successfully build scattered subcomponents that can’t be combined into a fully functioning, reliable system. This book aims to, among other things, fill this gap.

This book is not beginner friendly. We expect our readers to be familiar with ML basics (you can understand an ML textbook for undergraduate students) and to be fluent in applied programming (you have faced real programming challenges outside the studying sandbox).

Who should read this book?

We hope this book will be helpful to
• Mid-career engineers: to hone their skills in building and maintaining solid ML systems and make sure they don’t miss anything critical.
• Engineering managers and senior engineers: to fill the gaps in their knowledge and view ML system design from a broader perspective.
• Those starting their journey in applied ML: to have structured guidelines at hand before kicking off their first ML system.

How this book is organized: A roadmap

The book structure is designed as a checklist or manual, with an infusion of campfire stories from our own experience. It can be read all at once or used at any moment while working on a specific aspect of an ML system. At the same time, we try not to slip into sounding like a typical textbook or course on classic machine learning or deep learning.

We’ve split the book into four main parts so that its structure is in line with the life cycle of any system:
• Discovery
• Building a core
• Improvement
• Maintenance

Chapters 1–8 are based around the early stages of ML system design. Throughout chapters 1–4, we focus on overall awareness and understanding of the problem your system needs to solve and define the steps needed before system development has started. This phase rarely involves writing code and mostly focuses on small prototypes or proofs of concept. Chapters 5–8 delve into the technical details of the early-stage work. This stage requires a lot of reading and communicating, which is crucial for understanding a problem, defining a landscape for possible solutions, and aligning expectations with other project participants. If we compare an ML system to a human body, it’s about forming a skeleton.

Chapters 9–12 are focused on intermediate steps. At this stage of the system life cycle, the schedule of responsible engineers is usually flipped, and there is way less research and communication and more hands-on work on implementing and improving the system. Here, we focus on such questions as how to make the system solid, accurate, and reliable. Continuing the human body metaphor, this is where the system grows its muscles.

The final part, featuring chapters 13–16, is dedicated to integration and growth. For an inexperienced observer, the system may seem ready to go, but this is a tricky impression. There are multiple (mostly engineering-related) aspects that need to be taken into account before the system goes live successfully. In the software world, a system failure is rarely a disaster like in civil engineering, but it’s still an unwanted scenario. So at this stage you will learn how to make your system reliable, maintainable, and future-proof. If you’re still not tired of human body metaphors, this is where the system gets its wisdom, because untamed strength can lead to nothing but trouble.

liveBook discussion forum

Purchase of Machine Learning System Design includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/machine-learning-system-design/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking them some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website for as long as the book is in print.