KRISTEN KEHRER AND CALEB KAISER

MACHINE LEARNING UPGRADE

A Data Scientist’s Guide to MLOps®, LLMs, and ML Infrastructure
Copyright © 2024 by John Wiley & Sons, Inc. All rights, including for text and data mining, AI training, and similar technologies, are reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada and the United Kingdom.

ISBNs: 9781394249633 (paperback), 9781394249664 (ePDF), 9781394249640 (ePub)

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at www.wiley.com/go/permission.

Trademarks: Wiley, the Wiley logo, are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. MLOps is a registered trademark of Datarobot, Inc. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993. For product technical support, you can find answers to frequently asked questions or reach us via live chat at https://support.wiley.com.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging in Publication data available on request.

Cover images: Neon: © bortonia9/Getty Images
Background: © imaginima/Getty Images
Cover design: Wiley
Contents

Introduction

1 A Gentle Introduction to Modern Machine Learning
    Data Science Is Diverging from Business Intelligence
    From CRISP-DM to Modern, Multicomponent ML Systems
    The Emergence of LLMs Has Increased ML’s Power and Complexity
    What You Can Expect from This Book

2 An End-to-End Approach
    Components of a YouTube Search Agent
    Principles of a Production Machine Learning System
    Observability
    Reproducibility
    Interoperability
    Scalability
    Improvability
    A Note on Tools

3 A Data-Centric View
    The Emergence of Foundation Models
    The Role of Off-the-Shelf Components
    The Data-Driven Approach
    A Note on Data Ethics
    Building the Dataset
    Working with Vector Databases
    Data Versioning and Management
    Getting Started with Data Versioning
    Knowing “Just Enough” Engineering

4 Standing Up Your LLM
    Selecting Your LLM
    What Type of Inference Do I Need to Perform?
    How Open-Ended Is This Task?
    What Are the Privacy Concerns for This Data?
    How Much Will This Model Cost?
    Experiment Management with LLMs
    LLM Inference
    Basics of Prompt Engineering
    In-Context Learning
    Intermediary Computation
    Augmented Generation
    Agentic Techniques
    Optimizing LLM Inference with Experiment Management
    Fine-Tuning LLMs
    When to Fine-Tune an LLM
    Quantization, QLoRA, and Parameter Efficient Fine-Tuning
    Wrapping Things Up

5 Putting Together an Application
    Prototyping with Gradio
    Creating Graphics with Plotnine
    Adding the Author Selector
    Adding a Logo
    Adding a Tab
    Adding a Title and Subtitle
    Changing the Color of the Buttons
    Click to Download Button
    Putting It All Together
    Deploying Models as APIs
    Implementing an API with FastAPI
    Implementing Uvicorn
    Monitoring an LLM
    Dockerizing Your Service
    Deploying Your Own LLM
    Wrapping Things Up

6 Rounding Out the ML Life Cycle
    Deploying a Simple Random Forest Model
    An Introduction to Model Monitoring
    Model Monitoring with Evidently AI
    Building a Model Monitoring System
    Final Thoughts on Monitoring

7 Review of Best Practices
    Step 1: Understand the Problem
    Step 2: Model Selection and Training
    Step 3: Deploy and Maintain
    Step 4: Collaborate and Communicate
    Emerging Trends in LLMs
    Next Steps in Learning

Appendix: Additional LLM Example

Index
Introduction

Welcome to a journey through the dynamic world of modern machine learning (ML)! In this book, we’ll guide you from the data scientist’s role with historical roots in business intelligence to the forefront of today’s cutting-edge, multicomponent systems. You can find code examples from the book on GitHub at https://github.com/machine-learning-upgrade so you can follow along.

We intend this book to be something you can read all the way through. This is not an index of methods or a comprehensive book on machine learning. Our aim is to cover the challenges associated with modern-day machine learning, with a particular focus on data versioning, experiment tracking, post-production model monitoring, and deployment, and to equip you with the code and examples to start applying best practices immediately.

Chapter 1 lays the groundwork, revealing how the workflow for managing machine learning has evolved from traditional, more linear data science frameworks like CRISP-DM to applications powered by large language models (LLMs). We set the stage by emphasizing the need for a unified framework and hinting at the thrilling path ahead: building an LLM-powered application together.

As we delve into Chapter 2, prepare to witness an end-to-end approach to machine learning, exploring its life cycle, the principles of a production machine learning system, and the core of our LLM application.

Chapter 3 zooms in on the data-centric view, emphasizing the role of data in modern ML. This is a hands-on chapter, where we create embeddings and harness the power of vector databases for text similarity searches. We pair ethical guidelines with data versioning strategies to ensure a responsible and comprehensive approach.

Then comes Chapter 4, where we guide you through selecting the right LLM, leveraging LangChain, and fine-tuning LLM performance.

With each part seamlessly connected, we venture into Chapter 5 to assemble our components and transition from prototype to application. We also demonstrate how to build dashboards and application programming interfaces (APIs) to make your model results available to end users.

But it doesn’t stop there. Chapter 6 completes the ML life cycle, tackling model monitoring, retraining pipelines, and envisioning future deployment strategies and stakeholder communication.

Finally, in Chapter 7, we recap the best practices uncovered throughout this journey, explore emerging trends in LLMs, and provide resources for further learning.

This book is more than a guide; it’s an adventure, an invitation to traverse the landscapes of modern machine learning, and an opportunity to equip yourself with the tools and knowledge to navigate it. So, fasten your seatbelt, friends, and let’s get going!
Acknowledgments

Writing this book has been a collaborative journey filled with shared vision and support from an incredible network of individuals who have made this possible.

We are immensely grateful to the team at Wiley, particularly James Minatel and Gus Miklos, whose dedication and expertise transformed our manuscript into a polished book. We appreciate your commitment to excellence.

Our profound appreciation goes to the technical editor, Harpreet Sahota, who provided invaluable feedback and challenged us to refine our ideas and improve the manuscript. Your insights and guidance were crucial in shaping the final book.

We extend our heartfelt thanks to the readers who will engage with our collective work. We hope this book offers valuable insights and sparks new ideas in your explorations.

To each person who has contributed, directly or indirectly, to this collaborative effort, thank you for being part of this journey.

With gratitude,
Kristen Kehrer
Caleb Kaiser
About the Authors

Kristen Kehrer has been a builder and tinkerer delivering machine learning models since 2010 for e-commerce, healthcare, and utility companies. Ranked as a global LinkedIn Top Voice in data science and analytics in 2018 with 95,000 followers in data science, Kristen is the creator of Data Moves Me. She was previously a faculty member and subject-matter expert at Emeritus Institute of Management. Kristen earned an MS in applied statistics from Worcester Polytechnic Institute and a BS in mathematics.

Caleb Kaiser is a full-stack engineer at Comet. Caleb was previously on the founding team at Cortex Labs. He also worked at Scribe Media on the author platform team and completed a BA of fine art in writing from the School of the Art Institute of Chicago.
About the Technical Editor

Harpreet Sahota is a self-described generative AI hacker. He earned undergraduate and graduate degrees in statistics and mathematics and has been in the “data world” since 2013, working as an actuary, biostatistician, data scientist, and machine learning engineer with expertise in statistics, machine learning, MLOps, LLMOps, and generative AI (with a focus on multimodal retrieval augmented generation). He loves tinkering with new technology and spending time with his wife, Romie, and kids, Jugaad and Jind. His book Practical Retrieval Augmented Generation will publish with Wiley in 2025.
Chapter 1
A Gentle Introduction to Modern Machine Learning

Over the last 20 years, data science has largely been focused on using data to inform business decisions. Typical data science projects have centered around gathering, cleaning, and modeling data or creating a dashboard before finally producing a presentation to share results with stakeholders. This pipeline has been the backbone of many important business decisions. It has driven quite a bit of revenue. There have been many, many dashboards.

Traditionally, we might refer to the projects where we perform descriptive analysis to make informed decisions as business intelligence (BI). In theory, BI is a specific field that is part of the data sciences. Data science technically refers more broadly to the practice of applying statistical methods (including modeling), coding, and domain knowledge to data. Business intelligence applies more narrowly to taking a data-driven approach to business decisions, with more focus on descriptive and diagnostic analytics than on the predictive analytics you might see from data scientists. However, we consider all analysts and BI professionals to be working in the “data sciences.”

In practice, if you’ve had a job as an analyst or a data scientist in the last decade, you’ve probably spent a lot of time on business intelligence, in one way or another.
Many people might cry foul on this claim, pointing out that business intelligence, as we traditionally think of it, falls under the domain of roles with titles like “BI analyst,” while data scientists tend to have more varied and research-focused responsibilities. While that might be true on some level, breaking down the responsibilities and functions of different roles in an average analytics organization makes it difficult to neatly separate data science from BI, and there will always be overlap when working with data.

For example, as an analyst at an average company, you’d likely be responsible for answering “what happened?” questions, using descriptive analytics to provide a snapshot of past performance. You might use Excel, SQL, and visualization software to generate reports and dashboards. You would likely monitor key performance indicators (KPIs) and help make strategic decisions based on historical data. There is also a chance that, as a BI analyst or a business analyst, you will find the KPIs, data sources, and machine learning models (if any) used in this process already set up before you start the project; your job is to manage them.

Now, when the company has completely new data to process, it is often made available via self-service for nontechnical stakeholders through yet another dashboard (this is where those excruciating debates about Tableau versus Power BI can crop up).

In general, this is a useful way to think about the distinction between data scientists and explicit BI roles at average companies. A data scientist will usually be responsible for the more technical, research-intensive projects in an analytics organization: exploring new data sources, implementing predictive analytics, performing hypothesis tests, researching new machine learning models, etc.

However, much of what they do is still focused on, or in service to, what we can broadly define as business intelligence. Or at least, this used to be the case.