Uploader: 高宏飞

Shared on 2025-11-22

Author: Daniel Vaughan

This practical guide provides a collection of techniques and best practices that are generally overlooked in most data engineering and data science pedagogy. A common misconception is that great data scientists are experts in the "big themes" of the discipline: machine learning and programming. But most of the time, these tools can only take us so far. In practice, the smaller tools and skills really separate a great data scientist from a not-so-great one. Taken as a whole, the lessons in this book make the difference between an average data scientist candidate and a qualified data scientist working in the field. Author Daniel Vaughan has collected, extended, and used these skills to create value and train data scientists from different companies and industries.

With this book, you will:
• Understand how data science creates value
• Deliver compelling narratives to sell your data science project
• Build a business case using unit economics principles
• Create new features for an ML model using storytelling
• Learn how to decompose KPIs
• Perform growth decompositions to find root causes for changes in a metric

Daniel Vaughan is head of data at Clip, the leading paytech company in Mexico. He's the author of Analytical Skills for AI and Data Science (O'Reilly).

ISBN: 1098146476
Publisher: O'Reilly Media
Publish Year: 2023
Language: English
Pages: 257
File Format: PDF
File Size: 2.9 MB
Text Preview (First 20 pages)

Daniel Vaughan Data Science: The Hard Parts Techniques for Excelling at Data Science
Data Science: The Hard Parts

This hands-on guide offers a set of techniques and best practices that are often missed in conventional data engineering and data science education. A common misconception is that great data scientists are experts in the “big themes” of the discipline, namely ML and programming. But most of the time, these tools can only take us so far. In reality, it’s the nuances within these larger themes, and the ability to impact the business, that truly distinguish a top-notch data scientist from an average one. Taken as a whole, the lessons in this book make the difference between an average data scientist candidate and an exceptional data scientist working in the field. Author Daniel Vaughan has collected, extended, and used these skills to create value and train data scientists from different companies and industries.

With this book, you will:
• Ensure that your data science workflow creates value
• Design actionable, timely, and relevant metrics
• Deliver compelling narratives to gain stakeholder buy-in
• Use simulation to ensure that your ML algorithm is the right tool for the problem
• Identify, correct, and prevent data leakage
• Understand incrementality by estimating causal effects

Daniel Vaughan has led data teams across different companies and sectors and is currently advising several fintech companies on how to ensure the success of their data, ML, and AI initiatives. Author of Analytical Skills for AI and Data Science (O’Reilly), he has more than 15 years of experience developing machine learning and more than eight years leading data science teams. Daniel holds a PhD in economics from NYU.

US $65.99 CAN $82.99
ISBN: 978-1-098-14647-4

“Daniel has written another masterpiece to serve as the connective tissue for value creation between data scientists and business executives. This book is the missing manual for commercial success from data science.”
Adri Purkayastha, Global Head of AI Technology Risk, BNP Paribas

“Covers everything from economics to advertising to epidemiology and how to apply data science techniques in practice. It starts where most books end, with the actual decision-making processes driven by data insights. A long-overdue addition to any data scientist’s bookshelf.”
Brett Holleman, Freelance data scientist
Daniel Vaughan
Data Science: The Hard Parts
Techniques for Excelling at Data Science
Boston Farnham Sebastopol Tokyo Beijing
Data Science: The Hard Parts
by Daniel Vaughan

Copyright © 2024 Daniel Vaughan. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Corbin Collins
Production Editor: Jonathon Owen
Copyeditor: Sonia Saruba
Proofreader: Piper Editorial Consulting, LLC
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

November 2023: First Edition

Revision History for the First Edition
2023-10-31: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098146474 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science: The Hard Parts, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-14647-4
[LSI]
This book is dedicated to my brother Nicolas, whom I love and admire very much.
Table of Contents

Preface

Part I. Data Analytics Techniques

1. So What? Creating Value with Data Science
   What Is Value?
   What: Understanding the Business
   So What: The Gist of Value Creation in DS
   Now What: Be a Go-Getter
   Measuring Value
   Key Takeaways
   Further Reading

2. Metrics Design
   Desirable Properties That Metrics Should Have
   Measurable
   Actionable
   Relevance
   Timeliness
   Metrics Decomposition
   Funnel Analytics
   Stock-Flow Decompositions
   P×Q-Type Decompositions
   Example: Another Revenue Decomposition
   Example: Marketplaces
   Key Takeaways
   Further Reading

3. Growth Decompositions: Understanding Tailwinds and Headwinds
   Why Growth Decompositions?
   Additive Decomposition
   Example
   Interpretation and Use Cases
   Multiplicative Decomposition
   Example
   Interpretation
   Mix-Rate Decompositions
   Example
   Interpretation
   Mathematical Derivations
   Additive Decomposition
   Multiplicative Decomposition
   Mix-Rate Decomposition
   Key Takeaways
   Further Reading

4. 2×2 Designs
   The Case for Simplification
   What’s a 2×2 Design?
   Example: Test a Model and a New Feature
   Example: Understanding User Behavior
   Example: Credit Origination and Acceptance
   Example: Prioritizing Your Workflow
   Key Takeaways
   Further Reading

5. Building Business Cases
   Some Principles to Construct Business Cases
   Example: Proactive Retention Strategy
   Fraud Prevention
   Purchasing External Datasets
   Working on a Data Science Project
   Key Takeaways
   Further Reading

6. What’s in a Lift?
   Lifts Defined
   Example: Classifier Model
   Self-Selection and Survivorship Biases
   Other Use Cases for Lifts
   Key Takeaways
   Further Reading

7. Narratives
   What’s in a Narrative: Telling a Story with Your Data
   Clear and to the Point
   Credible
   Memorable
   Actionable
   Building a Narrative
   Science as Storytelling
   What, So What, and Now What?
   The Last Mile
   Writing TL;DRs
   Tips to Write Memorable TL;DRs
   Example: Writing a TL;DR for This Chapter
   Delivering Powerful Elevator Pitches
   Presenting Your Narrative
   Key Takeaways
   Further Reading

8. Datavis: Choosing the Right Plot to Deliver a Message
   Some Useful and Not-So-Used Data Visualizations
   Bar Versus Line Plots
   Slopegraphs
   Waterfall Charts
   Scatterplot Smoothers
   Plotting Distributions
   General Recommendations
   Find the Right Datavis for Your Message
   Choose Your Colors Wisely
   Different Dimensions in a Plot
   Aim for a Large Enough Data-Ink Ratio
   Customization Versus Semiautomation
   Get the Font Size Right from the Beginning
   Interactive or Not
   Stay Simple
   Start by Explaining the Plot
   Key Takeaways
   Further Reading

Part II. Machine Learning

9. Simulation and Bootstrapping
   Basics of Simulation
   Simulating a Linear Model and Linear Regression
   What Are Partial Dependence Plots?
   Omitted Variable Bias
   Simulating Classification Problems
   Latent Variable Models
   Comparing Different Algorithms
   Bootstrapping
   Key Takeaways
   Further Reading

10. Linear Regression: Going Back to Basics
   What’s in a Coefficient?
   The Frisch-Waugh-Lovell Theorem
   Why Should You Care About FWL?
   Confounders
   Additional Variables
   The Central Role of Variance in ML
   Key Takeaways
   Further Reading

11. Data Leakage
   What Is Data Leakage?
   Outcome Is Also a Feature
   A Function of the Outcome Is Itself a Feature
   Bad Controls
   Mislabeling of a Timestamp
   Multiple Datasets with Sloppy Time Aggregations
   Leakage of Other Information
   Detecting Data Leakage
   Complete Separation
   Windowing Methodology
   Choosing the Length of the Windows
   The Training Stage Mirrors the Scoring Stage
   Implementing the Windowing Methodology
   I Have Leakage: Now What?
   Key Takeaways
   Further Reading

12. Productionizing Models
   What Does “Production Ready” Mean?
   Batch Scores (Offline)
   Real-Time Model Objects
   Data and Model Drift
   Essential Steps in any Production Pipeline
   Get and Transform Data
   Validate Data
   Training and Scoring Stages
   Validate Model and Scores
   Deploy Model and Scores
   Key Takeaways
   Further Reading

13. Storytelling in Machine Learning
   A Holistic View of Storytelling in ML
   Ex Ante and Interim Storytelling
   Creating Hypotheses
   Feature Engineering
   Ex Post Storytelling: Opening the Black Box
   Interpretability-Performance Trade-Off
   Linear Regression: Setting a Benchmark
   Feature Importance
   Heatmaps
   Partial Dependence Plots
   Accumulated Local Effects
   Key Takeaways
   Further Reading

14. From Prediction to Decisions
   Dissecting Decision Making
   Simple Decision Rules by Smart Thresholding
   Precision and Recall
   Example: Lead Generation
   Confusion Matrix Optimization
   Key Takeaways
   Further Reading

15. Incrementality: The Holy Grail of Data Science?
   Defining Incrementality
   Causal Reasoning to Improve Prediction
   Causal Reasoning as a Differentiator
   Improved Decision Making
   Confounders and Colliders
   Selection Bias
   Unconfoundedness Assumption
   Breaking Selection Bias: Randomization
   Matching
   Machine Learning and Causal Inference
   Open Source Codebases
   Double Machine Learning
   Key Takeaways
   Further Reading

16. A/B Tests
   What Is an A/B Test?
   Decision Criterion
   Minimum Detectable Effects
   Choosing the Statistical Power, Level, and P
   Estimating the Variance of the Outcome
   Simulations
   Example: Conversion Rates
   Setting the MDE
   Hypotheses Backlog
   Metric
   Hypothesis
   Ranking
   Governance of Experiments
   Key Takeaways
   Further Reading

17. Large Language Models and the Practice of Data Science
   The Current State of AI
   What Do Data Scientists Do?
   Evolving the Data Scientist’s Job Description
   Case Study: A/B Testing
   Case Study: Data Cleansing
   Case Study: Machine Learning
   LLMs and This Book
   Key Takeaways
   Further Reading

Index
Preface

I’ll posit that learning and practicing data science is hard. It is hard because you are expected to be a great programmer who not only knows the intricacies of data structures and their computational complexity but is also well versed in Python and SQL. Statistics and the latest machine learning predictive techniques ought to be a second language to you, and naturally you need to be able to apply all of these to solve actual business problems that may arise. But the job is also hard because you have to be a great communicator who tells compelling stories to nontechnical stakeholders who may not be used to making decisions in a data-driven way.

So let’s be honest: it’s almost self-evident that the theory and practice of data science is hard. And any book that aims at covering the hard parts of data science is either encyclopedic and exhaustive, or must go through a preselection process that filters out some topics.

I must acknowledge at the outset that this is a selection of topics that I consider the hard parts to learn in data science, and that this label is subjective by nature. To make it less so, I’ll pose that it’s not that they’re harder to learn because of their complexity, but rather that at this point in time, the profession has put a low enough weight on these as entry topics to have a career in data science. So in practice, they are harder to learn because it’s hard to find material on them.

The data science curriculum usually emphasizes learning programming and machine learning, what I call the big themes in data science. Almost everything else is expected to be learned on the job, and unfortunately, it really matters if you’re lucky enough to find a mentor where you land your first or second job. Large tech companies are great because they have an equally large talent density, so many of these somewhat underground topics become part of local company subcultures, unavailable to many practitioners.

This book is about techniques that will help you become a more productive data scientist. I’ve divided it into two parts: Part I treats topics in data analytics and on the softer side of data science, and Part II is all about machine learning (ML).
While it can be read in any order without creating major friction, there are instances of chapters that make references to previous chapters; most of the time you can skip the reference, and the material will remain clear and self-explanatory. References are mostly used to provide a sense of unity across seemingly independent topics.

Part I covers the following topics:

Chapter 1, “So What? Creating Value with Data Science”
   What is the role of data science in creating value for the organization, and how do you measure it?

Chapter 2, “Metrics Design”
   I argue that data scientists are best suited to improve on the design of actionable metrics. Here I show you how to do it.

Chapter 3, “Growth Decompositions: Understanding Tailwinds and Headwinds”
   Understanding what’s going on with the business and coming up with a compelling narrative is a common ask for data scientists. This chapter introduces some growth decompositions that can be used to automate part of this workflow.

Chapter 4, “2×2 Designs”
   Learning to simplify the world can take you a long way, and 2×2 designs will help you achieve that, as well as help you improve your communication with your stakeholders.

Chapter 5, “Building Business Cases”
   Before starting a project, you should have a business case. This chapter shows you how to do it.

Chapter 6, “What’s in a Lift?”
   As simple as they are, lifts can speed up analyses that you might’ve considered doing with machine learning. I explain lifts in this chapter.

Chapter 7, “Narratives”
   Data scientists need to become better at storytelling and structuring compelling narratives. Here I show you how.

Chapter 8, “Datavis: Choosing the Right Plot to Deliver a Message”
   Investing enough time on your data visualizations should also help you with your narrative. This chapter discusses some best practices.
Part II is about ML and covers the following topics:

Chapter 9, “Simulation and Bootstrapping”
   Simulation techniques can help you strengthen your understanding of different prediction algorithms. I show you how, along with some caveats of using your favorite regression and classification techniques. I also discuss bootstrapping that can be used to find confidence intervals of some hard-to-compute estimands.

Chapter 10, “Linear Regression: Going Back to Basics”
   Having some deep knowledge of linear regression is critical to understanding some more advanced topics. In this chapter I go back to basics, hoping to provide a stronger intuitive foundation of machine learning algorithms.

Chapter 11, “Data Leakage”
   What is data leakage, and how can you identify it and prevent it? This chapter shows how.

Chapter 12, “Productionizing Models”
   A model is only good if it reaches the production stage. Fortunately, this is a well-understood and structured problem, and I show the most critical of these steps.

Chapter 13, “Storytelling in Machine Learning”
   There are some great techniques you can use to open the black box and excel at storytelling in ML.

Chapter 14, “From Prediction to Decisions”
   We create value from improving our decision-making capabilities through data- and ML-driven processes. Here I show you examples of how to move from prediction to decision.

Chapter 15, “Incrementality: The Holy Grail of Data Science?”
   Causality has gained some momentum in data science, but it’s still considered somewhat of a niche. In this chapter I go through the basics, and provide some examples and code that can be readily applied in your organization.

Chapter 16, “A/B Tests”
   A/B tests are the archetypical example of how to estimate the incrementality of alternative courses of action. But experiments require some strong background knowledge of statistics (and the business).

The last chapter (Chapter 17) is quite unique because it’s the only one where no techniques are presented. Here I speculate on the future of data science with the advent of generative artificial intelligence (AI). The main takeaway is that I expect the job description to change radically in the next few years, and data scientists ought to be prepared for this (r)evolution.

This book is intended for data scientists of all levels and seniority. To make the most of the book, it’s better if you have some medium-to-advanced knowledge of machine learning algorithms, as I don’t spend any time introducing linear regression, classification and regression trees, or ensemble learners, such as random forests or gradient boosting machines.
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
   Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
   Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/dshp-repo.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Science: The Hard Parts by Daniel Vaughan (O’Reilly). Copyright 2024 Daniel Vaughan, 978-1-098-14647-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/data-science-the-hard-parts.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia
Acknowledgments

I presented many of the topics covered in the book at Clip’s internal technical seminars. As such I’m indebted to the amazing data team that I had the honor of leading, mentoring, and learning from. Their expertise and knowledge have been instrumental in shaping the content and form of this book.

I’m also deeply indebted to my editor, Corbin Collins, who patiently and graciously proofread the manuscript, found mistakes and omissions, and made suggestions that radically improved the presentation in many ways. I would also like to express my sincere appreciation to Jonathon Owen (production editor) and Sonia Saruba (copyeditor) for their keen eye and exceptional skills and dedication. Their combined efforts have significantly contributed to the quality of this book, and for that, I am forever thankful.

Big thanks to the technical reviewers who found mistakes and typos in the contents and accompanying code of the book, and who also made suggestions to improve the presentation. Special thanks to Naveen Krishnaraj, Brett Holleman, and Chandra Shukla for providing detailed feedback. Many times we did not agree, but their constructive criticism was at the same time humbling and reinforcing. Needless to say, all remaining errors are my own.

They will never read this, but I’m forever grateful to my dogs, Matilda and Domingo, for their infinite capacity to provide love, laughter, tenderness, and companionship.

I am also grateful to my friends and family for their unconditional support and encouragement. A very special thank-you to Claudia: your loving patience when I kept discussing some of these ideas over and over, even when they made little to no sense to you, cannot be overstated.

Finally, I would like to acknowledge the countless researchers and practitioners in data science whose work has inspired and informed my own. This book wouldn’t exist without their dedication and contributions, and I am honored to be a part of this vibrant community.

Thank you all for your support.
PART I Data Analytics Techniques