Uploader: 高宏飞

Shared on 2026-01-08

Author: Peter Bruce, Andrew Bruce, Peter Gedeck

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. The second edition of this popular guide adds comprehensive examples in Python, provides practical guidance on applying statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not. Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you’ll learn:

• Why exploratory data analysis is a key preliminary step in data science
• How random sampling can reduce bias and yield a higher-quality dataset, even with big data
• How the principles of experimental design yield definitive answers to questions
• How to use regression to estimate outcomes and detect anomalies
• Key classification techniques for predicting which categories a record belongs to
• Statistical machine learning methods that "learn" from data
• Unsupervised learning methods for extracting meaning from unlabeled data

ISBN: 149207294X
Publisher: O'Reilly Media
Publish Year: 2020
Language: English
Pages: 363
File Format: PDF
File Size: 16.0 MB
Text Preview (First 20 pages)

Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python
Second Edition
Peter Bruce, Andrew Bruce & Peter Gedeck
Peter Bruce, Andrew Bruce, and Peter Gedeck Practical Statistics for Data Scientists 50+ Essential Concepts Using R and Python SECOND EDITION Boston Farnham Sebastopol TokyoBeijing
978-1-492-07294-2 [LSI]

Practical Statistics for Data Scientists
by Peter Bruce, Andrew Bruce, and Peter Gedeck

Copyright © 2020 Peter Bruce, Andrew Bruce, and Peter Gedeck. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Kristen Brown
Copyeditor: Piper Editorial
Proofreader: Arthur Johnson
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

May 2017: First Edition
May 2020: Second Edition

Revision History for the Second Edition
2020-04-10: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492072942 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Statistics for Data Scientists, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Peter Bruce and Andrew Bruce would like to dedicate this book to the memories of our parents, Victor G. Bruce and Nancy C. Bruce, who cultivated a passion for math and science; and to our early mentors John W. Tukey and Julian Simon and our lifelong friend Geoff Watson, who helped inspire us to pursue a career in statistics. Peter Gedeck would like to dedicate this book to Tim Clark and Christian Kramer, with deep thanks for their scientific collaboration and friendship.
Table of Contents

Preface

1. Exploratory Data Analysis
Elements of Structured Data · Further Reading · Rectangular Data · Data Frames and Indexes · Nonrectangular Data Structures · Further Reading · Estimates of Location · Mean · Median and Robust Estimates · Example: Location Estimates of Population and Murder Rates · Further Reading · Estimates of Variability · Standard Deviation and Related Estimates · Estimates Based on Percentiles · Example: Variability Estimates of State Population · Further Reading · Exploring the Data Distribution · Percentiles and Boxplots · Frequency Tables and Histograms · Density Plots and Estimates · Further Reading · Exploring Binary and Categorical Data · Mode · Expected Value · Probability · Further Reading · Correlation · Scatterplots · Further Reading · Exploring Two or More Variables · Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data) · Two Categorical Variables · Categorical and Numeric Data · Visualizing Multiple Variables · Further Reading · Summary

2. Data and Sampling Distributions
Random Sampling and Sample Bias · Bias · Random Selection · Size Versus Quality: When Does Size Matter? · Sample Mean Versus Population Mean · Further Reading · Selection Bias · Regression to the Mean · Further Reading · Sampling Distribution of a Statistic · Central Limit Theorem · Standard Error · Further Reading · The Bootstrap · Resampling Versus Bootstrapping · Further Reading · Confidence Intervals · Further Reading · Normal Distribution · Standard Normal and QQ-Plots · Long-Tailed Distributions · Further Reading · Student’s t-Distribution · Further Reading · Binomial Distribution · Further Reading · Chi-Square Distribution · Further Reading · F-Distribution · Further Reading · Poisson and Related Distributions · Poisson Distributions · Exponential Distribution · Estimating the Failure Rate · Weibull Distribution · Further Reading · Summary

3. Statistical Experiments and Significance Testing
A/B Testing · Why Have a Control Group? · Why Just A/B? Why Not C, D,…? · Further Reading · Hypothesis Tests · The Null Hypothesis · Alternative Hypothesis · One-Way Versus Two-Way Hypothesis Tests · Further Reading · Resampling · Permutation Test · Example: Web Stickiness · Exhaustive and Bootstrap Permutation Tests · Permutation Tests: The Bottom Line for Data Science · Further Reading · Statistical Significance and p-Values · p-Value · Alpha · Type 1 and Type 2 Errors · Data Science and p-Values · Further Reading · t-Tests · Further Reading · Multiple Testing · Further Reading · Degrees of Freedom · Further Reading · ANOVA · F-Statistic · Two-Way ANOVA · Further Reading · Chi-Square Test · Chi-Square Test: A Resampling Approach · Chi-Square Test: Statistical Theory · Fisher’s Exact Test · Relevance for Data Science · Further Reading · Multi-Arm Bandit Algorithm · Further Reading · Power and Sample Size · Sample Size · Further Reading · Summary

4. Regression and Prediction
Simple Linear Regression · The Regression Equation · Fitted Values and Residuals · Least Squares · Prediction Versus Explanation (Profiling) · Further Reading · Multiple Linear Regression · Example: King County Housing Data · Assessing the Model · Cross-Validation · Model Selection and Stepwise Regression · Weighted Regression · Further Reading · Prediction Using Regression · The Dangers of Extrapolation · Confidence and Prediction Intervals · Factor Variables in Regression · Dummy Variables Representation · Factor Variables with Many Levels · Ordered Factor Variables · Interpreting the Regression Equation · Correlated Predictors · Multicollinearity · Confounding Variables · Interactions and Main Effects · Regression Diagnostics · Outliers · Influential Values · Heteroskedasticity, Non-Normality, and Correlated Errors · Partial Residual Plots and Nonlinearity · Polynomial and Spline Regression · Polynomial · Splines · Generalized Additive Models · Further Reading · Summary

5. Classification
Naive Bayes · Why Exact Bayesian Classification Is Impractical · The Naive Solution · Numeric Predictor Variables · Further Reading · Discriminant Analysis · Covariance Matrix · Fisher’s Linear Discriminant · A Simple Example · Further Reading · Logistic Regression · Logistic Response Function and Logit · Logistic Regression and the GLM · Generalized Linear Models · Predicted Values from Logistic Regression · Interpreting the Coefficients and Odds Ratios · Linear and Logistic Regression: Similarities and Differences · Assessing the Model · Further Reading · Evaluating Classification Models · Confusion Matrix · The Rare Class Problem · Precision, Recall, and Specificity · ROC Curve · AUC · Lift · Further Reading · Strategies for Imbalanced Data · Undersampling · Oversampling and Up/Down Weighting · Data Generation · Cost-Based Classification · Exploring the Predictions · Further Reading · Summary

6. Statistical Machine Learning
K-Nearest Neighbors · A Small Example: Predicting Loan Default · Distance Metrics · One Hot Encoder · Standardization (Normalization, z-Scores) · Choosing K · KNN as a Feature Engine · Tree Models · A Simple Example · The Recursive Partitioning Algorithm · Measuring Homogeneity or Impurity · Stopping the Tree from Growing · Predicting a Continuous Value · How Trees Are Used · Further Reading · Bagging and the Random Forest · Bagging · Random Forest · Variable Importance · Hyperparameters · Boosting · The Boosting Algorithm · XGBoost · Regularization: Avoiding Overfitting · Hyperparameters and Cross-Validation · Summary

7. Unsupervised Learning
Principal Components Analysis · A Simple Example · Computing the Principal Components · Interpreting Principal Components · Correspondence Analysis · Further Reading · K-Means Clustering · A Simple Example · K-Means Algorithm · Interpreting the Clusters · Selecting the Number of Clusters · Hierarchical Clustering · A Simple Example · The Dendrogram · The Agglomerative Algorithm · Measures of Dissimilarity · Model-Based Clustering · Multivariate Normal Distribution · Mixtures of Normals · Selecting the Number of Clusters · Further Reading · Scaling and Categorical Variables · Scaling the Variables · Dominant Variables · Categorical Data and Gower’s Distance · Problems with Clustering Mixed Data · Summary

Bibliography

Index
Preface

This book is aimed at the data scientist with some familiarity with the R and/or Python programming languages, and with some prior (perhaps spotty or ephemeral) exposure to statistics.

Two of the authors came to the world of data science from the world of statistics, and have some appreciation of the contribution that statistics can make to the art of data science. At the same time, we are well aware of the limitations of traditional statistics instruction: statistics as a discipline is a century and a half old, and most statistics textbooks and courses are laden with the momentum and inertia of an ocean liner.

All the methods in this book have some connection—historical or methodological—to the discipline of statistics. Methods that evolved mainly out of computer science, such as neural nets, are not included.

Two goals underlie this book:

• To lay out, in digestible, navigable, and easily referenced form, key concepts from statistics that are relevant to data science.
• To explain which concepts are important and useful from a data science perspective, which are less so, and why.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.

Key Terms

Data science is a fusion of multiple disciplines, including statistics, computer science, information technology, and domain-specific fields. As a result, several different terms could be used to reference a given concept. Key terms and their synonyms will be highlighted throughout the book in a sidebar such as this.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

In all cases, this book gives code examples first in R and then in Python. In order to avoid unnecessary repetition, we generally show only output and plots created by the R code. We also skip the code required to load the required packages and data sets. You can find the complete code as well as the data sets for download at https://github.com/gedeck/practical-statistics-for-data-scientists.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck (O’Reilly). Copyright 2020 Peter Bruce, Andrew Bruce, and Peter Gedeck, 978-1-492-07294-2.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/practicalStats_dataSci_2e.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and more information about our books and courses, see our website at http://oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

The authors acknowledge the many people who helped make this book a reality.

Gerhard Pilcher, CEO of the data mining firm Elder Research, saw early drafts of the book and gave us detailed and helpful corrections and comments. Likewise, Anya McGuirk and Wei Xiao, statisticians at SAS, and Jay Hilfiger, fellow O’Reilly author, provided helpful feedback on initial drafts of the book. Toshiaki Kurokawa, who translated the first edition into Japanese, did a comprehensive job of reviewing and correcting in the process. Aaron Schumacher and Walter Paczkowski thoroughly reviewed the second edition of the book and provided numerous helpful and valuable suggestions for which we are extremely grateful. Needless to say, any errors that remain are ours alone.

At O’Reilly, Shannon Cutt has shepherded us through the publication process with good cheer and the right amount of prodding, while Kristen Brown smoothly took our book through the production phase. Rachel Monaghan and Eliahu Sussman corrected and improved our writing with care and patience, while Ellen Troutman-Zaig prepared the index. Nicole Tache took over the reins for the second edition and has both guided the process effectively and provided many good editorial suggestions to improve the readability of the book for a broad audience. We also thank Marie Beaugureau, who initiated our project at O’Reilly, as well as Ben Bengfort, O’Reilly author and Statistics.com instructor, who introduced us to O’Reilly.

We, and this book, have also benefited from the many conversations Peter has had over the years with Galit Shmueli, coauthor on other book projects.

Finally, we would like to especially thank Elizabeth Bruce and Deborah Donnell, whose patience and support made this endeavor possible.
CHAPTER 1
Exploratory Data Analysis

This chapter focuses on the first step in any data science project: exploring the data.

Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples. In 1962, John W. Tukey (Figure 1-1) called for a reformation of statistics in his seminal paper “The Future of Data Analysis” [Tukey-1962]. He proposed a new scientific discipline called data analysis that included statistical inference as just one component. Tukey forged links to the engineering and computer science communities (he coined the terms bit, short for binary digit, and software), and his original tenets are surprisingly durable and form part of the foundation for data science. The field of exploratory data analysis was established with Tukey’s 1977 now-classic book Exploratory Data Analysis [Tukey-1977]. Tukey presented simple plots (e.g., boxplots, scatterplots) that, along with summary statistics (mean, median, quantiles, etc.), help paint a picture of a data set.

With the ready availability of computing power and expressive data analysis software, exploratory data analysis has evolved well beyond its original scope. Key drivers of this discipline have been the rapid development of new technology, access to more and bigger data, and the greater use of quantitative analysis in a variety of disciplines. David Donoho, professor of statistics at Stanford University and former undergraduate student of Tukey’s, authored an excellent article based on his presentation at the Tukey Centennial workshop in Princeton, New Jersey [Donoho-2015]. Donoho traces the genesis of data science back to Tukey’s pioneering work in data analysis.
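To make the idea concrete, here is a minimal Python sketch (not taken from the book) of the kind of first look Tukey advocated: a few summary statistics plus a boxplot for a single variable. It reads the state data set from the book’s companion GitHub repository mentioned in the Preface; the raw-file path and the Population column name are assumptions about that repository’s layout rather than details quoted in this excerpt.

```python
# Minimal EDA sketch: summary statistics plus a simple plot for one variable.
# NOTE: the raw-file URL and the "Population" column name are assumptions
# about the companion repository, not details quoted from the book.
import pandas as pd
import matplotlib.pyplot as plt

url = ("https://raw.githubusercontent.com/gedeck/"
       "practical-statistics-for-data-scientists/master/data/state.csv")
state = pd.read_csv(url)

# Location and variability estimates: count, mean, std, quartiles, min/max
print(state["Population"].describe())
print("Median:", state["Population"].median())

# A boxplot paints a quick picture of the distribution and flags outliers
state["Population"].plot.box()
plt.ylabel("Population")
plt.show()
```

Chapter 1 develops each of these pieces (location estimates, variability estimates, and plots of the distribution) in turn.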
Figure 1-1. John Tukey, the eminent statistician whose ideas developed over 50 years ago form the foundation of data science

Elements of Structured Data

Data comes from many sources: sensor measurements, events, text, images, and videos. The Internet of Things (IoT) is spewing out streams of information. Much of this data is unstructured: images are a collection of pixels, with each pixel containing RGB (red, green, blue) color information. Texts are sequences of words and nonword characters, often organized by sections, subsections, and so on. Clickstreams are sequences of actions by a user interacting with an app or a web page. In fact, a major challenge of data science is to harness this torrent of raw data into actionable information. To apply the statistical concepts covered in this book, unstructured raw data must be processed and manipulated into a structured form. One of the commonest forms of structured data is a table with rows and columns—as data might emerge from a relational database or be collected for a study.

There are two basic types of structured data: numeric and categorical. Numeric data comes in two forms: continuous, such as wind speed or time duration, and discrete, such as the count of the occurrence of an event. Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false. Another useful type of categorical data is ordinal data in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).

Why do we bother with a taxonomy of data types? It turns out that for the purposes of data analysis and predictive modeling, the data type is important to help determine the type of visual display, data analysis, or statistical model. In fact, data science software, such as R and Python, uses these data types to improve computational performance. More important, the data type for a variable determines how software will handle computations for that variable.
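As a rough illustration of how this taxonomy surfaces in Python, the short sketch below (not from the book) builds a small pandas data frame with one column of each type and marks the categorical and ordinal columns explicitly; the column names and values are invented for the example.

```python
# Illustrative only: one column per data type discussed above.
# Column names and values are made up for this example.
import pandas as pd

df = pd.DataFrame({
    "wind_speed": [3.2, 7.5, 1.1],            # numeric, continuous
    "event_count": [0, 4, 2],                  # numeric, discrete
    "screen_type": ["plasma", "LCD", "LED"],   # categorical
    "subscribed": [True, False, True],         # binary
    "rating": [5, 3, 4],                       # ordinal (1-5 scale)
})

# Telling the software which columns are categorical lets it store and
# handle them appropriately (e.g., for plotting, grouping, and modeling).
df["screen_type"] = df["screen_type"].astype("category")
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5],
                              ordered=True)

print(df.dtypes)
# Ordered categories keep track of their ordering, so min/max respect it
print(df["rating"].min(), df["rating"].max())
```

R’s factor type plays the same role here as the pandas category dtype.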