Practical Data Science with R
Nina Zumel and John Mount
Foreword by Jim Porzak
Manning
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2014 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Cynthia Kane
Copyeditor: Benjamin Berg
Proofreader: Katie Tennant
Typesetter: Dottie Marsico
Cover designer: Marija Tudor

ISBN 9781617291562
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 19 18 17 16 15 14
To our parents
Olive and Paul Zumel
Peggy and David Mount
brief contents

PART 1 INTRODUCTION TO DATA SCIENCE
1 The data science process
2 Loading data into R
3 Exploring data
4 Managing data

PART 2 MODELING METHODS
5 Choosing and evaluating models
6 Memorization methods
7 Linear and logistic regression
8 Unsupervised methods
9 Exploring advanced methods

PART 3 DELIVERING RESULTS
10 Documentation and deployment
11 Producing effective presentations
contents

foreword
preface
acknowledgments
about this book
about the cover illustration

PART 1 INTRODUCTION TO DATA SCIENCE

1 The data science process
  1.1 The roles in a data science project
      Project roles
  1.2 Stages of a data science project
      Defining the goal
      Data collection and management
      Modeling
      Model evaluation and critique
      Presentation and documentation
      Model deployment and maintenance
  1.3 Setting expectations
      Determining lower and upper bounds on model performance
  1.4 Summary

2 Loading data into R
  2.1 Working with data from files
      Working with well-structured data from files or URLs
      Using R on less-structured data
  2.2 Working with relational databases
      A production-size example
      Loading data from a database into R
      Working with the PUMS data
  2.3 Summary

3 Exploring data
  3.1 Using summary statistics to spot problems
      Typical problems revealed by data summaries
  3.2 Spotting problems using graphics and visualization
      Visually checking distributions for a single variable
      Visually checking relationships between two variables
  3.3 Summary

4 Managing data
  4.1 Cleaning data
      Treating missing values (NAs)
      Data transformations
  4.2 Sampling for modeling and validation
      Test and training splits
      Creating a sample group column
      Record grouping
      Data provenance
  4.3 Summary

PART 2 MODELING METHODS

5 Choosing and evaluating models
  5.1 Mapping problems to machine learning tasks
      Solving classification problems
      Solving scoring problems
      Working without known targets
      Problem-to-method mapping
  5.2 Evaluating models
      Evaluating classification models
      Evaluating scoring models
      Evaluating probability models
      Evaluating ranking models
      Evaluating clustering models
  5.3 Validating models
      Identifying common model problems
      Quantifying model soundness
      Ensuring model quality
  5.4 Summary

6 Memorization methods
  6.1 KDD and KDD Cup 2009
      Getting started with KDD Cup 2009 data
  6.2 Building single-variable models
      Using categorical features
      Using numeric features
      Using cross-validation to estimate effects of overfitting
  6.3 Building models using many variables
      Variable selection
      Using decision trees
      Using nearest neighbor methods
      Using Naive Bayes
  6.4 Summary

7 Linear and logistic regression
  7.1 Using linear regression
      Understanding linear regression
      Building a linear regression model
      Making predictions
      Finding relations and extracting advice
      Reading the model summary and characterizing coefficient quality
      Linear regression takeaways
  7.2 Using logistic regression
      Understanding logistic regression
      Building a logistic regression model
      Making predictions
      Finding relations and extracting advice from logistic models
      Reading the model summary and characterizing coefficients
      Logistic regression takeaways
  7.3 Summary

8 Unsupervised methods
  8.1 Cluster analysis
      Distances
      Preparing the data
      Hierarchical clustering with hclust()
      The k-means algorithm
      Assigning new points to clusters
      Clustering takeaways
  8.2 Association rules
      Overview of association rules
      The example problem
      Mining association rules with the arules package
      Association rule takeaways
  8.3 Summary

9 Exploring advanced methods
  9.1 Using bagging and random forests to reduce training variance
      Using bagging to improve prediction
      Using random forests to further improve prediction
      Bagging and random forest takeaways
  9.2 Using generalized additive models (GAMs) to learn non-monotone relationships
      Understanding GAMs
      A one-dimensional regression example
      Extracting the nonlinear relationships
      Using GAM on actual data
      Using GAM for logistic regression
      GAM takeaways
  9.3 Using kernel methods to increase data separation
      Understanding kernel functions
      Using an explicit kernel on a problem
      Kernel takeaways
  9.4 Using SVMs to model complicated decision boundaries
      Understanding support vector machines
      Trying an SVM on artificial example data
      Using SVMs on real data
      Support vector machine takeaways
  9.5 Summary

PART 3 DELIVERING RESULTS

10 Documentation and deployment
  10.1 The buzz dataset
  10.2 Using knitr to produce milestone documentation
      What is knitr?
      knitr technical details
      Using knitr to document the buzz data
  10.3 Using comments and version control for running documentation
      Writing effective comments
      Using version control to record history
      Using version control to explore your project
      Using version control to share work
  10.4 Deploying models
      Deploying models as R HTTP services
      Deploying models by export
      What to take away
  10.5 Summary

11 Producing effective presentations
  11.1 Presenting your results to the project sponsor
      Summarizing the project’s goals
      Stating the project’s results
      Filling in the details
      Making recommendations and discussing future work
      Project sponsor presentation takeaways
  11.2 Presenting your model to end users
      Summarizing the project’s goals
      Showing how the model fits the users’ workflow
      Showing how to use the model
      End user presentation takeaways
  11.3 Presenting your work to other data scientists
      Introducing the problem
      Discussing related work
      Discussing your approach
      Discussing results and future work
      Peer presentation takeaways
  11.4 Summary

appendix A Working with R and other tools
appendix B Important statistical concepts
appendix C More tools and ideas worth exploring
bibliography
index
foreword

If you’re a beginning data scientist, or want to be one, Practical Data Science with R (PDSwR) is the place to start. If you’re already doing data science, PDSwR will fill in gaps in your knowledge and even give you a fresh look at tools you use on a daily basis; it did for me.

While there are many excellent books on statistics and modeling with R, and a few good management books on applying data science in your organization, this book is unique in that it combines solid technical content with practical, down-to-earth advice on how to practice the craft. I would expect no less from Nina and John.

I first met John when he presented at an early Bay Area R Users Group about his joys and frustrations with R. Since then, Nina, John, and I have collaborated on a couple of projects for my former employer. And John has presented early ideas from PDSwR, both to the “big” group and our Berkeley R-Beginners meetup. Based on his experience as a practicing data scientist, John is outspoken and has strong views about how to do things.

PDSwR reflects Nina and John’s definite views on how to do data science: what tools to use, the process to follow, the important methods, and the importance of interpersonal communications. There are no ambiguities in PDSwR. This, as far as I’m concerned, is perfectly fine, especially since I agree with 98% of their views. (My only quibble is around SQL, but that’s more an issue of my upbringing than of disagreement.) What their unambiguous writing means is that you can focus on the craft and art of data science and not be distracted by choices of which tools and methods to use. This precision is what makes PDSwR practical. Let’s look at some specifics.

Practical tool set: R is a given. In addition, RStudio is the IDE of choice; I’ve been using RStudio since it came out. It has evolved into a remarkable tool; integrated debugging is in the latest version. The third major tool choice in PDSwR is Hadley Wickham’s ggplot2. While R has traditionally included excellent graphics and visualization tools, ggplot2 takes R visualization to the next level. (My practical hint: take a close look at any of Hadley’s R packages, or those of his students.) In addition to those main tools, PDSwR introduces necessary secondary tools: a proper SQL DBMS for larger datasets; Git and GitHub for source code version control; and knitr for documentation generation.

Practical datasets: The only way to learn data science is by doing it. There’s a big leap from the typical teaching datasets to the real world. PDSwR strikes a good balance between the need for a practical (simple) dataset for learning and the messiness of the real world. PDSwR walks you through how to explore a new dataset to find problems in the data, cleaning and transforming when necessary.

Practical human relations: Data science is all about solving real-world problems for your client, either as a consultant or within your organization. In either case, you’ll work with a multifaceted group of people, each with their own motivations, skills, and responsibilities. As practicing consultants, Nina and John understand this well. PDSwR is unique in stressing the importance of understanding these roles while working through your data science project.

Practical modeling: The bulk of PDSwR is about modeling, starting with an excellent overview of the modeling process, including how to pick the modeling method to use and, when done, gauge the model’s quality. The book walks you through the most practical modeling methods you’re likely to need. The theory behind each method is intuitively explained. A specific example is worked through; the code and data are available on the authors’ GitHub site. Most importantly, tricks and traps are covered. Each section ends with practical takeaways.

In short, Practical Data Science with R is a unique and important addition to any data scientist’s library.

JIM PORZAK
SENIOR DATA SCIENTIST AND COFOUNDER OF THE BAY AREA R USERS GROUP
preface

This is the book we wish we’d had when we were teaching ourselves that collection of subjects and skills that has come to be referred to as data science. It’s the book that we’d like to hand out to our clients and peers. Its purpose is to explain the relevant parts of statistics, computer science, and machine learning that are crucial to data science.

Data science draws on tools from the empirical sciences, statistics, reporting, analytics, visualization, business intelligence, expert systems, machine learning, databases, data warehousing, data mining, and big data. It’s because we have so many tools that we need a discipline that covers them all. What distinguishes data science itself from the tools and techniques is the central goal of deploying effective decision-making models to a production environment.

Our goal is to present data science from a pragmatic, practice-oriented viewpoint. We’ve tried to achieve this by concentrating on fully worked exercises on real data; altogether, this book works through over 10 significant datasets. We feel that this approach allows us to illustrate what we really want to teach and to demonstrate all the preparatory steps necessary to any real-world project.

Throughout our text, we discuss useful statistical and machine learning concepts, include concrete code examples, and explore partnering with and presenting to nonspecialists. We hope that if you don’t find one of these topics novel, we’re able to shine a light on one or two other topics that you may not have thought about recently.
acknowledgments

We wish to thank all the many reviewers, colleagues, and others who have read and commented on our early chapter drafts, especially Aaron Colcord, Aaron Schumacher, Ambikesh Jayal, Bryce Darling, Dwight Barry, Fred Rahmanian, Hans Donner, Jeelani Basha, Justin Fister, Dr. Kostas Passadis, Leo Polovets, Marius Butuc, Nathanael Adams, Nezih Yigitbasi, Pablo Vaselli, Peter Rabinovitch, Ravishankar Rajagopalan, Rodrigo Abreu, Romit Singhai, Sampath Chaparala, and Zekai Otles. Their comments, questions, and corrections have greatly improved this book. Special thanks to George Gaines for his thorough technical review of the manuscript shortly before it went into production.

We especially would like to thank our development editor, Cynthia Kane, for all her advice and patience as she shepherded us through the writing process. The same thanks go to Benjamin Berg, Katie Tennant, Kevin Sullivan, and all the other editors at Manning who worked hard to smooth out the rough patches and technical glitches in our text.

In addition, we’d like to thank our colleague David Steier, Professors Anno Saxenian and Doug Tygar from UC Berkeley’s School of Information Science, as well as all the other faculty and instructors who have reached out to us about the possibility of using this book as a teaching text.

We’d also like to thank Jim Porzak for inviting one of us (John Mount) to speak at the Bay Area R Users Group, for being an enthusiastic advocate of our book, and for contributing the foreword. On days when we were tired and discouraged and wondered why we had set ourselves to this task, his interest helped remind us that there’s a need for what we’re offering and for the way that we’re offering it. Without his encouragement, completing this book would have been much harder.
about this book

This book is about data science: a field that uses results from statistics, machine learning, and computer science to create predictive models. Because of the broad nature of data science, it’s important to discuss it a bit and to outline the approach we take in this book.

What is data science?

The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics itself. We define data science as managing the process that can transform hypotheses and data into actionable predictions. Typical predictive analytic goals include predicting who will win an election, what products will sell well together, which loans will default, or which advertisements will be clicked on. The data scientist is responsible for acquiring the data, managing the data, choosing the modeling technique, writing the code, and verifying the results.

Because data science draws on so many disciplines, it’s often a “second calling.” Many of the best data scientists we meet started as programmers, statisticians, business intelligence analysts, or scientists. By adding a few more techniques to their repertoire, they became excellent data scientists. That observation drives this book: we introduce the practical skills needed by the data scientist by concretely working through all of the common project steps on real data. Some steps you’ll know better than we do, some you’ll pick up quickly, and some you may need to research further.

Much of the theoretical basis of data science comes from statistics. But data science as we know it is strongly influenced by technology and software engineering methodologies, and has largely evolved in groups that are driven by computer science and