Real-World Machine Learning (Henrik Brink, Joseph Richards, Mark Fetherolf)

MANNING
Henrik Brink, Joseph W. Richards, Mark Fetherolf
FOREWORD BY Beau Cronin

Real-World Machine Learning

Real-World Machine Learning
HENRIK BRINK
JOSEPH W. RICHARDS
MARK FETHEROLF
MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2017 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Susanna Kline
Technical development editor: Al Scherer
Review editors: Olivia Booth, Ozren Harlovic
Project editor: Kevin Sullivan
Copyeditor: Sharon Wilkey
Proofreader: Katie Tennant
Technical proofreader: Valentin Crettaz
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617291920
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 21 20 19 18 17 16

brief contents

PART 1 THE MACHINE-LEARNING WORKFLOW
1 ■ What is machine learning?
2 ■ Real-world data
3 ■ Modeling and prediction
4 ■ Model evaluation and optimization
5 ■ Basic feature engineering

PART 2 PRACTICAL APPLICATION
6 ■ Example: NYC taxi data
7 ■ Advanced feature engineering
8 ■ Advanced NLP example: movie review sentiment
9 ■ Scaling machine-learning workflows
10 ■ Example: digital display advertising

contents

foreword
preface
acknowledgments
about this book
about the authors
about the cover illustration

PART 1 THE MACHINE-LEARNING WORKFLOW

1 What is machine learning?
1.1 Understanding how machines learn
1.2 Using data to make decisions
    Traditional approaches ■ The machine-learning approach ■ Five advantages to machine learning ■ Challenges
1.3 Following the ML workflow: from data to deployment
    Data collection and preparation ■ Learning a model from data ■ Evaluating model performance ■ Optimizing model performance

1.4 Boosting model performance with advanced techniques
    Data preprocessing and feature engineering ■ Improving models continually with online methods ■ Scaling models with data volume and velocity
1.5 Summary
1.6 Terms from this chapter

2 Real-world data
2.1 Getting started: data collection
    Which features should be included? ■ How can we obtain ground truth for the target variable? ■ How much training data is required? ■ Is the training set representative enough?
2.2 Preprocessing the data for modeling
    Categorical features ■ Dealing with missing data ■ Simple feature engineering ■ Data normalization
2.3 Using data visualization
    Mosaic plots ■ Box plots ■ Density plots ■ Scatter plots
2.4 Summary
2.5 Terms from this chapter

3 Modeling and prediction
3.1 Basic machine-learning modeling
    Finding the relationship between input and target ■ The purpose of finding a good model ■ Types of modeling methods ■ Supervised versus unsupervised learning
3.2 Classification: predicting into buckets
    Building a classifier and making predictions ■ Classifying complex, nonlinear data ■ Classifying with multiple classes
3.3 Regression: predicting numerical values
    Building a regressor and making predictions ■ Performing regression on complex, nonlinear data
3.4 Summary
3.5 Terms from this chapter

4 Model evaluation and optimization
4.1 Model generalization: assessing predictive accuracy for new data
    The problem: overfitting and model optimism ■ The solution: cross-validation ■ Some things to look out for when using cross-validation
4.2 Evaluation of classification models
    Class-wise accuracy and the confusion matrix ■ Accuracy trade-offs and ROC curves ■ Multiclass classification
4.3 Evaluation of regression models
    Using simple regression performance metrics ■ Examining residuals
4.4 Model optimization through parameter tuning
    ML algorithms and their tuning parameters ■ Grid search
4.5 Summary
4.6 Terms from this chapter

5 Basic feature engineering
5.1 Motivation: why is feature engineering useful?
    What is feature engineering? ■ Five reasons to use feature engineering ■ Feature engineering and domain expertise
5.2 Basic feature-engineering processes
    Example: event recommendation ■ Handling date and time features ■ Working with simple text features
5.3 Feature selection
    Forward selection and backward elimination ■ Feature selection for data exploration ■ Real-world feature selection example
5.4 Summary
5.5 Terms from this chapter

PART 2 PRACTICAL APPLICATION

6 Example: NYC taxi data
6.1 Data: NYC taxi trip and fare information
    Visualizing the data ■ Defining the problem and preparing the data
6.2 Modeling
    Basic linear model ■ Nonlinear classifier ■ Including categorical features ■ Including date-time features ■ Model insights
6.3 Summary
6.4 Terms from this chapter

7 Advanced feature engineering
7.1 Advanced text features
    Bag-of-words model ■ Topic modeling ■ Content expansion
7.2 Image features
    Simple image features ■ Extracting objects and shapes
7.3 Time-series features
    Types of time-series data ■ Prediction on time-series data ■ Classical time-series features ■ Feature engineering for event streams
7.4 Summary
7.5 Terms from this chapter

8 Advanced NLP example: movie review sentiment
8.1 Exploring the data and use case
    A first glance at the dataset ■ Inspecting the dataset ■ So what’s the use case?
8.2 Extracting basic NLP features and building the initial model
    Bag-of-words features ■ Building the model with the naïve Bayes algorithm ■ Normalizing bag-of-words features with the tf-idf algorithm ■ Optimizing model parameters

8.3 Advanced algorithms and model deployment considerations
    Word2vec features ■ Random forest model
8.4 Summary
8.5 Terms from this chapter

9 Scaling machine-learning workflows
9.1 Before scaling up
    Identifying important dimensions ■ Subsampling training data in lieu of scaling? ■ Scalable data management systems
9.2 Scaling ML modeling pipelines
    Scaling learning algorithms
9.3 Scaling predictions
    Scaling prediction volume ■ Scaling prediction velocity
9.4 Summary
9.5 Terms from this chapter

10 Example: digital display advertising
10.1 Display advertising
10.2 Digital advertising data
10.3 Feature engineering and modeling strategy
10.4 Size and shape of the data
10.5 Singular value decomposition
10.6 Resource estimation and optimization
10.7 Modeling
10.8 K-nearest neighbors
10.9 Random forests
10.10 Other real-world considerations
10.11 Summary
10.12 Terms from this chapter
10.13 Recap and conclusion

appendix Popular machine-learning algorithms
index

foreword

Machine learning (ML) has become big business in the last few years: companies are using it to make money, applied research has exploded in both industrial and academic settings, and curious developers everywhere are looking to level up their ML skills. But this newfound demand has largely outrun the supply of good methods for learning how these techniques are used in the wild. This book fills a pressing need.

Applied machine learning comprises equal parts mathematical principles and tricks pulled from a bag; it is, in other words, a true craft. Concentrating too much on either aspect at the expense of the other is a failure mode. Balance is essential.

For a long time, the best (and the only) way to learn machine learning was to pursue an advanced degree in one of the fields that (largely separately) developed statistical learning and optimization techniques. The focus in these programs was on the core algorithms, including their theoretical properties and bounds, as well as the characteristic domain problems of the field. In parallel, though, an equally valuable lore was accumulated and passed down through unofficial channels: conference hallways, the tribal wisdom of research labs, and the data processing scripts passed between colleagues. This lore was what actually allowed the work to get done, establishing which algorithms were most appropriate in each situation, how the data needed to be massaged at each step, and how to wire up the different parts of the pipeline.

Cut to today. We now live in an era of open source riches, with high-quality implementations of most ML algorithms readily available on GitHub, as well as comprehensive and well-architected frameworks to tie all the pieces together. But in the midst of this abundance, the unofficial lore has remained stubbornly inaccessible. The authors of this book provide a great service by finally bringing this dark knowledge together in one place; this is a key missing piece as machine learning moves from esoteric academic discipline to core software engineering skillset.

Another point worth emphasizing: most of the machine-learning methods in broad use today are far from perfect, meeting few of the desiderata we might list, were we in a position to design the perfect solution. The current methods are picky about the data they will accept. They are, by and large, happy to provide overly confident predictions if not carefully tended. Small changes in their input can lead to large and mysterious changes in the models they learn. Their results can be difficult to interpret and further interrogate. Modern ML engineering can be viewed as an exercise in managing and mitigating these (and other) rough edges of the underlying optimization and statistical learning methods.

This book is organized exactly as it should be to prepare the reader for these realities. It first covers the typical workflow of machine-learning projects before diving into extended examples that show how this basic framework can be applied in realistic (read: messy) situations. Skimming through these pages, you’ll find few equations (they’re all available elsewhere, including the many classic texts in the field) but instead much of the hidden wisdom on how to go about implementing products and solutions based on machine learning.

This is, far and away, the best of times to be learning about this subject, and this book is an essential complement to the cornucopia of mathematical and formal knowledge available elsewhere. It is that crucial other book that many old hands wish they had back in the day.

BEAU CRONIN
HEAD OF DATA, 21 INC.
BERKELEY, CA

preface

As a student of physics and astronomy, I spent a significant proportion of my time dealing with data from measurements and simulations, with the goal of deriving scientific value by analyzing, visualizing, and modeling the data. With a background as a programmer, I quickly learned to use my programming skills as an important aspect of working with data. When I was first introduced to the world of machine learning, it showed not only great potential as a new tool in the data toolbox, but also a beautiful combination of the two fields that interested me the most: data science and programming.

Machine learning became an important part of my research in the physical sciences and led me to the UC Berkeley astronomy department, where statisticians, physicists, and computer scientists were working together to understand the universe, with machine learning as an increasingly important tool. At the Center for Time Domain Informatics, I met Joseph Richards, a statistician and coauthor of this book. We learned not only that we could use our data science and machine-learning techniques to do scientific research, but also that there was increasing interest from companies and industries from outside academia. We co-founded Wise.io with Damian Eads, Dan Starr, and Joshua Bloom to make machine learning accessible to businesses.

For the past four years, Wise.io has been working with countless companies to optimize, augment, and automate their processes via machine learning. We built a large-scale machine-learning application platform that makes hundreds of millions of predictions every month for our clients, and we learned that data in the real world is messy in ways that continue to surprise us. We hope to pass on to you some of our knowledge of how to work with real-world data and build the next generation of intelligent software with machine learning.

Mark Fetherolf, our third coauthor, was a founder and CTO of multiple startups in systems management and business analytics, built on traditional statistical and quantitative methods. While working on systems to measure and optimize petrochemical refining processes, he and his team realized that the techniques they were using for process manufacturing could be applied to the performance of databases, computer systems, and networks. Their distributed systems management technologies are embedded in leading systems management products. Subsequent ventures were in the measurement and optimization of telecommunications and customer interaction management systems.

A few years later, he got hooked on Kaggle competitions and became a machine-learning convert. He led a cable television recommender project and by necessity learned a lot about big-data technologies, adapting computational algorithms for parallel computing, and the ways people respond to recommendations made by machines. In recent years, he has done consulting work in the application of machine learning and predictive analytics to the real-world applications of digital advertising, telecommunications, semiconductor manufacturing, systems management, and customer experience optimization.

HENRIK BRINK

acknowledgments

We wish to thank Manning Publications and everyone there who contributed to the development and production of the book, in particular Susanna Kline, for her patient and consistent guidance throughout the writing process.

Our thanks to Beau Cronin, for writing the foreword. Thanks also to Valentin Crettaz, who gave all chapters a thorough technical proofread. Many other reviewers provided us with helpful feedback along the way: Alain Couniot, Alessandrini Alfredo, Alex Iverson, Arthur Zubarev, David H. Clements, Dean Iverson, Jacob Quant, Jan Goyvaerts, Kostas Passadis, Leif Singer, Louis Luangkesorn, Massimo Ilario, Michael Lund, Moran Koren, Pablo Domínguez Vaselli, Patrick Toohey, Ravishankar Rajagopalan, Ray Lugo, Ray Morehead, Rees Morrison, Rizwan Patel, Robert Diana, and Ursin Stauss.

Mark Fetherolf thanks Craig Carmichael for sharing his machine-learning obsession; and his wife, Patricia, and daughter, Amy, for many years of tolerance.

Henrik Brink would like to thank the founders and the rest of the team at Wise.io for their shared passion for using machine learning to solve real-world problems. He thanks his parents, Edith and Jonny, and his brother and sister for passing on a passion for knowledge and words, and, most important, he’d like to thank his wife, Ida, and his son, Harald, for their love and support.

Joseph Richards also thanks the Wise.io team for their shared enthusiasm for machine learning and endless commitment and energy, which makes coming in to work every day a real treat. He would like to especially thank his parents, Susan and Carl, for teaching him the joy of life-long learning and for instilling in him the values of hard work and empathy. And, most important, he thanks his wife, Trishna, for her endless love, compassion, and support.

about this book

Real-World Machine Learning is a book for people who want to apply machine learning (ML) to their own real-world problems. It describes and explains the processes, algorithms, and tools that mainstream ML comprises. The focus is on the practical application of well-known algorithms, not building them from scratch. Each step in the process of building and using ML models is presented and illustrated through examples that range from simple to intermediate-level complexity.

Roadmap

Part 1, “The machine-learning workflow,” introduces each of the five steps of the basic machine-learning workflow with a chapter:

■ Chapter 1, “What is machine learning?” introduces the field of machine learning and what it’s useful for.
■ Chapter 2, “Real-world data,” dives into common data processing and preparation steps in the ML workflow.
■ Chapter 3, “Modeling and prediction,” introduces how to build simple ML models and make predictions with widely used algorithms and libraries.
■ Chapter 4, “Model evaluation and optimization,” dives deeper into your ML models to evaluate and optimize their performance.
■ Chapter 5, “Basic feature engineering,” introduces the most common ways to augment your raw data with your knowledge of the problem.

Part 2, “Practical application,” introduces techniques for scaling your models and extracting features from text, images, and time-series data to improve performance on many modern ML problems. This part also includes three full example chapters.

■ Chapter 6, “Example: NYC taxi data,” is the first full example chapter. You’ll try to predict the tipping behavior of passengers.
■ Chapter 7, “Advanced feature engineering,” covers advanced feature engineering processes that allow you to extract value out of natural-language text, images, and time-series data.
■ Chapter 8, “Advanced NLP example: movie review sentiment,” uses your advanced feature engineering knowledge to try to predict the sentiment of online movie reviews.
■ Chapter 9, “Scaling machine-learning workflows,” presents techniques for scaling ML systems to larger volumes of data, higher prediction throughput, and lower prediction latency.
■ Chapter 10, “Example: digital display advertising,” builds a model on large amounts of data, predicting online digital display advertisement click behavior.

How to use this book

If you’re new to machine learning, chapters 1 through 5 will guide you through the processes of data preparation and exploration, feature engineering, modeling, and model evaluation. Our Python examples use the popular data manipulation and machine-learning libraries pandas and scikit-learn. Chapters 6 through 10 include three practical machine-learning examples along with advanced topics in feature engineering and optimization. Because the libraries encapsulate most of the complexity, our code samples can easily be adapted to your own ML applications.

Intended audience

This book will enable programmers, data analysts, statisticians, data scientists, and others to apply machine learning to practical problems, or simply to understand it. They’ll gain practical experience with real-world data, modeling, optimization, and deployment of machine-learning systems without deep theoretical derivations of specific algorithms. The mathematical basis of machine learning is discussed for those who are interested, some algorithms are explained at a high level, and references are provided for those who would like to dig deeper. The focus is on getting practical results to solve the problems at hand.

Code conventions, downloads, and software requirements

This book contains many examples of source code both in numbered listings and inline with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text.
