THE ART OF MACHINE LEARNING
A Hands-On Guide to Machine Learning with R

by Norman Matloff

San Francisco
THE ART OF MACHINE LEARNING. Copyright © 2024 by Norman Matloff.

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

27 26 25 24 23    1 2 3 4 5

ISBN-13: 978-1-7185-0210-9 (print)
ISBN-13: 978-1-7185-0211-6 (ebook)

Published by No Starch Press®, Inc.
245 8th Street, San Francisco, CA 94103
phone: +1.415.863.9900
www.nostarch.com; info@nostarch.com

Publisher: William Pollock
Managing Editor: Jill Franklin
Production Manager: Sabrina Plomitallo-González
Production Editor: Jennifer Kepler
Developmental Editor: Jill Franklin
Cover Illustrator: Gina Redman
Interior Design: Octopod Studios
Technical Reviewer: Ira Sharenow
Copyeditor: George Hale
Proofreader: Jamie Lauer
Indexer: BIM Creatives, LLC

Library of Congress Control Number: 2023002283

For customer service inquiries, please contact info@nostarch.com. For information on distribution, bulk sales, corporate sales, or translations: sales@nostarch.com. For permission to translate this work: rights@nostarch.com. To report counterfeit copies or piracy: counterfeit@nostarch.com.

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
About the Author

Dr. Norm Matloff is a professor of computer science at the University of California, Davis, and was formerly a professor of statistics at that university. Dr. Matloff was born in Los Angeles and grew up in East Los Angeles and the San Gabriel Valley. He has a PhD in pure mathematics from the University of California, Los Angeles. His current research interests are in machine learning, fair AI, parallel processing, statistical computing, and statistical methodology for handling missing data.

Dr. Matloff is a former appointed member of IFIP Working Group 11.3, an international committee concerned with database software security, established under the United Nations. He was a founding member of the UC Davis Department of Statistics and participated in the formation of the Department of Computer Science as well. He is a recipient of the campus-wide Distinguished Teaching Award and Distinguished Public Service Award at UC Davis. He has served as editor in chief of the R Journal and on the editorial board of the Journal of Statistical Software. He is the author of several published textbooks. His R-based book, Statistical Regression and Classification: From Linear Models to Machine Learning, won the Ziegel Prize in 2017.
About the Technical Reviewer

Ira Sharenow is a senior data analyst based in the San Francisco Bay Area. He has consulted for numerous businesses, with R, SQL, Microsoft Excel, and Tableau being his main tools. Sharenow was the technical editor for Norman Matloff’s previous book, Statistical Regression and Classification: From Linear Models to Machine Learning. In his spare time, he enjoys biking on Bay Area trails. His website is https://irasharenow.com.
BRIEF CONTENTS

Acknowledgments
Introduction

PART I: PROLOGUE, AND NEIGHBORHOOD-BASED METHODS
Chapter 1: Regression Models
Chapter 2: Classification Models
Chapter 3: Bias, Variance, Overfitting, and Cross-Validation
Chapter 4: Dealing with Large Numbers of Features

PART II: TREE-BASED METHODS
Chapter 5: A Step Beyond k-NN: Decision Trees
Chapter 6: Tweaking the Trees
Chapter 7: Finding a Good Set of Hyperparameters

PART III: METHODS BASED ON LINEAR RELATIONSHIPS
Chapter 8: Parametric Methods
Chapter 9: Cutting Things Down to Size: Regularization

PART IV: METHODS BASED ON SEPARATING LINES AND PLANES
Chapter 10: A Boundary Approach: Support Vector Machines
Chapter 11: Linear Models on Steroids: Neural Networks

PART V: APPLICATIONS
Chapter 12: Image Classification
Chapter 13: Handling Time Series and Text Data

Appendix A: List of Acronyms and Symbols
Appendix B: Statistics and ML Terminology Correspondence
Appendix C: Matrices, Data Frames, and Factor Conversions
Appendix D: Pitfall: Beware of “p-Hacking”!

Index
CONTENTS IN DETAIL

ACKNOWLEDGMENTS

INTRODUCTION
0.1 What Is ML?
0.2 The Role of Math in ML Theory and Practice
0.3 Why Another ML Book?
0.4 Recurring Special Sections
0.5 Background Needed
0.6 The qe*-Series Software
0.7 The Book’s Grand Plan
0.8 One More Point

PART I: PROLOGUE, AND NEIGHBORHOOD-BASED METHODS

1 REGRESSION MODELS
1.1 Example: The Bike Sharing Dataset
    1.1.1 Loading the Data
    1.1.2 A Look Ahead
1.2 Machine Learning and Prediction
    1.2.1 Predicting Past, Present, and Future
    1.2.2 Statistics vs. Machine Learning in Prediction
1.3 Introducing the k-Nearest Neighbors Method
    1.3.1 Predicting Bike Ridership with k-NN
1.4 Dummy Variables and Categorical Variables
1.5 Analysis with qeKNN()
    1.5.1 Predicting Bike Ridership with qeKNN()
1.6 The Regression Function: The Basis of ML
1.7 The Bias-Variance Trade-off
    1.7.1 Analogy to Election Polls
    1.7.2 Back to ML
1.8 Example: The mlb Dataset
1.9 k-NN and Categorical Features
1.10 Scaling
1.11 Choosing Hyperparameters
    1.11.1 Predicting the Training Data
1.12 Holdout Sets
    1.12.1 Loss Functions
    1.12.2 Holdout Sets in the qe*-Series
    1.12.3 Motivating Cross-Validation
    1.12.4 Hyperparameters, Dataset Size, and Number of Features
1.13 Pitfall: p-Hacking and Hyperparameter Selection
1.14 Pitfall: Long-Term Time Trends
1.15 Pitfall: Dirty Data
1.16 Pitfall: Missing Data
1.17 Direct Access to the regtools k-NN Code
1.18 Conclusions

2 CLASSIFICATION MODELS
2.1 Classification Is a Special Case of Regression
2.2 Example: The Telco Churn Dataset
    2.2.1 Pitfall: Factor Data Read as Non-factor
    2.2.2 Pitfall: Retaining Useless Features
    2.2.3 Dealing with NA Values
    2.2.4 Applying the k-Nearest Neighbors Method
    2.2.5 Pitfall: Overfitting Due to Features with Many Categories
2.3 Example: Vertebrae Data
    2.3.1 Analysis
2.4 Pitfall: Error Rate Improves Only Slightly Using the Features
2.5 The Confusion Matrix
2.6 Clearing the Confusion: Unbalanced Data
    2.6.1 Example: The Kaggle Appointments Dataset
    2.6.2 A Better Approach to Unbalanced Data
2.7 Receiver Operating Characteristic and Area Under Curve
    2.7.1 Details of ROC and AUC
    2.7.2 The qeROC() Function
    2.7.3 Example: Telco Churn Data
    2.7.4 Example: Vertebrae Data
    2.7.5 Pitfall: Overreliance on AUC
2.8 Conclusions

3 BIAS, VARIANCE, OVERFITTING, AND CROSS-VALIDATION
3.1 Overfitting and Underfitting
    3.1.1 Intuition Regarding the Number of Features and Overfitting
    3.1.2 Relation to Overall Dataset Size
    3.1.3 Well Then, What Are the Best Values of k and p?
3.2 Cross-Validation
    3.2.1 K-Fold Cross-Validation
    3.2.2 Using the replicMeans() Function
    3.2.3 Example: Programmer and Engineer Data
    3.2.4 Triple Cross-Validation
3.3 Conclusions

4 DEALING WITH LARGE NUMBERS OF FEATURES
4.1 Pitfall: Computational Issues in Large Datasets
4.2 Introduction to Dimension Reduction
    4.2.1 Example: The Million Song Dataset
    4.2.2 The Need for Dimension Reduction
4.3 Methods for Dimension Reduction
    4.3.1 Consolidation and Embedding
    4.3.2 The All Possible Subsets Method
    4.3.3 Principal Components Analysis
    4.3.4 But Now We Have Two Hyperparameters
    4.3.5 Using the qePCA() Wrapper
    4.3.6 PCs and the Bias-Variance Trade-off
4.4 The Curse of Dimensionality
4.5 Other Methods of Dimension Reduction
    4.5.1 Feature Ordering by Conditional Independence
    4.5.2 Uniform Manifold Approximation and Projection
4.6 Going Further Computationally
4.7 Conclusions

PART II: TREE-BASED METHODS

5 A STEP BEYOND K-NN: DECISION TREES
5.1 Basics of Decision Trees
5.2 The qeDT() Function
    5.2.1 Looking at the Plot
5.3 Example: New York City Taxi Data
    5.3.1 Pitfall: Too Many Combinations of Factor Levels
    5.3.2 Tree-Based Analysis
5.4 Example: Forest Cover Data
5.5 Decision Tree Hyperparameters: How to Split?
5.6 Hyperparameters in the qeDT() Function
5.7 Conclusions

6 TWEAKING THE TREES
6.1 Bias vs. Variance, Bagging, and Boosting
6.2 Bagging: Generating New Trees by Resampling
    6.2.1 Random Forests
    6.2.2 The qeRF() Function
    6.2.3 Example: Vertebrae Data
    6.2.4 Example: Remote-Sensing Soil Analysis
6.3 Boosting: Repeatedly Tweaking a Tree
    6.3.1 Implementation: AdaBoost
    6.3.2 Gradient Boosting
    6.3.3 Example: Call Network Monitoring
    6.3.4 Example: Vertebrae Data
    6.3.5 Bias vs. Variance in Boosting
    6.3.6 Computational Speed
    6.3.7 Further Hyperparameters
    6.3.8 The Learning Rate
6.4 Pitfall: No Free Lunch

7 FINDING A GOOD SET OF HYPERPARAMETERS
7.1 Combinations of Hyperparameters
7.2 Grid Searching with qeFT()
    7.2.1 How to Call qeFT()
7.3 Example: Programmer and Engineer Data
    7.3.1 Confidence Intervals
    7.3.2 The Takeaway on Grid Searching
7.4 Example: Programmer and Engineer Data
7.5 Example: Phoneme Data
7.6 Conclusions

PART III: METHODS BASED ON LINEAR RELATIONSHIPS

8 PARAMETRIC METHODS
8.1 Motivating Example: The Baseball Player Data
    8.1.1 A Graph to Guide Our Intuition
    8.1.2 View as Dimension Reduction
8.2 The lm() Function
8.3 Wrapper for lm() in the qe*-Series: qeLin()
8.4 Use of Multiple Features
    8.4.1 Example: Baseball Player, Continued
    8.4.2 Beta Notation
    8.4.3 Example: Airbnb Data
    8.4.4 Applying the Linear Model
8.5 Dimension Reduction
    8.5.1 Which Features Are Important?
    8.5.2 Statistical Significance and Dimension Reduction
8.6 Least Squares and Residuals
8.7 Diagnostics: Is the Linear Model Valid?
    8.7.1 Exactness?
    8.7.2 Diagnostic Methods
8.8 The R-Squared Value(s)
8.9 Classification Applications: The Logistic Model
    8.9.1 The glm() and qeLogit() Functions
    8.9.2 Example: Telco Churn Data
    8.9.3 Multiclass Case
    8.9.4 Example: Fall Detection Data
8.10 Bias and Variance in Linear/Generalized Linear Models
    8.10.1 Example: Bike Sharing Data
8.11 Polynomial Models
    8.11.1 Motivation
    8.11.2 Modeling Nonlinearity with a Linear Model
    8.11.3 Polynomial Logistic Regression
    8.11.4 Example: Programmer and Engineer Wages
8.12 Blending the Linear Model with Other Methods
8.13 The qeCompare() Function
    8.13.1 Need for Caution Regarding Polynomial Models
8.14 What’s Next

9 CUTTING THINGS DOWN TO SIZE: REGULARIZATION
9.1 Motivation
9.2 Size of a Vector
9.3 Ridge Regression and the LASSO
    9.3.1 How They Work
    9.3.2 The Bias-Variance Trade-off, Avoiding Overfitting
    9.3.3 Relation Between λ, n, and p
    9.3.4 Comparison, Ridge vs. LASSO
9.4 Software
9.5 Example: NYC Taxi Data
9.6 Example: Airbnb Data
9.7 Example: African Soil Data
    9.7.1 LASSO Analysis
9.8 Optional Section: The Famous LASSO Picture
9.9 Coming Up

PART IV: METHODS BASED ON SEPARATING LINES AND PLANES

10 A BOUNDARY APPROACH: SUPPORT VECTOR MACHINES
10.1 Motivation
    10.1.1 Example: The Forest Cover Dataset
10.2 Lines, Planes, and Hyperplanes
10.3 Math Notation
    10.3.1 Vector Expressions
    10.3.2 Dot Products
    10.3.3 SVM as a Parametric Model
10.4 SVM: The Basic Ideas—Separable Case
    10.4.1 Example: The Anderson Iris Dataset
    10.4.2 Optimizing Criterion
10.5 Major Problem: Lack of Linear Separability
    10.5.1 Applying a “Kernel”
    10.5.2 Soft Margin
10.6 Example: Forest Cover Data
10.7 And What About That Kernel Trick?
10.8 “Warning: Maximum Number of Iterations Reached”
10.9 Summary

11 LINEAR MODELS ON STEROIDS: NEURAL NETWORKS
11.1 Overview
11.2 Working on Top of a Complex Infrastructure
11.3 Example: Vertebrae Data
11.4 Neural Network Hyperparameters
11.5 Activation Functions
11.6 Regularization
    11.6.1 L1 and L2 Regularization
    11.6.2 Regularization by Dropout
11.7 Example: Fall Detection Data
11.8 Pitfall: Convergence Problems
11.9 Close Relation to Polynomial Regression
11.10 Bias vs. Variance in Neural Networks
11.11 Discussion

PART V: APPLICATIONS

12 IMAGE CLASSIFICATION
12.1 Example: The Fashion MNIST Data
    12.1.1 A First Try Using a Logit Model
    12.1.2 Refinement via PCA
12.2 Convolutional Models
    12.2.1 Need for Recognition of Locality
    12.2.2 Overview of Convolutional Methods
    12.2.3 Image Tiling
    12.2.4 The Convolution Operation
    12.2.5 The Pooling Operation
    12.2.6 Shape Evolution Across Layers
    12.2.7 Dropout
    12.2.8 Summary of Shape Evolution
    12.2.9 Translation Invariance
12.3 Tricks of the Trade
    12.3.1 Data Augmentation
    12.3.2 Pretrained Networks
12.4 So, What About the Overfitting Issue?
12.5 Conclusions

13 HANDLING TIME SERIES AND TEXT DATA
13.1 Converting Time Series Data to Rectangular Form
    13.1.1 Toy Example
    13.1.2 The regtools Function TStoX()
13.2 The qeTS() Function
13.3 Example: Weather Data
13.4 Bias vs. Variance
13.5 Text Applications
    13.5.1 The Bag-of-Words Model
    13.5.2 The qeText() Function
    13.5.3 Example: Quiz Data
    13.5.4 Example: AG News Dataset
13.6 Summary

A LIST OF ACRONYMS AND SYMBOLS

B STATISTICS AND ML TERMINOLOGY CORRESPONDENCE

C MATRICES, DATA FRAMES, AND FACTOR CONVERSIONS
C.1 Matrices
C.2 Conversions: Between R Factors and Dummy Variables, Between Data Frames and Matrices

D PITFALL: BEWARE OF “P-HACKING”!

INDEX
ACKNOWLEDGMENTS

Over my many years in the business, a number of people have been influential in my thinking on machine learning (ML) and statistical issues. I’ll just mention a few.

First, I learn as much from my research students as they learn from me. I cite, in particular, those who have recently done ML work with me or have otherwise interacted with me on ML: Vishal Chakraborty, Yu-Shih Chen, Xi Cheng, Melissa Goh, Rongkui Han, Lan Jiang, Tiffany Jiang, Collin Kennedy, Kenneth Lee, Pooja Rajkumar, Ariel Shin, Robert Tucker, Bochao Xin, Wenxi Zhang, and especially Zhiyuan (Daniel) Guo, Noah Perry, Robin Yancey, and Wenxuan (Allan) Zhao. It has also been quite stimulating to collaborate on ML issues with Bohdan Khomtchouk (University of Chicago), Professor Richard Levenson (UC Davis School of Medicine), and Pete Mohanty (Google).

I owe many thanks to Achim Zeileis, one of the authors of the partykit package of R routines, which provides excellent implementations of data partitioning methods. He has been quite helpful in guiding me through the nuances of the package. Nello Cristianini kindly read an initially flawed draft of the SVM chapter of my earlier book, Statistical Regression and Classification: From Linear Models to Machine Learning, thus improving my present chapter on the topic as well.

Writing a technical book is a form of teaching. In that regard, I owe much to the late Ray Redheffer, for whom I was a teaching assistant way back in grad school. He excelled at getting students in large calculus classes to also do some advanced topics, even some simple proofs. The message was clear to me: even when teaching at an elementary level, one should expect students to understand what they are doing rather than merely memorize formulas and mimic common solution patterns. In such an applied field with major consequences, it is all the more important that authors of ML books strive to have their readers understand what they are doing rather than simply learn library function call forms.
I’ll also cite the late Chuck Stone. He was my fellow native of Los Angeles, my informal statistical mentor in grad school, and my close friend over the several decades he lived in the Bay Area. He was one of the four original developers of the CART method and did pioneering work in the theory of k-nearest neighbor methodology. He was highly opinionated and blunt—but that is exactly the kind of person who makes one think, question, and defend one’s assumptions, whether they concern statistics, climate change, or the economy.

Ira Sharenow, the internal reviewer for both this book and Statistical Regression and Classification, once again did a first-rate job, as did No Starch Press’s Jill Franklin.

I am really indebted to Bill Pollock, publisher of No Starch Press, for producing this book and two previous ones. Also thanks to John Kimmel, editor of my three books at CRC Press, who recently retired after a distinguished career. I owe much to Bill and John for their longtime encouragement and guidance. Thanks to their help, hopefully I’ve figured out this book-writing business by now.