Machine Learning with R: Learn techniques for building and improving machine learning models
Author: Brett Lantz
Category: Education
📄 File Format: PDF
💾 File Size: 8.5 MB
📄 Text Preview (First 20 pages)
Machine Learning with R, Fourth Edition

Learn techniques for building and improving machine learning models, from data preparation to model tuning, evaluation, and working with big data

Brett Lantz

BIRMINGHAM—MUMBAI
Machine Learning with R, Fourth Edition

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Lead Senior Publishing Product Manager: Tushar Gupta
Acquisition Editor – Peer Reviews: Saby Dsilva
Project Editor: Janice Gonsalves
Content Development Editors: Bhavesh Amin and Elliot Dallow
Copy Editor: Safis Editor
Technical Editor: Karan Sonawane
Indexer: Hemangini Bari
Presentation Designer: Pranit Padwal
Developer Relations Marketing Executive: Monika Sangwan

First published: October 2013
Second edition: July 2015
Third edition: April 2019
Fourth edition: May 2023

Production reference: 1190523

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-80107-132-1
www.packt.com
Contributors

About the author

Brett Lantz (@DataSpelunking) has spent more than 15 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning while studying a large database of teenagers’ social network profiles. Brett is a DataCamp instructor and has taught machine learning workshops around the world. He is known to geek out about data science applications for sports, video games, autonomous vehicles, and foreign language learning, among many other subjects, and hopes to eventually blog about such topics at dataspelunking.com.

It is hard to describe how much my world has changed since the first edition of this book was published nearly ten years ago! My sons Will and Cal were born amidst the first and second editions, respectively, and have grown alongside my career. This edition, which consumed two years of weekends, would have been impossible without the backing of my wife, Jessica. Many thanks are due also to the friends, mentors, and supporters who opened the doors that led me along this unexpected data science journey.
About the reviewer

Daniel D. Gutierrez is an independent consultant in data science through his firm AMULET Analytics. He’s also a technology journalist, serving as Editor-in-Chief for insideBIGDATA.com, where he enjoys keeping his finger on the pulse of this fast-paced industry. Daniel is also an educator, having taught data science, machine learning and R classes at university level for many years. He currently teaches data science for UCLA Extension. He has authored four computer industry books on database and data science technology, including his most recent title, Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R. Daniel holds a BS in Mathematics and Computer Science from UCLA.

Join our book’s Discord space

Join our Discord community to meet like-minded people and learn alongside more than 4000 people at: https://packt.link/r
Table of Contents

Preface xvii

Chapter 1: Introducing Machine Learning 1
The origins of machine learning 2
Uses and abuses of machine learning 5
• Machine learning successes 7
• The limits of machine learning 8
• Machine learning ethics 10
How machines learn 14
• Data storage 16
• Abstraction 16
• Generalization 20
• Evaluation 22
Machine learning in practice 23
• Types of input data 24
• Types of machine learning algorithms 26
• Matching input data to algorithms 31
Machine learning with R 32
• Installing R packages 33
• Loading and unloading R packages 34
• Installing RStudio 34
• Why R and why R now? 36
Summary 38

Chapter 2: Managing and Understanding Data 39
R data structures 40
• Vectors 40
• Factors 43
• Lists 44
• Data frames 47
• Matrices and arrays 51
Managing data with R 52
• Saving, loading, and removing R data structures 52
• Importing and saving datasets from CSV files 54
• Importing common dataset formats using RStudio 56
Exploring and understanding data 59
• Exploring the structure of data 60
• Exploring numeric features 61
• Measuring the central tendency – mean and median 62
• Measuring spread – quartiles and the five-number summary 64
• Visualizing numeric features – boxplots 66
• Visualizing numeric features – histograms 68
• Understanding numeric data – uniform and normal distributions 70
• Measuring spread – variance and standard deviation 71
• Exploring categorical features 73
• Measuring the central tendency – the mode 74
• Exploring relationships between features 76
• Visualizing relationships – scatterplots 76
• Examining relationships – two-way cross-tabulations 78
Summary 82

Chapter 3: Lazy Learning – Classification Using Nearest Neighbors 83
Understanding nearest neighbor classification 84
• The k-NN algorithm 84
• Measuring similarity with distance 88
• Choosing an appropriate k 90
• Preparing data for use with k-NN 91
• Why is the k-NN algorithm lazy? 94
Example – diagnosing breast cancer with the k-NN algorithm 95
• Step 1 – collecting data 96
• Step 2 – exploring and preparing the data 96
• Transformation – normalizing numeric data 98
• Data preparation – creating training and test datasets 100
• Step 3 – training a model on the data 101
• Step 4 – evaluating model performance 103
• Step 5 – improving model performance 104
• Transformation – z-score standardization 104
• Testing alternative values of k 106
Summary 107

Chapter 4: Probabilistic Learning – Classification Using Naive Bayes 109
Understanding Naive Bayes 110
• Basic concepts of Bayesian methods 110
• Understanding probability 111
• Understanding joint probability 112
• Computing conditional probability with Bayes’ theorem 114
• The Naive Bayes algorithm 117
• Classification with Naive Bayes 118
• The Laplace estimator 120
• Using numeric features with Naive Bayes 122
Example – filtering mobile phone spam with the Naive Bayes algorithm 123
• Step 1 – collecting data 124
• Step 2 – exploring and preparing the data 125
• Data preparation – cleaning and standardizing text data 126
• Data preparation – splitting text documents into words 132
• Data preparation – creating training and test datasets 135
• Visualizing text data – word clouds 136
• Data preparation – creating indicator features for frequent words 139
• Step 3 – training a model on the data 141
• Step 4 – evaluating model performance 143
• Step 5 – improving model performance 144
Summary 145

Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules 147
Understanding decision trees 148
• Divide and conquer 149
• The C5.0 decision tree algorithm 153
• Choosing the best split 154
• Pruning the decision tree 157
Example – identifying risky bank loans using C5.0 decision trees 158
• Step 1 – collecting data 159
• Step 2 – exploring and preparing the data 159
• Data preparation – creating random training and test datasets 161
• Step 3 – training a model on the data 163
• Step 4 – evaluating model performance 169
• Step 5 – improving model performance 170
• Boosting the accuracy of decision trees 170
• Making some mistakes cost more than others 173
Understanding classification rules 175
• Separate and conquer 176
• The 1R algorithm 178
• The RIPPER algorithm 180
• Rules from decision trees 182
• What makes trees and rules greedy? 183
Example – identifying poisonous mushrooms with rule learners 185
• Step 1 – collecting data 186
• Step 2 – exploring and preparing the data 186
• Step 3 – training a model on the data 187
• Step 4 – evaluating model performance 189
• Step 5 – improving model performance 190
Summary 194

Chapter 6: Forecasting Numeric Data – Regression Methods 197
Understanding regression 198
• Simple linear regression 200
• Ordinary least squares estimation 203
• Correlations 205
• Multiple linear regression 207
• Generalized linear models and logistic regression 212
Example – predicting auto insurance claims costs using linear regression 218
• Step 1 – collecting data 219
• Step 2 – exploring and preparing the data 220
• Exploring relationships between features – the correlation matrix 223
• Visualizing relationships between features – the scatterplot matrix 224
• Step 3 – training a model on the data 227
• Step 4 – evaluating model performance 230
• Step 5 – improving model performance 232
• Model specification – adding nonlinear relationships 232
• Model specification – adding interaction effects 233
• Putting it all together – an improved regression model 233
• Making predictions with a regression model 235
• Going further – predicting insurance policyholder churn with logistic regression 238
Understanding regression trees and model trees 245
• Adding regression to trees 246
Example – estimating the quality of wines with regression trees and model trees 248
• Step 1 – collecting data 249
• Step 2 – exploring and preparing the data 250
• Step 3 – training a model on the data 252
• Visualizing decision trees 255
• Step 4 – evaluating model performance 257
• Measuring performance with the mean absolute error 257
• Step 5 – improving model performance 259
Summary 262

Chapter 7: Black-Box Methods – Neural Networks and Support Vector Machines 265
Understanding neural networks 266
• From biological to artificial neurons 267
• Activation functions 269
• Network topology 273
• The number of layers 273
• The direction of information travel 275
• The number of nodes in each layer 277
• Training neural networks with backpropagation 278
Example – modeling the strength of concrete with ANNs 281
• Step 1 – collecting data 281
• Step 2 – exploring and preparing the data 282
• Step 3 – training a model on the data 284
• Step 4 – evaluating model performance 287
• Step 5 – improving model performance 288
Understanding support vector machines 294
• Classification with hyperplanes 295
• The case of linearly separable data 297
• The case of nonlinearly separable data 299
• Using kernels for nonlinear spaces 300
Example – performing OCR with SVMs 302
• Step 1 – collecting data 303
• Step 2 – exploring and preparing the data 304
• Step 3 – training a model on the data 305
• Step 4 – evaluating model performance 308
• Step 5 – improving model performance 310
• Changing the SVM kernel function 310
• Identifying the best SVM cost parameter 311
Summary 313

Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules 315
Understanding association rules 316
• The Apriori algorithm for association rule learning 317
• Measuring rule interest – support and confidence 319
• Building a set of rules with the Apriori principle 320
Example – identifying frequently purchased groceries with association rules 321
• Step 1 – collecting data 322
• Step 2 – exploring and preparing the data 323
• Data preparation – creating a sparse matrix for transaction data 324
• Visualizing item support – item frequency plots 328
• Visualizing the transaction data – plotting the sparse matrix 330
• Step 3 – training a model on the data 331
• Step 4 – evaluating model performance 335
• Step 5 – improving model performance 339
• Sorting the set of association rules 340
• Taking subsets of association rules 341
• Saving association rules to a file or data frame 342
• Using the Eclat algorithm for greater efficiency 343
Summary 345

Chapter 9: Finding Groups of Data – Clustering with k-means 347
Understanding clustering 348
• Clustering as a machine learning task 348
• Clusters of clustering algorithms 351
• The k-means clustering algorithm 356
• Using distance to assign and update clusters 357
• Choosing the appropriate number of clusters 362
Finding teen market segments using k-means clustering 364
• Step 1 – collecting data 364
• Step 2 – exploring and preparing the data 365
• Data preparation – dummy coding missing values 367
• Data preparation – imputing the missing values 368
• Step 3 – training a model on the data 370
• Step 4 – evaluating model performance 373
• Step 5 – improving model performance 377
Summary 379

Chapter 10: Evaluating Model Performance 381
Measuring performance for classification 382
• Understanding a classifier’s predictions 383
• A closer look at confusion matrices 387
• Using confusion matrices to measure performance 389
• Beyond accuracy – other measures of performance 391
• The kappa statistic 393
• The Matthews correlation coefficient 397
• Sensitivity and specificity 400
• Precision and recall 401
• The F-measure 403
• Visualizing performance tradeoffs with ROC curves 404
• Comparing ROC curves 409
• The area under the ROC curve 412
• Creating ROC curves and computing AUC in R 413
Estimating future performance 416
• The holdout method 417
• Cross-validation 421
• Bootstrap sampling 425
Summary 427

Chapter 11: Being Successful with Machine Learning 429
What makes a successful machine learning practitioner? 430
What makes a successful machine learning model? 432
• Avoiding obvious predictions 436
• Conducting fair evaluations 439
• Considering real-world impacts 443
• Building trust in the model 448
Putting the “science” in data science 452
• Using R Notebooks and R Markdown 456
• Performing advanced data exploration 460
• Constructing a data exploration roadmap 461
• Encountering outliers: a real-world pitfall 464
• Example – using ggplot2 for visual data exploration 467
Summary 480

Chapter 12: Advanced Data Preparation 483
Performing feature engineering 484
• The role of human and machine 485
• The impact of big data and deep learning 489
Feature engineering in practice 496
• Hint 1: Brainstorm new features 497
• Hint 2: Find insights hidden in text 498
• Hint 3: Transform numeric ranges 500
• Hint 4: Observe neighbors’ behavior 501
• Hint 5: Utilize related rows 503
• Hint 6: Decompose time series 504
• Hint 7: Append external data 509
Exploring R’s tidyverse 511
• Making tidy table structures with tibbles 512
• Reading rectangular files faster with readr and readxl 513
• Preparing and piping data with dplyr 515
• Transforming text with stringr 520
• Cleaning dates with lubridate 526
Summary 531

Chapter 13: Challenging Data – Too Much, Too Little, Too Complex 533
The challenge of high-dimension data 534
• Applying feature selection 536
• Filter methods 538
• Wrapper methods and embedded methods 539
• Example – Using stepwise regression for feature selection 541
• Example – Using Boruta for feature selection 545
• Performing feature extraction 548
• Understanding principal component analysis 548
• Example – Using PCA to reduce highly dimensional social media data 553
Making use of sparse data 562
• Identifying sparse data 562
• Example – Remapping sparse categorical data 563
• Example – Binning sparse numeric data 567
Handling missing data 572
• Understanding types of missing data 573
• Performing missing value imputation 575
• Simple imputation with missing value indicators 576
• Missing value patterns 577
The problem of imbalanced data 579
• Simple strategies for rebalancing data 580
• Generating a synthetic balanced dataset with SMOTE 583
• Example – Applying the SMOTE algorithm in R 584
• Considering whether balanced is always better 587
Summary 589

Chapter 14: Building Better Learners 591
Tuning stock models for better performance 592
• Determining the scope of hyperparameter tuning 593
• Example – using caret for automated tuning 598
• Creating a simple tuned model 601
• Customizing the tuning process 604
Improving model performance with ensembles 609
• Understanding ensemble learning 610
• Popular ensemble-based algorithms 613
• Bagging 613
• Boosting 615
• Random forests 618
• Gradient boosting 624
• Extreme gradient boosting with XGBoost 629
• Why are tree-based ensembles so popular? 636
Stacking models for meta-learning 638
• Understanding model stacking and blending 640
• Practical methods for blending and stacking in R 642
Summary 645

Chapter 15: Making Use of Big Data 647
Practical applications of deep learning 648
• Beginning with deep learning 649
• Choosing appropriate tasks for deep learning 650
• The TensorFlow and Keras deep learning frameworks 653
• Understanding convolutional neural networks 655
• Transfer learning and fine tuning 658
• Example – classifying images using a pre-trained CNN in R 659
Unsupervised learning and big data 666
• Representing highly dimensional concepts as embeddings 667
• Understanding word embeddings 669
• Example – using word2vec for understanding text in R 671
• Visualizing highly dimensional data 680
• The limitations of using PCA for big data visualization 681
• Understanding the t-SNE algorithm 683
• Example – visualizing data’s natural clusters with t-SNE 686
Adapting R to handle large datasets 691
• Querying data in SQL databases 692
• The tidy approach to managing database connections 692
• Using a database backend for dplyr with dbplyr 695
• Doing work faster with parallel processing 697
• Measuring R’s execution time 699
• Enabling parallel processing in R 699
• Taking advantage of parallel with foreach and doParallel 702
• Training and evaluating models in parallel with caret 704
• Utilizing specialized hardware and algorithms 705
• Parallel computing with MapReduce concepts via Apache Spark 706
• Learning via distributed and scalable algorithms with H2O 708
• GPU computing 710
Summary 712

Other Books You May Enjoy 715
Index 721
Preface

Machine learning, at its core, describes algorithms that transform data into actionable intelligence. This fact makes machine learning well suited to the present-day era of big data. Without machine learning, it would be nearly impossible to make sense of the massive streams of information that are now all around us.

The cross-platform, zero-cost statistical programming environment called R provides an ideal pathway to start applying machine learning. R offers powerful but easy-to-learn tools that can assist you with finding insights in your own data. By combining hands-on case studies with the essential theory needed to understand how these algorithms work, this book delivers all the knowledge you need to get started with machine learning and to apply its methods to your own projects.

Who this book is for

This book is aimed at people in applied fields—business analysts, social scientists, and others—who have access to data and hope to use it for action. Perhaps you already know a bit about machine learning, but have never used R; or, perhaps you know a little about R, but are new to machine learning. Maybe you are completely new to both! In any case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic math and programming concepts, but no prior experience is required. All you need is curiosity.

What this book covers

Chapter 1, Introducing Machine Learning, presents the terminology and concepts that define and distinguish machine learners, as well as a method for matching a learning task with the appropriate algorithm.
Chapter 2, Managing and Understanding Data, provides an opportunity to get your hands dirty working with data in R. Essential data structures and procedures used for loading, exploring, and understanding data are discussed.

Chapter 3, Lazy Learning – Classification Using Nearest Neighbors, teaches you how to understand and apply a simple yet powerful machine learning algorithm to your first real-world task: identifying malignant samples of cancer.

Chapter 4, Probabilistic Learning – Classification Using Naive Bayes, reveals the essential concepts of probability that are used in cutting-edge spam filtering systems. You’ll learn the basics of text mining in the process of building your own spam filter.

Chapter 5, Divide and Conquer – Classification Using Decision Trees and Rules, explores a couple of learning algorithms whose predictions are not only accurate, but also easily explained. We’ll apply these methods to tasks where transparency is important.

Chapter 6, Forecasting Numeric Data – Regression Methods, introduces machine learning algorithms used for making numeric predictions. As these techniques are heavily embedded in the field of statistics, you will also learn the essential metrics needed to make sense of numeric relationships.

Chapter 7, Black-Box Methods – Neural Networks and Support Vector Machines, covers two complex but powerful machine learning algorithms. Though the math may appear intimidating, we will work through examples that illustrate their inner workings in simple terms.

Chapter 8, Finding Patterns – Market Basket Analysis Using Association Rules, exposes the algorithm used in the recommendation systems employed by many retailers. If you’ve ever wondered how retailers seem to know your purchasing habits better than you know yourself, this chapter will reveal their secrets.

Chapter 9, Finding Groups of Data – Clustering with k-means, is devoted to a procedure that locates clusters of related items. We’ll utilize this algorithm to identify profiles within an online community.

Chapter 10, Evaluating Model Performance, provides information on measuring the success of a machine learning project and obtaining a reliable estimate of the learner’s performance on future data.

Chapter 11, Being Successful with Machine Learning, describes the common pitfalls faced when transitioning from textbook datasets to real-world machine learning problems, as well as the tools, strategies, and soft skills needed to combat these issues.
Chapter 12, Advanced Data Preparation, introduces the set of “tidyverse” packages, which help wrangle large datasets to extract meaningful information to aid the machine learning process.

Chapter 13, Challenging Data – Too Much, Too Little, Too Complex, considers solutions to a common set of problems that can derail a machine learning project when the useful information is lost within a massive dataset, much like a needle in a haystack.

Chapter 14, Building Better Learners, reveals the methods employed by the teams at the top of machine learning competition leaderboards. If you have a competitive streak, or simply want to get the most out of your data, you’ll need to add these techniques to your repertoire.

Chapter 15, Making Use of Big Data, explores the frontiers of machine learning. From working with extremely large datasets to making R work faster, the topics covered will help you push the boundaries of what is possible with R, and even allow you to utilize the sophisticated tools developed by large organizations like Google for image recognition and understanding text data.

What you need for this book

The examples in this book were tested with R version 4.2.2 on Microsoft Windows, Mac OS X, and Linux, although they are likely to work with any recent version of R. R can be downloaded at no cost at https://cran.r-project.org/. The RStudio interface, which is described in more detail in Chapter 1, Introducing Machine Learning, is a highly recommended add-on for R that greatly enhances the user experience. The RStudio Open Source Edition is available free of charge from Posit (https://www.posit.co/) alongside a paid RStudio Pro Edition that offers priority support and additional features for commercial organizations.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Machine-Learning-with-R-Fourth-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/TZ7os.
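Following up on the What you need for this book section above, the lines below are a minimal sketch, not taken from the book, of how a reader might confirm their R version and install a package from CRAN before starting the examples. The kernlab package named here is purely a hypothetical stand-in; each chapter of the book identifies the packages it actually requires.

# Minimal environment check (a sketch, not from the book).
# "kernlab" is a stand-in example package, not a requirement at this point.

R.version.string                  # the book's examples were tested with R 4.2.2

if (!requireNamespace("kernlab", quietly = TRUE)) {
  install.packages("kernlab")     # fetch the package from CRAN on first use
}
library(kernlab)                  # attach the package for the current session

Running install.packages() this way only downloads a package the first time; afterward, library() alone is enough to load it in each new R session.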