Uploader

高宏飞

Shared on 2026-02-12

Author: Robert Kabacoff

R in Action, Second Edition presents both the R language and the examples that make it so useful for business developers. Focusing on practical solutions, the book offers a crash course in statistics and covers elegant methods for dealing with messy and incomplete data that are difficult to analyze using traditional methods. You'll also master R's extensive graphical capabilities for exploring and presenting data visually. And this expanded second edition includes new chapters on time series analysis, cluster analysis, and classification methodologies, including decision trees, random forests, and support vector machines.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Technology
Business pros and researchers thrive on data, and R speaks the language of data analysis. R is a powerful programming language for statistical computing. Unlike general-purpose tools, R provides thousands of modules for solving just about any data-crunching or presentation challenge you're likely to face. R runs on all important platforms and is used by thousands of major corporations and institutions worldwide.

About the Book
R in Action, Second Edition teaches you how to use the R language by presenting examples relevant to scientific, technical, and business developers. Focusing on practical solutions, the book offers a crash course in statistics, including elegant methods for dealing with messy and incomplete data. You'll also master R's extensive graphical capabilities for exploring and presenting data visually. And this expanded second edition includes new chapters on forecasting, data mining, and dynamic report writing.

What's Inside
■ Complete R language tutorial
■ Using R to manage, analyze, and visualize data
■ Techniques for debugging programs and creating packages
■ OOP in R
■ Over 160 graphs

About the Author
Dr. Rob Kabacoff is a seasoned researcher and teacher who specializes in data analysis. He also maintai

ISBN: 1617291382
Publisher: Manning Publications
Publish Year: 2015
Language: English
Pages: 608
File Format: PDF
File Size: 19.3 MB
Text Preview (First 20 pages)

R IN ACTION
SECOND EDITION
Data analysis and graphics with R
Robert I. Kabacoff
MANNING
Praise for the First Edition

Lucid and engaging - this is without doubt the fun way to learn R!
- Amos A. Folarin, University College London

Be prepared to quickly raise the bar with the sheer quality that R can produce.
- Patrick Breen, Rogers Communications Inc.

An excellent introduction and reference on R from the author of the best R website.
- Christopher Williams, University of Idaho

Thorough and readable. A great R companion for the student or researcher.
- Samuel McQuillin, University of South Carolina

Finally, a comprehensive introduction to R for programmers.
- Philipp K. Janert, Author of Gnuplot in Action

Essential reading for anybody moving to R for the first time.
- Charles Malpas, University of Melbourne

One of the quickest routes to R proficiency. You can buy the book on Friday and have a working program by Monday.
- Elizabeth Ostrowski, Baylor College of Medicine

One usually buys a book to solve the problems they know they have. This book solves problems you didn't know you had.
- Carles Fenollosa, Barcelona Supercomputing Center

Clear, precise, and comes with a lot of explanations and examples…the book can be used by beginners and professionals alike, and even for teaching R!
- Atef Ouni, Tunisian National Institute of Statistics

A great balance of targeted tutorials and in-depth examples.
- Landon Cox, 360VL Inc.

Licensed to Mark Watson <nordickan@gmail.com>
R in Action
SECOND EDITION
Data analysis and graphics with R
ROBERT I. KABACOFF
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2015 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Jennifer Stout
Copyeditor: Tiffany Taylor
Proofreader: Toma Mulligan
Typesetter: Marija Tudor
Cover designer: Marija Tudor

ISBN: 9781617291388
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 20 19 18 17 16 15
brief contents

PART 1 GETTING STARTED
1 ■ Introduction to R
2 ■ Creating a dataset
3 ■ Getting started with graphs
4 ■ Basic data management
5 ■ Advanced data management

PART 2 BASIC METHODS
6 ■ Basic graphs
7 ■ Basic statistics

PART 3 INTERMEDIATE METHODS
8 ■ Regression
9 ■ Analysis of variance
10 ■ Power analysis
11 ■ Intermediate graphs
12 ■ Resampling statistics and bootstrapping

PART 4 ADVANCED METHODS
13 ■ Generalized linear models
14 ■ Principal components and factor analysis
15 ■ Time series
16 ■ Cluster analysis
17 ■ Classification
18 ■ Advanced methods for missing data

PART 5 EXPANDING YOUR SKILLS
19 ■ Advanced graphics with ggplot2
20 ■ Advanced programming
21 ■ Creating a package
22 ■ Creating dynamic reports
23 ■ Advanced graphics with the lattice package (online only)
contents

preface
acknowledgments
about this book
about the cover illustration

PART 1 GETTING STARTED

1 Introduction to R
    1.1 Why use R?
    1.2 Obtaining and installing R
    1.3 Working with R
        Getting started ■ Getting help ■ The workspace ■ Input and output
    1.4 Packages
        What are packages? ■ Installing a package ■ Loading a package ■ Learning about a package
    1.5 Batch processing
    1.6 Using output as input: reusing results
    1.7 Working with large datasets
    1.8 Working through an example
    1.9 Summary

2 Creating a dataset
    2.1 Understanding datasets
    2.2 Data structures
        Vectors ■ Matrices ■ Arrays ■ Data frames ■ Factors ■ Lists
    2.3 Data input
        Entering data from the keyboard ■ Importing data from a delimited text file ■ Importing data from Excel ■ Importing data from XML ■ Importing data from the web ■ Importing data from SPSS ■ Importing data from SAS ■ Importing data from Stata ■ Importing data from NetCDF ■ Importing data from HDF5 ■ Accessing database management systems (DBMSs) ■ Importing data via Stat/Transfer
    2.4 Annotating datasets
        Variable labels ■ Value labels
    2.5 Useful functions for working with data objects
    2.6 Summary

3 Getting started with graphs
    3.1 Working with graphs
    3.2 A simple example
    3.3 Graphical parameters
        Symbols and lines ■ Colors ■ Text characteristics ■ Graph and margin dimensions
    3.4 Adding text, customized axes, and legends
        Titles ■ Axes ■ Reference lines ■ Legend ■ Text annotations ■ Math annotations
    3.5 Combining graphs
        Creating a figure arrangement with fine control
    3.6 Summary

4 Basic data management
    4.1 A working example
    4.2 Creating new variables
    4.3 Recoding variables
    4.4 Renaming variables
    4.5 Missing values
        Recoding values to missing ■ Excluding missing values from analyses
    4.6 Date values
        Converting dates to character variables ■ Going further
    4.7 Type conversions
    4.8 Sorting data
    4.9 Merging datasets
        Adding columns to a data frame ■ Adding rows to a data frame
    4.10 Subsetting datasets
        Selecting (keeping) variables ■ Excluding (dropping) variables ■ Selecting observations ■ The subset() function ■ Random samples
    4.11 Using SQL statements to manipulate data frames
    4.12 Summary

5 Advanced data management
    5.1 A data-management challenge
    5.2 Numerical and character functions
        Mathematical functions ■ Statistical functions ■ Probability functions ■ Character functions ■ Other useful functions ■ Applying functions to matrices and data frames
    5.3 A solution for the data-management challenge
    5.4 Control flow
        Repetition and looping ■ Conditional execution
    5.5 User-written functions
    5.6 Aggregation and reshaping
        Transpose ■ Aggregating data ■ The reshape2 package
    5.7 Summary
PART 2 BASIC METHODS

6 Basic graphs
    6.1 Bar plots
        Simple bar plots ■ Stacked and grouped bar plots ■ Mean bar plots ■ Tweaking bar plots ■ Spinograms
    6.2 Pie charts
    6.3 Histograms
    6.4 Kernel density plots
    6.5 Box plots
        Using parallel box plots to compare groups ■ Violin plots
    6.6 Dot plots
    6.7 Summary

7 Basic statistics
    7.1 Descriptive statistics
        A menagerie of methods ■ Even more methods ■ Descriptive statistics by group ■ Additional methods by group ■ Visualizing results
    7.2 Frequency and contingency tables
        Generating frequency tables ■ Tests of independence ■ Measures of association ■ Visualizing results
    7.3 Correlations
        Types of correlations ■ Testing correlations for significance ■ Visualizing correlations
    7.4 T-tests
        Independent t-test ■ Dependent t-test ■ When there are more than two groups
    7.5 Nonparametric tests of group differences
        Comparing two groups ■ Comparing more than two groups
    7.6 Visualizing group differences
    7.7 Summary
PART 3 INTERMEDIATE METHODS

8 Regression
    8.1 The many faces of regression
        Scenarios for using OLS regression ■ What you need to know
    8.2 OLS regression
        Fitting regression models with lm() ■ Simple linear regression ■ Polynomial regression ■ Multiple linear regression ■ Multiple linear regression with interactions
    8.3 Regression diagnostics
        A typical approach ■ An enhanced approach ■ Global validation of linear model assumption ■ Multicollinearity
    8.4 Unusual observations
        Outliers ■ High-leverage points ■ Influential observations
    8.5 Corrective measures
        Deleting observations ■ Transforming variables ■ Adding or deleting variables ■ Trying a different approach
    8.6 Selecting the “best” regression model
        Comparing models ■ Variable selection
    8.7 Taking the analysis further
        Cross-validation ■ Relative importance
    8.8 Summary

9 Analysis of variance
    9.1 A crash course on terminology
    9.2 Fitting ANOVA models
        The aov() function ■ The order of formula terms
    9.3 One-way ANOVA
        Multiple comparisons ■ Assessing test assumptions
    9.4 One-way ANCOVA
        Assessing test assumptions ■ Visualizing the results
    9.5 Two-way factorial ANOVA
    9.6 Repeated measures ANOVA
    9.7 Multivariate analysis of variance (MANOVA)
        Assessing test assumptions ■ Robust MANOVA
    9.8 ANOVA as regression
    9.9 Summary

10 Power analysis
    10.1 A quick review of hypothesis testing
    10.2 Implementing power analysis with the pwr package
        t-tests ■ ANOVA ■ Correlations ■ Linear models ■ Tests of proportions ■ Chi-square tests ■ Choosing an appropriate effect size in novel situations
    10.3 Creating power analysis plots
    10.4 Other packages
    10.5 Summary

11 Intermediate graphs
    11.1 Scatter plots
        Scatter-plot matrices ■ High-density scatter plots ■ 3D scatter plots ■ Spinning 3D scatter plots ■ Bubble plots
    11.2 Line charts
    11.3 Corrgrams
    11.4 Mosaic plots
    11.5 Summary

12 Resampling statistics and bootstrapping
    12.1 Permutation tests
    12.2 Permutation tests with the coin package
        Independent two-sample and k-sample tests ■ Independence in contingency tables ■ Independence between numeric variables ■ Dependent two-sample and k-sample tests ■ Going further
    12.3 Permutation tests with the lmPerm package
        Simple and polynomial regression ■ Multiple regression ■ One-way ANOVA and ANCOVA ■ Two-way ANOVA
    12.4 Additional comments on permutation tests
    12.5 Bootstrapping
    12.6 Bootstrapping with the boot package
        Bootstrapping a single statistic ■ Bootstrapping several statistics
    12.7 Summary

PART 4 ADVANCED METHODS

13 Generalized linear models
    13.1 Generalized linear models and the glm() function
        The glm() function ■ Supporting functions ■ Model fit and regression diagnostics
    13.2 Logistic regression
        Interpreting the model parameters ■ Assessing the impact of predictors on the probability of an outcome ■ Overdispersion ■ Extensions
    13.3 Poisson regression
        Interpreting the model parameters ■ Overdispersion ■ Extensions
    13.4 Summary

14 Principal components and factor analysis
    14.1 Principal components and factor analysis in R
    14.2 Principal components
        Selecting the number of components to extract ■ Extracting principal components ■ Rotating principal components ■ Obtaining principal components scores
    14.3 Exploratory factor analysis
        Deciding how many common factors to extract ■ Extracting common factors ■ Rotating factors ■ Factor scores ■ Other EFA-related packages
    14.4 Other latent variable models
    14.5 Summary

15 Time series
    15.1 Creating a time-series object in R
    15.2 Smoothing and seasonal decomposition
        Smoothing with simple moving averages ■ Seasonal decomposition
    15.3 Exponential forecasting models
        Simple exponential smoothing ■ Holt and Holt-Winters exponential smoothing ■ The ets() function and automated forecasting
    15.4 ARIMA forecasting models
        Prerequisite concepts ■ ARMA and ARIMA models ■ Automated ARIMA forecasting
    15.5 Going further
    15.6 Summary

16 Cluster analysis
    16.1 Common steps in cluster analysis
    16.2 Calculating distances
    16.3 Hierarchical cluster analysis
    16.4 Partitioning cluster analysis
        K-means clustering ■ Partitioning around medoids
    16.5 Avoiding nonexistent clusters
    16.6 Summary

17 Classification
    17.1 Preparing the data
    17.2 Logistic regression
    17.3 Decision trees
        Classical decision trees ■ Conditional inference trees
    17.4 Random forests
    17.5 Support vector machines
        Tuning an SVM
    17.6 Choosing a best predictive solution
    17.7 Using the rattle package for data mining
    17.8 Summary

18 Advanced methods for missing data
    18.1 Steps in dealing with missing data
    18.2 Identifying missing values
    18.3 Exploring missing-values patterns
        Tabulating missing values ■ Exploring missing data visually ■ Using correlations to explore missing values
    18.4 Understanding the sources and impact of missing data
    18.5 Rational approaches for dealing with incomplete data
    18.6 Complete-case analysis (listwise deletion)
    18.7 Multiple imputation
    18.8 Other approaches to missing data
        Pairwise deletion ■ Simple (nonstochastic) imputation
    18.9 Summary

PART 5 EXPANDING YOUR SKILLS

19 Advanced graphics with ggplot2
    19.1 The four graphics systems in R
    19.2 An introduction to the ggplot2 package
    19.3 Specifying the plot type with geoms
    19.4 Grouping
    19.5 Faceting
    19.6 Adding smoothed lines
    19.7 Modifying the appearance of ggplot2 graphs
        Axes ■ Legends ■ Scales ■ Themes ■ Multiple graphs per page
    19.8 Saving graphs
    19.9 Summary

20 Advanced programming
    20.1 A review of the language
        Data types ■ Control structures ■ Creating functions
    20.2 Working with environments
    20.3 Object-oriented programming
        Generic functions ■ Limitations of the S3 model
    20.4 Writing efficient code
    20.5 Debugging
        Common sources of errors ■ Debugging tools ■ Session options that support debugging
    20.6 Going further
    20.7 Summary

21 Creating a package
    21.1 Nonparametric analysis and the npar package
        Comparing groups with the npar package
    21.2 Developing the package
        Computing the statistics ■ Printing the results ■ Summarizing the results ■ Plotting the results ■ Adding sample data to the package
    21.3 Creating the package documentation
    21.4 Building the package
    21.5 Going further
    21.6 Summary

22 Creating dynamic reports
    22.1 A template approach to reports
    22.2 Creating dynamic reports with R and Markdown
    22.3 Creating dynamic reports with R and LaTeX
    22.4 Creating dynamic reports with R and Open Document
    22.5 Creating dynamic reports with R and Microsoft Word
    22.6 Summary

afterword Into the rabbit hole
appendix A Graphical user interfaces
appendix B Customizing the startup environment
appendix C Exporting data from R
appendix D Matrix algebra in R
appendix E Packages used in this book
appendix F Working with large datasets
appendix G Updating an R installation
references
index

bonus chapter 23 Advanced graphics with the lattice package
    available online at manning.com/RinActionSecondEdition
    also available in this eBook
preface

What is the use of a book, without pictures or conversations?
- Alice, Alice’s Adventures in Wonderland

It’s wondrous, with treasures to satiate desires both subtle and gross; but it’s not for the timid.
- Q, “Q Who?” Star Trek: The Next Generation

When I began writing this book, I spent quite a bit of time searching for a good quote to start things off. I ended up with two. R is a wonderfully flexible platform and language for exploring, visualizing, and understanding data. I chose the quote from Alice’s Adventures in Wonderland to capture the flavor of statistical analysis today: an interactive process of exploration, visualization, and interpretation.

The second quote reflects the generally held notion that R is difficult to learn. What I hope to show you is that it doesn’t have to be. R is broad and powerful, with so many analytic and graphic functions available (more than 50,000 at last count) that it easily intimidates both novice and experienced users alike. But there is rhyme and reason to the apparent madness. With guidelines and instructions, you can navigate the tremendous resources available, selecting the tools you need to accomplish your work with style, elegance, efficiency, and more than a little coolness.

I first encountered R several years ago, when applying for a new statistical consulting position. The prospective employer asked in the pre-interview material if I was conversant in R. Following the standard advice of recruiters, I immediately said yes, and set off to learn it. I was an experienced statistician and researcher, had 25 years’ experience as an SAS and SPSS programmer, and was fluent in a half dozen programming languages. How hard could it be? Famous last words.

As I tried to learn the language (as fast as possible, with an interview looming), I found either tomes on the underlying structure of the language or dense treatises on specific advanced statistical methods, written by and for subject-matter experts. The online help was written in a spartan style that was more reference than tutorial. Every time I thought I had a handle on the overall organization and capabilities of R, I found something new that made me feel ignorant and small.

To make sense of it all, I approached R as a data scientist. I thought about what it takes to successfully process, analyze, and understand data, including
■ Accessing the data (getting the data into the application from multiple sources)
■ Cleaning the data (coding missing data, fixing or deleting miscoded data, transforming variables into more useful formats)
■ Annotating the data (in order to remember what each piece represents)
■ Summarizing the data (getting descriptive statistics to help characterize the data)
■ Visualizing the data (because a picture really is worth a thousand words)
■ Modeling the data (uncovering relationships and testing hypotheses)
■ Preparing the results (creating publication-quality tables and graphs)

Then I tried to understand how I could use R to accomplish each of these tasks. Because I learn best by teaching, I eventually created a website (www.statmethods.net) to document what I had learned.

Then, about a year later, Marjan Bace, Manning’s publisher, called and asked if I would like to write a book on R. I had already written 50 journal articles, 4 technical manuals, numerous book chapters, and a book on research methodology, so how hard could it be? At the risk of sounding repetitive: famous last words.
A year after the first edition came out in 2011, I started working on the second edition. The R platform is evolving, and I wanted to describe these new developments. I also wanted to expand the coverage of predictive analytics and data mining, important topics in the world of big data. Finally, I wanted to add chapters on advanced data visualization, software development, and dynamic report writing.

The book you’re holding is the one that I wished I had so many years ago. I have tried to provide you with a guide to R that will allow you to quickly access the power of this great open source endeavor, without all the frustration and angst. I hope you enjoy it.

P.S. I was offered the job but didn’t take it. But learning R has taken my career in directions that I could never have anticipated. Life can be funny.
acknowledgments

A number of people worked hard to make this a better book. They include
■ Marjan Bace, Manning’s publisher, who asked me to write this book in the first place.
■ Sebastian Stirling and Jennifer Stout, development editors on the first and second editions, respectively. Each spent many hours helping me organize the material, clarify concepts, and generally make the text more interesting.
■ Pablo Domínguez Vaselli, technical proofreader, who helped uncover areas of confusion and provided an independent and expert eye for testing code. I came to rely on his vast knowledge, careful reviews, and considered judgment.
■ Olivia Booth, the review editor, who helped obtain reviewers and coordinate the review process.
■ Mary Piergies, who helped shepherd this book through the production process, and her team of Tiffany Taylor, Toma Mulligan, Janet Vail, David Novak, and Marija Tudor.
■ The peer reviewers who spent hours of their own time carefully reading through the material, finding typos, and making valuable substantive suggestions: Bryce Darling, Christian Theil Have, Cris Weber, Deepak Vohra, Dwight Barry, George Gaines, Indrajit Sen Gupta, Dr. L. Duleep Kumar Samuel, Mahesh Srinivason, Marc Paradis, Peter Rabinovitch, Ravishankar Rajagopalan, Samuel Dale McQuillin, and Zekai Otles.
■ The many Manning Early Access Program (MEAP) participants who bought the book before it was finished, asked great questions, pointed out errors, and made helpful suggestions.