REGRESSION ANALYSIS BY EXAMPLE USING R
REGRESSION ANALYSIS BY EXAMPLE USING R

Sixth Edition

Ali S. Hadi
The American University in Cairo

Samprit Chatterjee
New York University
Copyright © 2024 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data applied for:
ISBN: 9781119830870 (HB); ePDF: 9781119830887; ePub: 9781119830894

Cover Design: Wiley
Cover Images: © Ali S. Hadi

Set in 11/13.5pt NimbusRomNo9L by Straive, Chennai, India
Dedicated to:
The memory of my parents – A. S. H.
Allegra, Martha, and Rima – S. C.

It's a gift to be simple . . .
    Old Shaker hymn

True knowledge is knowledge of why things are as they are, and not merely what they are.
    Isaiah Berlin
CONTENTS

Preface
About the Companion Website

1 Introduction
  1.1 What Is Regression Analysis?
  1.2 Publicly Available Data Sets
  1.3 Selected Applications of Regression Analysis
    1.3.1 Agricultural Sciences
    1.3.2 Industrial and Labor Relations
    1.3.3 Government
    1.3.4 History
    1.3.5 Environmental Sciences
    1.3.6 Industrial Production
    1.3.7 The Space Shuttle Challenger
    1.3.8 Cost of Health Care
  1.4 Steps in Regression Analysis
    1.4.1 Statement of the Problem
    1.4.2 Selection of Potentially Relevant Variables
    1.4.3 Data Collection
    1.4.4 Model Specification
    1.4.5 Method of Fitting
    1.4.6 Model Fitting
    1.4.7 Model Criticism and Selection
    1.4.8 Objectives of Regression Analysis
  1.5 Scope and Organization of the Book
  Exercises

2 A Brief Introduction to R
  2.1 What Is R and RStudio?
  2.2 Installing R and RStudio
  2.3 Getting Started With R
    2.3.1 Command Level Prompt
    2.3.2 Calculations Using R
    2.3.3 Editing Your R Code
    2.3.4 Best Practice: Object Names in R
  2.4 Data Values and Objects in R
    2.4.1 Types of Data Values in R
    2.4.2 Types (Structures) of Objects in R
    2.4.3 Object Attributes
    2.4.4 Testing (Checking) Object Type
    2.4.5 Changing Object Type
  2.5 R Packages (Libraries)
    2.5.1 Installing R Packages
    2.5.2 Name Spaces
    2.5.3 Updating R
    2.5.4 Datasets in R Packages
  2.6 Importing (Reading) Data into R Workspace
    2.6.1 Best Practice: Working Directory
    2.6.2 Reading ASCII (Text) Files
    2.6.3 Reading CSV Files
    2.6.4 Reading Excel Files
    2.6.5 Reading Files from the Internet
  2.7 Writing (Exporting) Data to Files
    2.7.1 Diverting Normal R Output to a File
    2.7.2 Saving Graphs in Files
    2.7.3 Exporting Data to Files
  2.8 Some Arithmetic and Other Operators
    2.8.1 Vectors
    2.8.2 Matrix Computations
  2.9 Programming in R
    2.9.1 Best Practice: Script Files
    2.9.2 Some Useful Commands or Functions
    2.9.3 Conditional Execution
    2.9.4 Loops
    2.9.5 Functions and Functionals
    2.9.6 User-Defined Functions
  2.10 Bibliographic Notes
  Exercises

3 Simple Linear Regression
  3.1 Introduction
  3.2 Covariance and Correlation Coefficient
  3.3 Example: Computer Repair Data
  3.4 The Simple Linear Regression Model
  3.5 Parameter Estimation
  3.6 Tests of Hypotheses
  3.7 Confidence Intervals
  3.8 Predictions
  3.9 Measuring the Quality of Fit
  3.10 Regression Line Through the Origin
  3.11 Trivial Regression Models
  3.12 Bibliographic Notes
  Exercises

4 Multiple Linear Regression
  4.1 Introduction
  4.2 Description of the Data and Model
  4.3 Example: Supervisor Performance Data
  4.4 Parameter Estimation
  4.5 Interpretations of Regression Coefficients
  4.6 Centering and Scaling
    4.6.1 Centering and Scaling in Intercept Models
    4.6.2 Scaling in No-Intercept Models
  4.7 Properties of the Least Squares Estimators
  4.8 Multiple Correlation Coefficient
  4.9 Inference for Individual Regression Coefficients
  4.10 Tests of Hypotheses in a Linear Model
    4.10.1 Testing All Regression Coefficients Equal to Zero
    4.10.2 Testing a Subset of Regression Coefficients Equal to Zero
    4.10.3 Testing the Equality of Regression Coefficients
    4.10.4 Estimating and Testing of Regression Parameters Under Constraints
  4.11 Predictions
  4.12 Summary
  Exercises
  Appendix 4.A Multiple Regression in Matrix Notation

5 Regression Diagnostics: Detection of Model Violations
  5.1 Introduction
  5.2 The Standard Regression Assumptions
  5.3 Various Types of Residuals
  5.4 Graphical Methods
  5.5 Graphs Before Fitting a Model
    5.5.1 One-Dimensional Graphs
    5.5.2 Two-Dimensional Graphs
    5.5.3 Rotating Plots
    5.5.4 Dynamic Graphs
  5.6 Graphs After Fitting a Model
  5.7 Checking Linearity and Normality Assumptions
  5.8 Leverage, Influence, and Outliers
    5.8.1 Outliers in the Response Variable
    5.8.2 Outliers in the Predictors
    5.8.3 Masking and Swamping Problems
  5.9 Measures of Influence
    5.9.1 Cook's Distance
    5.9.2 Welsch and Kuh Measure
    5.9.3 Hadi's Influence Measure
  5.10 The Potential–Residual Plot
  5.11 Regression Diagnostics in R
  5.12 What to Do with the Outliers?
  5.13 Role of Variables in a Regression Equation
    5.13.1 Added-Variable Plot
    5.13.2 Residual Plus Component Plot
  5.14 Effects of an Additional Predictor
  5.15 Robust Regression
  Exercises

6 Qualitative Variables as Predictors
  6.1 Introduction
  6.2 Salary Survey Data
  6.3 Interaction Variables
  6.4 Systems of Regression Equations: Comparing Two Groups
    6.4.1 Models with Different Slopes and Different Intercepts
    6.4.2 Models with Same Slope and Different Intercepts
    6.4.3 Models with Same Intercept and Different Slopes
  6.5 Other Applications of Indicator Variables
  6.6 Seasonality
  6.7 Stability of Regression Parameters Over Time
  Exercises

7 Transformation of Variables
  7.1 Introduction
  7.2 Transformations to Achieve Linearity
  7.3 Bacteria Deaths Due to X-Ray Radiation
    7.3.1 Inadequacy of a Linear Model
    7.3.2 Logarithmic Transformation for Achieving Linearity
  7.4 Transformations to Stabilize Variance
  7.5 Detection of Heteroscedastic Errors
  7.6 Removal of Heteroscedasticity
  7.7 Weighted Least Squares
  7.8 Logarithmic Transformation of Data
  7.9 Power Transformation
  7.10 Summary
  Exercises

8 Weighted Least Squares
  8.1 Introduction
  8.2 Heteroscedastic Models
    8.2.1 Supervisors Data
    8.2.2 College Expense Data
  8.3 Two-Stage Estimation
  8.4 Education Expenditure Data
  8.5 Fitting a Dose–Response Relationship Curve
  Exercises

9 The Problem of Correlated Errors
  9.1 Introduction: Autocorrelation
  9.2 Consumer Expenditure and Money Stock
  9.3 Durbin–Watson Statistic
  9.4 Removal of Autocorrelation by Transformation
  9.5 Iterative Estimation with Autocorrelated Errors
  9.6 Autocorrelation and Missing Variables
  9.7 Analysis of Housing Starts
  9.8 Limitations of the Durbin–Watson Statistic
  9.9 Indicator Variables to Remove Seasonality
  9.10 Regressing Two Time Series
  Exercises

10 Analysis of Collinear Data
  10.1 Introduction
  10.2 Effects of Collinearity on Inference
  10.3 Effects of Collinearity on Forecasting
  10.4 Detection of Collinearity
    10.4.1 Simple Signs of Collinearity
    10.4.2 Variance Inflation Factors
    10.4.3 The Condition Indices
  Exercises

11 Working With Collinear Data
  11.1 Introduction
  11.2 Principal Components
  11.3 Computations Using Principal Components
  11.4 Imposing Constraints
  11.5 Searching for Linear Functions of the 𝛽's
  11.6 Biased Estimation of Regression Coefficients
  11.7 Principal Components Regression
  11.8 Reduction of Collinearity in the Estimation Data
  11.9 Constraints on the Regression Coefficients
  11.10 Principal Components Regression: A Caution
  11.11 Ridge Regression
  11.12 Estimation by the Ridge Method
  11.13 Ridge Regression: Some Remarks
  11.14 Summary
  11.15 Bibliographic Notes
  Exercises
  Appendix 11.A Principal Components
  Appendix 11.B Ridge Regression
  Appendix 11.C Surrogate Ridge Regression

12 Variable Selection Procedures
  12.1 Introduction
  12.2 Formulation of the Problem
  12.3 Consequences of Variables Deletion
  12.4 Uses of Regression Equations
    12.4.1 Description and Model Building
    12.4.2 Estimation and Prediction
    12.4.3 Control
  12.5 Criteria for Evaluating Equations
    12.5.1 Residual Mean Square
    12.5.2 Mallows Cp
    12.5.3 Information Criteria
  12.6 Collinearity and Variable Selection
  12.7 Evaluating All Possible Equations
  12.8 Variable Selection Procedures
    12.8.1 Forward Selection Procedure
    12.8.2 Backward Elimination Procedure
    12.8.3 Stepwise Method
  12.9 General Remarks on Variable Selection Methods
  12.10 A Study of Supervisor Performance
  12.11 Variable Selection with Collinear Data
  12.12 The Homicide Data
  12.13 Variable Selection Using Ridge Regression
  12.14 Selection of Variables in an Air Pollution Study
  12.15 A Possible Strategy for Fitting Regression Models
  12.16 Bibliographic Notes
  Exercises
  Appendix 12.A Effects of Incorrect Model Specifications

13 Logistic Regression
  13.1 Introduction
  13.2 Modeling Qualitative Data
  13.3 The Logit Model
  13.4 Example: Estimating Probability of Bankruptcies
  13.5 Logistic Regression Diagnostics
  13.6 Determination of Variables to Retain
  13.7 Judging the Fit of a Logistic Regression
  13.8 The Multinomial Logit Model
    13.8.1 Multinomial Logistic Regression
    13.8.2 Example: Determining Chemical Diabetes
    13.8.3 Ordinal Logistic Regression
    13.8.4 Example: Determining Chemical Diabetes Revisited
  13.9 Classification Problem: Another Approach
  Exercises

14 Further Topics
  14.1 Introduction
  14.2 Generalized Linear Model
  14.3 Poisson Regression Model
  14.4 Introduction of New Drugs
  14.5 Robust Regression
  14.6 Fitting a Quadratic Model
  14.7 Distribution of PCB in U.S. Bays
  Exercises

References
Index
PREFACE

I have been feeling a great sense of sadness while working alone on this edition of the book after Professor Samprit Chatterjee, my longtime teacher, mentor, friend, and co-author, passed away in April 2021. Our first paper was published in 1986 (Chatterjee and Hadi, 1986). Samprit and I also co-authored our 1988 book (Chatterjee and Hadi, 1988) as well as several other papers. My sincere condolences to his family and friends. May God rest his soul in peace.

Regression analysis has become one of the most widely used statistical tools for analyzing multifactor data. It is appealing because it provides a conceptually simple method for investigating functional relationships among variables. The standard approach in regression analysis is to take data, fit a model, and then evaluate the fit using statistics such as t, F, R², and the Durbin–Watson test. Our approach is broader. We view regression analysis as a set of data-analytic techniques that examine the interrelationships among a given set of variables. The emphasis is not on formal statistical tests and probability calculations; we argue for an informal analysis directed toward uncovering patterns in the data.

We have attempted to write a book for readers with diverse backgrounds, and we have tried to put the emphasis on the art of data analysis rather than on the development of statistical theory. The material presented is intended for anyone who is involved in analyzing data. The book should be helpful to those who have some knowledge of the basic concepts of statistics. In the university, it could be used as a text
for a course on regression analysis for students whose specialization is not statistics but who nevertheless use regression analysis extensively in their work. For students whose major emphasis is statistics, and who take a course on regression analysis from a book at the level of Rao (1973), Seber (1977), or Sen and Srivastava (1990), this book can be used to balance and complement the theoretical aspects of the subject with practical applications. Outside the university, this book can be profitably used by those whose present approach to analyzing multifactor data consists of looking at standard computer output (t, F, R², standard errors, etc.), but who want to go beyond these summaries for a more thorough analysis.

We utilize most standard and some not-so-standard summary statistics on the basis of their intuitive appeal. We rely heavily on graphical representations of the data and employ many variations of plots of regression residuals. We are not overly concerned with precise probability evaluations. Graphical methods for exploring residuals can suggest model deficiencies or point to troublesome observations. Upon further investigation into their origin, the troublesome observations often turn out to be more informative than the well-behaved observations. We often notice that more information is obtained from a quick examination of a plot of residuals than from a formal test of statistical significance of some limited null hypothesis. In short, the presentation in the chapters of this book is guided by the principles and concepts of exploratory data analysis.

As we mentioned in previous editions, the statistical community has been most supportive, and we have benefitted greatly from their suggestions in improving the text. Our presentation of the various concepts and techniques of regression analysis relies on carefully developed examples. In each example, we have isolated one or two techniques and discussed them in some detail.
The data were chosen to highlight the techniques being presented. Although when analyzing a given set of data it is usually necessary to employ many techniques, we have tried to choose the various data sets so that it would not be necessary to discuss the same technique more than once. Our hope is that after working through the book, the reader will be ready and able to analyze their data methodically, thoroughly, and confidently.

The emphasis in this book is on the analysis of data rather than on plugging numbers into formulas, tests of hypotheses, or confidence intervals. Therefore, no attempt has been made to derive the techniques. Techniques are described, the required assumptions are given and, finally, the success of the technique in the particular example is assessed. Although derivations of the techniques are not included, we have tried to refer the reader in each case to sources in which such discussion is available. Our hope is that
some of these sources will be followed up by the reader who wants a more thorough grounding in theory.

Recently there has been a qualitative change in the analysis of linear models: from model fitting to model building, from overall tests to clinical examinations of data, from macroscopic to microscopic analysis. To do this kind of analysis a computer is essential and, in previous editions, we assumed its availability but, to keep the book available to a wider community, did not wish to endorse or associate it with any of the commercially available statistical packages. We are particularly heartened by the arrival of the language R, which is available on the Internet under the General Public License (GPL). The language has excellent computing and graphical features. It is also free! For these and other reasons, I decided to introduce and use R in this edition of the book to enable readers to use R on their own datasets and reproduce the various types of graphs and analyses presented in this book. Although a knowledge of R would certainly be helpful, no prior knowledge of R is assumed.

Major changes have been made in streamlining the text, removing ambiguities, and correcting errors pointed out by readers and others detected by the authors. Chapter 2 is new in this edition. It gives a brief but, we believe, sufficient introduction to R that enables readers to use R to carry out the regression analysis computations as well as the graphical displays presented in this edition of the book. To help readers, we provide all the necessary R code in the new Chapter 2 and throughout the rest of the chapters. Section 5.11, about regression diagnostics in R, is new. New references have also been added. The index at the end of the book has been enhanced. The addition of the new chapter increased the number of pages.
To offset this increase, data tables that are larger than 10 rows are deleted from the book because the reader can obtain them in digital form from the Book's Website at http://www.aucegypt.edu/faculty/hadi/RABE6. This Website contains, among other things, all the data sets that are included in this book, the R code that is used to produce the graphs and tables in this book, and more. Also, the use of R enabled us to delete the statistical tables in the appendix because the reader can now use R to compute the p-values as well as the critical values of test statistics for any desired significance level, not just the customary ones such as 0.1, 0.05, and 0.01.

We have rewritten some of the exercises and added new ones at the end of the chapters. We feel that the exercises reinforce the understanding of the material in the preceding chapters. Also new to accompany this edition, a
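As a brief illustration of that last point (the particular degrees of freedom and test statistics below are our own example, not taken from the book), R's built-in distribution functions make printed t and F tables unnecessary:

```r
# Critical values: the q* (quantile) functions take a cumulative
# probability and the degrees of freedom, so no table lookup is needed.
qt(0.975, df = 24)             # two-sided 5% critical value of t with 24 df (about 2.064)
qf(0.95, df1 = 3, df2 = 26)    # 5% critical value of F(3, 26)

# p-values: the p* (distribution) functions return the cumulative
# probability of an observed test statistic, for any level you like.
2 * pt(-abs(2.31), df = 24)                       # two-sided p-value for t = 2.31
pf(4.87, df1 = 3, df2 = 26, lower.tail = FALSE)   # upper-tail p-value for F = 4.87
```

The same pattern (q* for quantiles, p* for probabilities) works for the normal, chi-squared, and the other distributions used throughout the book.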
Solutions Manual and PowerPoint files are available only for instructors by contacting the authors at ahadi@aucegypt.edu or ali-hadi@cornell.edu.

Previous editions of this book have been translated into Persian, Korean, and Chinese. We are grateful to the translators Prof. H. A. Niromand, Prof. Zhongguo Zheng, Prof. Kee Young Kim, Prof. Myoungshic Jhun, Prof. Hyuncheol Kang, and Prof. Seong Keon Lee.

We are fortunate to have had assistance and encouragement from several friends, colleagues, and associates. Some of our colleagues and students at New York University, Cornell University, and The American University in Cairo have used portions of the material in their courses and have shared with us their comments and the comments of their students. Special thanks are due to our friend and former colleague Jeffrey Simonoff (New York University) for comments, suggestions, and general help. The students in our classes on regression analysis have all contributed by asking penetrating questions and demanding meaningful and understandable answers. Our special thanks go to Nedret Billor (Cukurova University, Turkey) and Sahar El-Sheneity (Cornell University) for their very careful reading of an earlier edition of this book. We also appreciate the comments provided by Habibollah Esmaily, Hassan Doosti, Fengkai Yang, Mamunur Rashid, Saeed Hajebi, Zheng Zhongguo, Robert W. Hayden, Marie Duggan, Sungho Lee, Hock Lin (Andy) Tai, and Junchang Ju. We also thank Lamia Abdellatif for proofreading parts of this edition, Dimple Philip for preparing the LaTeX style files and the corresponding PDF version, Dean Gonzalez for helping with the production of some of the figures, and Michael New for helping with the front and back covers.

ALI S. HADI
Cairo, Egypt
September 2023
ABOUT THE COMPANION WEBSITE

This book is accompanied by a companion website:

www.wiley.com/go/hadi/regression_analysis_6e

This website includes:
• Table of contents
• Preface
• Book cover
• Places where you can purchase the book
• Data sets
• Stata, SAS or SPSS users
• R users
• Errata/Comments/Feedback
• Solutions to Exercises