Author: Andy Hector

Supporting website: https://www.plantecol.org/contemporary-analysis-for-ecology/

Statistical methods are a key tool for all scientists working with data, but learning the basics continues to challenge successive generations of students. This accessible textbook provides an up-to-date introduction to the classical techniques and modern extensions of linear model analysis, one of the most useful approaches for investigating scientific data in the life and environmental sciences. While some of the foundational analyses (e.g. t tests, regression, ANOVA) are as useful now as ever, best practice moves on and there are many new general developments that offer great potential. The book emphasizes an estimation-based approach that takes account of recent criticisms of over-use of probability values and introduces the alternative approach that uses information criteria. This new edition includes the latest advances in R and related software and has been thoroughly “road-tested” over the last decade to create a proven textbook that teaches linear and generalized linear model analysis to students of ecology, evolution, and environmental studies (including worked analyses of data sets relevant to all three disciplines). While R is used throughout, the focus remains firmly on statistical analysis.

ISBN: 0198798180
Publisher: Oxford University Press
Publish Year: 2021
Language: English
Pages: 277
File Format: PDF
File Size: 12.6 MB
Text Preview (First 20 pages)
OUP CORRECTED PROOF – FINAL, 11/5/2021, SPi The New Statistics with R
The New Statistics with R
An Introduction for Biologists
Second Edition
ANDY HECTOR
Department of Plant Sciences and Linacre College, University of Oxford, UK
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.

© Andy Hector 2021
The moral rights of the author have been asserted.
First Edition published in 2015
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data: Data available
Library of Congress Control Number: 2021931174
ISBN 978–0–19–879817–0 (hbk.)
ISBN 978–0–19–879818–7 (pbk.)
DOI: 10.1093/oso/9780198798170.001.0001

Printed in Great Britain by Bell & Bain Ltd., Glasgow

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
I dedicated the first edition of this book to the memory of Christine Müller. This new edition is dedicated to Lindsay and Rowan.
Acknowledgements

The original version of this book was begun at the end of 2011 while I was on sabbatical as a visiting researcher in the computational ecology group at Microsoft Research in Cambridge; my thanks to Drew Purves and colleagues for their support. This second edition was partly written during my sabbatical in 2019/20, sadly largely under COVID-19 restrictions. However, before lockdown I made some important progress during stays at Obertschappina (thanks Roland and Petra) and on a visit to the Cedar Creek Ecosystem Science Reserve, for which I thank Forest Isbell, Dave Tilman, and the amazing group of ecologists at the University of Minnesota.

Several people were instrumental in helping cultivate my initial interest in statistical analysis. I was first introduced to experiments during my final-year project with Phil Grime and colleagues at the UCPE at Sheffield University. Shortly afterwards, one of the most rewarding parts of my PhD at Imperial College was learning statistics (and the GLIM software) from Mick Crawley. Bernhard Schmid shared this interest and enthusiasm and taught me a lot while I was a postdoc on the BIODEPTH project and, later, when we worked together at the Institute for Environmental Sciences at the University of Zurich (sorry for forsaking Genstat for R, Bernhard!). Here in Oxford I have continued to discuss and learn about statistics partly through the generosity of Geoff Nicholls. I have also benefited from sometimes brief but important discussions with several other statisticians during training courses, after visiting talks, and the like, including Douglas Bates, Andrew Gelman (over a game of Quincunx), Martin Maechler, Peter McCullagh, John Nelder, José Pinheiro, Bill Venables, and Hadley Wickham. My apologies to them for any misunderstandings that make it into this book.

Many group members helped me delve further into statistics with R, including some of the material covered in this book. I would like to thank all current and past group members, but particularly Robi Bagchi, Stefanie von Felten, Yann Hautier, Charlie Marsh, Chris Philipson, Matteo Tanadini, Sean Tuck, Maja Weilenmann, and Mikey O’Brien. I have also learned a lot from collaborating on papers on statistics with several colleagues, including Tom Bell, Jarrett Byrnes, John Connelly, Laura Dee, Forest Isbell, Marc Kéry, Michel Loreau, and Alain Zuur.

The content of this book is based on teaching materials developed over the last two decades at Imperial College, the University of Zurich, and here at Oxford, where I teach statistics at the Bachelor, Masters, and PhD levels. Thanks to everyone involved, particularly the many demonstrators (TAs). Many people helped find errors in the first edition of this book; I have tried to correct them and acknowledge the spotters at the R café website (no doubt there will be more to add for this second edition). In particular, my thanks to Ben Bolker for his constructive criticism of the first edition of this book.

At OUP, thanks go to Ian, Lucy, Bethany, and Charlie for making this book and this second edition possible. Thanks also to Douglas Meekison, who skilfully copyedited the manuscript, and to Sumintra Gaur, who has been project manager for this book. Finally, thank you (and sorry) to anyone who has slipped my mind as I rush again to meet the book deadline!

Andy Hector, Oxford, October 2020.
Contents

Chapter 1: Introduction
  1.1 Introduction to the second edition
  1.2 The aim of this book
  1.3 Changes in the second edition
  1.4 The R programming language for statistics and graphics
  1.5 Scope
  1.6 What is not covered
  1.7 The approach
  1.8 The new statistics?
  1.9 Getting started
  1.10 References

Chapter 2: Motivation
  2.1 A matter of life and death
  2.2 Summary: Statistics
  2.3 Summary: R
  2.4 References

Chapter 3: Description
  3.1 Introduction
  3.2 Darwin’s maize pollination data
  3.3 Summary: Statistics
  3.4 Summary: R
  3.5 References

Chapter 4: Reproducible Research
  4.1 The reproducibility crisis
  4.2 R scripts
  4.3 Analysis notebooks
  4.4 R Markdown
  4.5 Summary: Statistics
  4.6 Summary: R
  4.7 References

Chapter 5: Estimation
  5.1 Introduction
  5.2 Quick tests
  5.3 Differences between groups
  5.4 Standard deviations and standard errors
  5.5 The normal distribution and the central limit theorem
  5.6 Confidence intervals
  5.7 Summary: Statistics
  5.8 Summary: R
  Appendix 5a: R code for Fig. 5.1

Chapter 6: Linear Models
  6.1 Introduction
  6.2 A linear-model analysis for comparing groups
  6.3 Standard error of the difference
  6.4 Confidence intervals
  6.5 Answering Darwin’s question
  6.6 Relevelling to get the other treatment mean and standard error
  6.7 Assumption checking
  6.8 Summary: Statistics
  6.9 Summary: R
  6.10 Reference
  Appendix 6a: R graphics
  Appendix 6b: Robust linear models
  Appendix 6c: Exercise

Chapter 7: Regression
  7.1 Introduction
  7.2 Linear regression
  7.3 The Janka timber hardness data
  7.4 Correlation
  7.5 Linear regression in R
  7.6 Assumptions
  7.7 Summary: Statistics
  7.8 Summary: R
  7.9 Reference
  Appendix 7a: R graphics
  Appendix 7b: Least squares linear regression

Chapter 8: Prediction
  8.1 Introduction
  8.2 Predicting timber hardness from wood density
  8.3 Confidence intervals and prediction intervals
  8.4 Summary: Statistics
  8.5 Summary: R

Chapter 9: Testing
  9.1 Significance testing: Time for t
  9.2 Student’s t-test: Darwin’s maize
  9.3 Summary: Statistics
  9.4 Summary: R
  9.5 References

Chapter 10: Intervals
  10.1 Comparisons using estimates and intervals
  10.2 Estimation-based analysis
  10.3 Descriptive statistics
  10.4 Inferential statistics
  10.5 Relating different types of interval and error bar
  10.6 Summary: Statistics
  10.7 Summary: R
  10.8 References

Chapter 11: Analysis of Variance
  11.1 ANOVA tables
  11.2 ANOVA tables: Darwin’s maize
  11.3 Hypothesis testing: F-values
  11.4 Two-way ANOVA
  11.5 Summary
  11.6 Reference

Chapter 12: Factorial Designs
  12.1 Introduction
  12.2 Factorial designs
  12.3 Comparing three or more groups
  12.4 Two-way ANOVA (no interaction)
  12.5 Additive treatment effects
  12.6 Interactions: Factorial ANOVA
  12.7 Summary: Statistics
  12.8 Summary: R
  12.9 References
  Appendix 12a: Code for Fig. 12.3

Chapter 13: Analysis of Covariance
  13.1 ANCOVA
  13.2 The agricultural pollution data
  13.3 ANCOVA with water stress and low-level ozone
  13.4 Interactions in ANCOVA
  13.5 General linear models
  13.6 Summary
  13.7 References

Chapter 14: Linear Model Complexities
  14.1 Introduction
  14.2 Analysis of variance for balanced designs
  14.3 Analysis of variance with unbalanced designs
  14.4 ANOVA tables versus coefficients: When F and t can disagree
  14.5 Marginality of main effects and interactions
  14.6 Summary
  14.7 References

Chapter 15: Generalized Linear Models
  15.1 GLMs
  15.2 The trouble with transformations
  15.3 The Box–Cox power transform
  15.4 Generalized Linear Models in R
  15.5 Summary: Statistics
  15.6 Summary: R
  15.7 References

Chapter 16: GLMs for Count Data
  16.1 Introduction
  16.2 GLMs for count data
  16.3 Quasi-maximum likelihood
  16.4 Summary

Chapter 17: Binomial GLMs
  17.1 Binomial counts and proportion data
  17.2 The beetle data
  17.3 GLM for binomial counts
  17.4 Alternative link functions
  17.5 Summary: Statistics
  17.6 Summary: R
  17.7 Reference

Chapter 18: GLMs for Binary Data
  18.1 Binary data
  18.2 The wells data set for the binary GLM example
  18.3 Centering
  18.4 Summary
  18.5 References

Chapter 19: Conclusions
  19.1 Introduction
  19.2 A binomial GLM analysis of the Challenger binary data
  19.3 Recommendations
  19.4 Where next?
  19.5 Further reading
  19.6 The R café
  19.7 References

Chapter 20: A Very Short Introduction to R
  20.1 Installing R
  20.2 Installing RStudio
  20.3 R packages
  20.4 The R language

Index
1 Introduction

The New Statistics with R: An Introduction for Biologists. Second Edition. Andy Hector, Oxford University Press. © Andy Hector 2021. DOI: 10.1093/oso/9780198798170.003.0001

1.1 Introduction to the second edition

Back in 2015, I opened the introduction to the first edition of this book as follows:

Unlikely as it may seem, statistics is currently a sexy subject. Nate Silver’s success in out-predicting the political pundits in the last US election drew high-profile press coverage across the globe (and his book many readers). Statistics may not remain sexy but it will always be useful. It is a key component in the scientific toolbox and one of the main ways we have of describing the natural world and of finding out how it works. In most areas of science, statistics is essential.

So much has changed over the last five years. Initially, I thought this introduction to the second edition would discuss the subsequent failure of statistics to predict the Brexit referendum and Trump election results. However, I ended up working on this second edition under lockdown due to the COVID-19 pandemic. I’m not sure if statistics is still ‘sexy’ but it is certainly still prominent in our lives. Modelling, much of it statistical, provides predictions of the spread of COVID-19, and sampling is key to estimating fundamental parameters like the reproductive number, denoted (coincidentally) R: the number of people each person with COVID-19 in turn infects.

1.2 The aim of this book

This book is intended to introduce one of the most useful types of statistical analysis to researchers, particularly in the life and environmental sciences: linear models and their generalized-linear-model (GLM) extensions. My aim is to get across the essence of the statistical ideas necessary to intelligently apply and interpret these models in a contemporary (‘new’) way. I hope it will be of use to students at both undergraduate and postgraduate levels and to researchers interested in learning more about statistics (or in switching to the software packages used here, R and RStudio). The approach is therefore not primarily mathematical, and makes limited use of equations; they are easily found in numerous statistics textbooks and on the internet if you want them. I have also kept citations to a minimum and give them at the end of the most relevant chapter (there is no overall bibliography).

The approach is to learn by doing, through the analysis of real data sets. That means using a statistical software package, in this case the R programming language for statistics and graphics (for the reasons given below). It also requires data. In fact, most scientists only start to take an interest in statistics once they have their own data. In most science degrees that comes late in the day, making the teaching of introductory statistics more challenging. Students studying for research degrees (Masters and PhDs) are generally much more motivated to learn statistics since they know it will be essential for the analysis of their data. The next best thing to working with our own data is to work with some carefully selected examples from the literature.
I have used some data from my own research but I have mainly tried to find small, relevant data sets that have been analysed in an interesting way (preferably by a qualified statistician). Most of them are from the life and environmental sciences. I am very grateful to all of the people who have helped collect these data and developed the analyses (they are named in the appropriate chapters as the data and example are introduced). For convenience, I have tried to use data sets that are available within the R software.

1.3 Changes in the second edition

The first edition of this book was written following standard procedure to supply a Word document of the text of each chapter plus files of any figures. This proved an inefficient and error-prone method with all the copy–pasting between R scripts and the word processing file. This second edition has been entirely rewritten using the R Markdown package to produce a PDF file of each chapter along with the TeX file that generates it (as I understand it, subcontractors will then use LaTeX to apply the book format). Writing the second edition like this should be a smarter, more efficient, and hopefully less error-prone way to work.

In the process, the book has changed in many ways. Based on my experience in teaching the Quantitative Methods for Biology course at Oxford, the content has been divided up into a greater number of bite-size topics that will hopefully prove more digestible for students and more useful to teachers. In part because the book was written using the R Markdown package, I now drive R using the RStudio software (it also provides a standard interface on all platforms and lots of other great support materials, like the R cheat sheets). Every chapter has been rewritten but there are also entirely new chapters, one giving an opening motivational example, one on reproducible research (using the R Markdown package), and another on some of the complexities of linear-model analysis that I skipped over in the first edition. There are now separate chapters on GLMs for the analysis of different types of non-normal data. The first edition also contained chapters on mixed-effects and generalized linear mixed-effects models (GLMMs).
These have been dropped from the second edition—partly due to the space limits but also because some reviewers and readers felt that one chapter was just not enough even for a short introduction to mixed-effects models. Furthermore, the example GLMM no longer ran using later versions of the software.
1.4 The R programming language for statistics and graphics

R is now the principal software for statistics, graphics, and programming in many areas of science, both within academia and outside (many large companies use R). There are several reasons for this, including:

• R is a product of the statistical community: it is written by the experts.
• R is free: it costs nothing to download and use, facilitating collaboration.
• R is multiplatform: versions exist for Windows, Mac, and Linux.
• R is open-source software that can be easily extended by the R community.
• R is statistical software, a graphics package, and a programming language all in one (as we’ll see, you can now even produce books, blogs, and websites from R).

1.5 Scope

Statistics can sometimes seem like a huge, bewildering, and intimidating collection of tests. To avoid this I have chosen to focus on the linear-model framework as probably the single most useful part of statistics (at least for researchers in the environmental and life sciences). The book starts by introducing several different variations of the basic linear-model analysis (analysis of variance, linear regression, analysis of covariance, etc.). I then introduce an extension: generalized linear models for data with non-normal distributions. The advantage of following the linear-model approach is that a wide range of different types of data and experimental designs can be analysed with very similar approaches. In particular, all of the analyses covered in this book can be performed in R using only two main functions, one for linear models (the lm() function) and one for GLMs (the glm() function), together with a set of generic functions that extract different aspects of the results (confidence intervals etc.).
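As a rough illustration of the two-function workflow described in Section 1.5, the sketch below fits one linear model and one GLM and then applies the same generic extractor functions to both. The data sets (cars and warpbreaks, both shipped with R) and the object names fit_lm and fit_glm are assumptions chosen here for illustration; they are not the examples analysed in the book.

```r
# Illustrative sketch only: the data sets and object names are assumptions,
# not the worked examples from the book.

# Linear model: stopping distance as a function of speed (built-in cars data).
fit_lm <- lm(dist ~ speed, data = cars)

# GLM: counts of warp breaks, Poisson errors with a log link
# (built-in warpbreaks data).
fit_glm <- glm(breaks ~ wool + tension,
               family = poisson(link = "log"),
               data = warpbreaks)

# The same generic functions extract results from either kind of model.
print(coef(fit_lm))      # point estimates for the linear model
print(coef(fit_glm))     # point estimates for the GLM (on the log scale)
print(confint(fit_lm))   # 95% confidence intervals (the default level)
print(summary(fit_glm))  # coefficient table and deviance summary
print(anova(fit_lm))     # ANOVA table for the linear model
```

Only the fitting call changes between the two models; the code that reads off estimates and intervals stays the same, which is the practical appeal of the framework.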
1.6 What is not covered

This book is primarily about statistics (linear models), not the R software. For that, OUP offers introductory volumes by Beckerman et al. (2017) and Petchey et al. (2021). Statistics is a huge subject, so the limited size of the book precluded the inclusion of many topics, and the coverage is limited to linear models and GLMs. There was no space for non-linear regression approaches or generalized additive models (GAMs). Because of the focus on an estimation-based approach, I have not included non-parametric statistics. Experimental design is covered briefly and integrated into the relevant chapters. The use of information criteria and multimodel inference are briefly introduced. The basics of Bayesian statistics is also a book-length project in its own right (e.g. Korner-Nievergelt et al. 2017).

1.7 The approach

There are several different general approaches within statistics (frequentist, Bayesian, information theory, etc.) and there are many subspecies within these schools of thought. Most of the methods included in this book are usually described as belonging to ‘classical frequentist statistics’. However, this approach, and the probability values that are so widely used within it, has come under increasing criticism. In particular, statisticians often accuse scientists of focusing far too much on P-values and not enough on effect sizes. This is strange, as the effect sizes (the estimates and intervals) are directly related to what we measure during our research. I don’t know any scientists who study P-values! For that reason, I have tried to take an estimation-based approach that focuses on estimates and confidence intervals wherever possible. Styles of analysis vary (and fashions change over time). Because of this, I have tried to be frank about some of my personal preferences used in this book.
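A minimal sketch of what an estimation-based summary can look like in R, assuming the built-in PlantGrowth data set (a choice made here for illustration, not an example taken from the book). The object names are hypothetical. The estimated differences between groups and their intervals are reported in the units of the measurement, rather than the comparison being reduced to a P-value.

```r
# Illustrative sketch only: PlantGrowth and the object names are assumptions.

# Plant weights for a control group and two treatments.
fit <- lm(weight ~ group, data = PlantGrowth)

# With R's default treatment contrasts the intercept is the control mean and
# the other coefficients are differences from the control, in units of weight.
estimates <- coef(fit)
intervals <- confint(fit, level = 0.95)

# Estimates side by side with their 95% confidence limits.
effect_table <- cbind(estimate = estimates, intervals)
print(round(effect_table, 3))
```

Each row of the table gives an effect size together with its uncertainty, which is the information the estimation-based approach asks the reader to interpret.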
In addition to making wide use of estimates and intervals, I have also tried to emphasize the use of graphs for exploring data and presenting results. I have tried to encourage the use