The New Statistics with R
The New Statistics with R
An Introduction for Biologists

ANDY HECTOR
Professor of Ecology
Department of Plant Sciences
University of Oxford

The New Statistics with R. Andy Hector. © Andy Hector 2015. Published 2015 by Oxford University Press.
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© Andy Hector 2015

The moral rights of the author have been asserted

Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2014949047

ISBN 978–0–19–872905–1 (hbk.)
ISBN 978–0–19–872906–8 (pbk.)

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
I dedicate this book to the memory of Christine Müller.
Acknowledgements

First, I would like to thank Drew Purves, Steve Emmett, and their groups at Microsoft Research in Cambridge, as I made a substantial start to this book while on sabbatical as a visiting researcher in the computational ecology group there at the end of 2011.

Several people were instrumental in helping cultivate my interest in statistical analysis. I was first introduced to experiments during my final-year project with Phil Grime and colleagues at the Unit of Comparative Plant Ecology at Sheffield University. Shortly afterwards, one of the most rewarding parts of my PhD at Imperial College was learning statistics (and GLIM) from Mick Crawley. Bernhard Schmid shared this interest and enthusiasm and taught me a lot while I was a post-doc on the BIODEPTH project and later when we worked together at the Institute for Environmental Sciences at the University of Zurich (sorry for forsaking GenStat for R Bernhard!). I benefitted from discussions with several statisticians during training courses or after visiting lectures including Douglas Bates, Martin Mächler, John Nelder, José Pinheiro, Bill Venables, and Hadley Wickham.

Many PhD students and post-docs helped me delve further into statistics with R, including some of the material covered in this book (and commented on draft chapters). I would like to thank all current and past group members, but particularly Robi Bagchi, Juliette Chamagne, Stefanie von Felten, Yann Hautier, Mikey O’Brien, Chris Philipson, Matteo Tanadini, and Sean Tuck.

The content of this book is based on teaching materials developed at the University of Zurich and the University of Oxford where I currently teach
much of the statistical content for the Quantitative Methods for Biology course—my thanks to the course participants at both institutions and to the Oxford QM tutors, particularly Yvonne Griffiths for her fine-tooth comb!

I learned a lot from collaborating on papers on statistical analysis with several colleagues including Tom Bell, Jarrett Byrnes, John Connelly, Forest Isbell, Marc Kéry, Michel Loreau, Owen Petchey, and Alain Zuur. Thanks to Ben Bolker and Vincent Calcagno for discussions on GLMMs and multimodel inference. I would also like to thank Maja Weilenmann and especially Lindsay Turnbull. Thanks to Lucy and Ian at OUP. This project would not have been possible without the generous work of the many people who have helped develop R.

Finally, thank you—and sorry—to anyone who has slipped my mind as I rush to meet the book deadline!
1 Introduction

Unlikely as it may seem, statistics is currently a sexy subject. Nate Silver’s success in out-predicting the political pundits in the last US election drew high-profile press coverage across the globe. Statistics may not remain sexy but it will always be useful. It is a key component in the scientific toolbox and one of the main ways we have of describing the natural world and of finding out how it works. In most areas of science, statistics is essential. In some ways this is an odd state of affairs. Mathematical statisticians generally don’t require skills from other areas of science in the same way that we scientists need skills from their domain. We have to learn some statistics in addition to our core area of scientific interest. Obviously there are limits to how far most of us can go. This book is intended to introduce some of the most useful applied statistical analyses to researchers, particularly in the life and environmental sciences.

1.1 The aim of this book

My aim is to get across the essence of the statistical ideas necessary to intelligently apply linear models (and some of their extensions) within relevant areas of the life and environmental sciences. I hope it will be of use to students at both undergraduate and post-graduate level and researchers interested in learning more about statistics (or in switching to the software package used here). The approach is therefore not mathematical. I have minimized the number of equations—they are in numerous statistics textbooks and on the internet if you want them—and the
statistical concepts and theory are explained in boxes to try to avoid disrupting the flow of the main text. I have also kept citations to a minimum and concentrated them in the text boxes and final chapter (there is no Bibliography). Instead, the approach is to learn by doing through the analysis of real data sets. That means using a statistical software package, in this case the R programming language for statistics and graphics (for the reasons given below). It also requires data. In fact, most of us only start to take an interest in statistics once we have (or know we soon will have) data. In most science degrees that comes late in the day, making the teaching of introductory statistics more challenging. Students studying for research degrees (Masters and PhDs) are generally much more motivated to learn statistics. The next best thing to working with our own data is to work with some carefully selected examples from the literature. I have used some data from my own research but I have mainly tried to find small, relevant data sets that have been analysed in an interesting way. Most of them are from the life and environmental sciences (including ecology and evolution). I am very grateful to all of the people who have helped collect these data and to develop the analyses. For convenience I have tried to use data sets that are already available within the R software (the data sets are listed at the end of the book and described in the relevant chapter).

1.2 The R programming language for statistics and graphics

R is now the principal software for statistics, graphics, and programming in many areas of science, both within academia and outside (many large companies use R). There are several reasons for this, including:

• R is a product of the statistical community: it is written by the experts.
• R is free: it costs nothing to download and use, facilitating collaboration.
• R is multiplatform: versions exist for Windows, Mac, and Unix.
• R is open-source software that can be easily extended by the R community.
• R is statistical software, a graphics package, and a programming language all in one.

1.3 Scope

Statistics can sometimes seem like a huge, bewildering, and intimidating collection of tests. To avoid this I have chosen to focus on the linear model framework as the single most useful part of statistics (at least for researchers in the environmental and life sciences). The book starts by introducing several different variations of the basic linear model analysis (analysis of variance, linear regression, analysis of covariance, etc.). I then introduce two extensions: generalized linear models (GLMs) (for data with non-normal distributions) and mixed-effects models (for data with multiple levels and hierarchical structure). The book ends by combining these two extensions into generalized linear mixed-effects models. The advantage of following the linear model approach (and these extensions) is that a wide range of different types of data and experimental designs can be analysed with very similar approaches. In particular, all of the analyses covered in this book can be performed in R using only three main classes of function: one for linear models (the lm() function), one for GLMs (the glm() function), and one for mixed-effects models (the lmer() and glmer() functions).

1.4 What is not covered

Statistics is a huge subject, so lack of space obviously precluded the inclusion of many topics in this book. I also deliberately left some things out. Many biological applications like bioinformatics are not covered. For reasons of space, the coverage is limited to linear models and GLMs, with nothing on non-linear regression approaches nor additive models (generalized additive
models, GAMs). Because of the focus on an estimation-based approach I have not included non-parametric statistics. Experimental design is covered briefly and integrated into the relevant chapters. Information theory and information criteria are briefly introduced, but the relatively new and developing area of multimodel inference turned out to be largely beyond the scope of this book. Introducing Bayesian statistics is also a book-length project in its own right.

1.5 The approach

There are several different general approaches within statistics (frequentist, Bayesian, information theory, etc.) and there are many subspecies within these schools of thought. Most of the methods included in this book are usually described as belonging to ‘classical frequentist statistics’. However, this approach, and the probability values that are so widely used within it, has come under increasing criticism. In particular, statisticians often accuse scientists of focusing far too much on P-values and not enough on effect sizes. This is strange, as the effect sizes—the estimates and intervals—are directly related to what we measure during our research. I don’t know any scientist who studies P-values! For that reason I have tried to take an estimation-based approach that focuses on estimates and confidence intervals wherever possible. Styles of analysis vary (and fashions change over time). Because of this I will be frank about some of my personal preferences used in this book. In addition to making wide use of estimates and intervals I have also tried to emphasize the use of graphs for exploring data and presenting results. I have tried to encourage the use of a priori contrasts (comparisons that were planned in advance) and I avoid the use of corrections for multiple comparisons (and discourage their use in many cases). The most complex approaches in the book are the mixed-effects models.
Here I have stuck closely to the approaches advocated by the software writers (and their own books). Finally, at the end of each chapter I try to summarize both the statistical approach and what it has enabled us to learn about the science of each example. It is easy to get lost
in statistics, but for non-statisticians the analysis should not become an end in its own right, only a method to help advance our science.

1.6 The new statistics?

What is the ‘new’ statistics of the title? The term is not clearly defined but it appears to be used to cover both brand new techniques (e.g. meta-analysis, an approach beyond the scope of this book—I recommend the 2013 book by Julia Koricheva and colleagues, Handbook of meta-analysis in ecology and evolution) and a fresh approach to long-established methods. I use the term to refer to two things. First, the book covers some relatively new methods in statistics, including modern mixed-effects models (and their generalized linear mixed-effects model extensions) and the use of information criteria and multimodel inference. Second, the new statistics also includes a back-to-basics estimation-based approach that takes account of the recent criticisms of P-values and puts greater emphasis on estimates and intervals for statistical inference.

1.7 Getting started

To allow a learning-by-doing approach the R code necessary to perform the basic analysis is embedded in the text along with the key output from R (the full R scripts will be available as support material from the R café at <http://www.plantecol.org/>). Some readers may be completely new to R, but many will have some familiarity with it. Rather than start with an introduction to R we will dive straight into the first example of a linear model analysis. However, a brief introduction to R is provided at the end of the book and newcomers to the software will need to start there.
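Before the first worked example, the function classes named in Section 1.3 and the estimates-and-intervals emphasis of Sections 1.5 and 1.6 can be sketched in a few lines. This is a minimal illustration rather than an analysis from the book: the PlantGrowth and warpbreaks data sets ship with R and are assumptions chosen only for convenience, and the mixed-effects call is guarded because lmer() comes from the add-on package lme4, which may not be installed.

```r
# Illustrative only: PlantGrowth and warpbreaks ship with R and are not
# examples taken from the book.

# Linear model: plant weight compared across three treatment groups.
linear_fit <- lm(weight ~ group, data = PlantGrowth)
estimates <- coef(linear_fit)      # point estimates
intervals <- confint(linear_fit)   # 95% confidence intervals by default
print(cbind(estimate = estimates, intervals))

# Generalized linear model: counts of warp breaks with a Poisson error
# distribution.
poisson_fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
print(coef(poisson_fit))

# Mixed-effects model: lmer() lives in the add-on package lme4, so this part
# is skipped when that package is not installed.
if (requireNamespace("lme4", quietly = TRUE)) {
  data("sleepstudy", package = "lme4")
  mixed_fit <- lme4::lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
  print(lme4::fixef(mixed_fit))
}
```

All three calls share the same formula interface (response ~ explanatory variables), which is what allows very different data and designs to be analysed in much the same way.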
2 Comparing Groups: Analysis of Variance

2.1 Introduction

Inbreeding depression is an important issue in the conservation of species that have lost genetic diversity due to a decline in their populations as a result of over-exploitation, habitat fragmentation, or other causes. We begin with some data on this topic collected by Charles Darwin. In The effects of cross and self-fertilisation in the vegetable kingdom, published in 1876, Darwin describes how he produced seeds of maize (Zea mays) that were fertilized with pollen from the same individual or from a different plant. Pairs of seeds taken from self-fertilized and cross-pollinated plants were then germinated in pots and the height of the young seedlings measured as a surrogate for their evolutionary fitness. Darwin wanted to know whether inbreeding reduced the fitness of the selfed plants. Darwin asked his cousin Francis Galton—a polymath and early statistician famous for ‘regression to the mean’ (not to mention the silent dog whistle!)—for advice on the analysis. At that time, Galton could only lament that, ‘The determination of the variability . . . is a problem of more delicacy than that of determining the means, and I doubt, after making many trials whether it is possible to derive useful conclusions from these few observations. We ought to have measurements of at least fifty plants in each case’. Luckily we can now address this question using any one of several closely related
linear model analyses. In this chapter we will use the analysis of variance (ANOVA) originally developed by Ronald Fisher (Box 2.1) and in Chapter 3 Student’s t-test.

Box 2.1: Ronald A. Fisher

While Sir Ronald Fisher is one of the biggest names in the history of statistics he was employed for most of his career as a geneticist, a field in which he is held in similarly high esteem. Fisher developed ANOVA when working as a statistician at Rothamsted, an agricultural research station that is home to the famous Park Grass experiment, which was established in 1856 and has become the world’s longest running ecological study. ANOVA was developed for the analysis of the experimental field data collected at Rothamsted, hence the jargon of plots and blocks to reflect the way these experiments were laid out in the Rothamsted fields. As an undergraduate at Cambridge, Fisher also published the more general concept of maximum likelihood that we will meet in later chapters.

The focus of this book is statistical analysis using R, not the R programming language itself (see the Appendix). R is therefore introduced relatively briefly, and if you are completely new to the language you will need to read the introductory material in the Appendix first and refer back to it as needed, together with the R help files. You should also explore the wealth of other information on R recommended there and available via the web.

In R we can get Darwin’s data (Box 2.2) from an add-on package called SMPracticals after installing it from the Comprehensive R Archive Network website (together with any other packages it is dependent on) and activating it using the library() function.
Notice the use of the hash symbol to add comments to the R code to help guide others through your analysis and remind yourself what the R script does:

> install.packages("SMPracticals", dependencies = TRUE)
> # install package from CRAN website (the package name must be quoted)
> library(SMPracticals) # activates package for use in R
> darwin # shows the data on screen
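Because install.packages() contacts CRAN every time it is run, a script that is rerun often may prefer to install only when the package is missing. The helper below is a hypothetical convenience, not something taken from the book; ensure_package is an invented name, and the sketch relies only on the base R functions requireNamespace() and install.packages().

```r
# Hypothetical helper (not from the book): install a CRAN package only if it
# cannot already be loaded, then report whether it is now available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers a download.
ensure_package("stats")

# For the Darwin data the equivalent call would be the line below; it is left
# commented out because it contacts CRAN when the package is absent.
# if (ensure_package("SMPracticals")) library(SMPracticals)
```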
Box 2.2: The Darwin data

Here we give Darwin’s data on the effect of cross- and self-pollination on the height (measured in inches to the nearest eighth of an inch and expressed in decimal form) of 30 maize plants as presented in the darwin data frame (R terminology for a data set) from the R SMPracticals package. Crossed and selfed plants were grown in pairs with three to five pairs grown in four pots.

   pot pair  type height
1    I    1 Cross 23.500
2    I    1  Self 17.375
3    I    2 Cross 12.000
4    I    2  Self 20.375
5    I    3 Cross 21.000
6    I    3  Self 20.000
7   II    4 Cross 22.000
8   II    4  Self 20.000
9   II    5 Cross 19.125
10  II    5  Self 18.375
11  II    6 Cross 21.500
12  II    6  Self 18.625
13 III    7 Cross 22.125
14 III    7  Self 18.625
15 III    8 Cross 20.375
16 III    8  Self 15.250
17 III    9 Cross 18.250
18 III    9  Self 16.500
19 III   10 Cross 21.625
20 III   10  Self 18.000
21 III   11 Cross 23.250
22 III   11  Self 16.250
23  IV   12 Cross 21.000
24  IV   12  Self 18.000
25  IV   13 Cross 22.125
26  IV   13  Self 12.750
27  IV   14 Cross 23.000
28  IV   14  Self 15.500
29  IV   15 Cross 12.000
30  IV   15  Self 18.000
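For readers who cannot install SMPracticals, the values in Box 2.2 can be typed in directly. The sketch below rebuilds the table by hand; the object name darwin_manual is hypothetical, and storing pot, pair, and type as factors is an assumption about the packaged data frame rather than something stated in the text.

```r
# Rebuild the Box 2.2 table by hand (values copied from the box).
# Assumption: pot, pair, and type are treated as factors.
darwin_manual <- data.frame(
  pot    = factor(rep(c("I", "II", "III", "IV"), times = c(6, 6, 10, 8))),
  pair   = factor(rep(1:15, each = 2)),
  type   = factor(rep(c("Cross", "Self"), times = 15)),
  height = c(23.500, 17.375, 12.000, 20.375, 21.000, 20.000,
             22.000, 20.000, 19.125, 18.375, 21.500, 18.625,
             22.125, 18.625, 20.375, 15.250, 18.250, 16.500,
             21.625, 18.000, 23.250, 16.250,
             21.000, 18.000, 22.125, 12.750, 23.000, 15.500,
             12.000, 18.000)
)

str(darwin_manual)  # check the structure: 30 rows, 4 columns

# Mean height of each pollination type, a first numerical summary
group_means <- tapply(darwin_manual$height, darwin_manual$type, mean)
print(group_means)
```

The two group means give a first feel for the size of the difference that the rest of the chapter goes on to estimate.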
A good place to start is usually by plotting the data in a way that makes sense in terms of our question—in this case by plotting the data divided into the crossed and selfed groups (Fig. 2.1). R has some graphical functions that come as part of the packages that are automatically installed along with the so-called base R installation when you download it from the CRAN website. However, I am going to take the opportunity to also introduce Hadley Wickham’s ggplot2 (Grammar of Graphics, version 2) package that is widely used throughout this book. While ggplot2 has an all-singing all-dancing ggplot() function it also contains a handy qplot() function for quickly producing relatively simple plots (and which will take you a surprisingly long way). One advantage of this qplot() function is that its syntax is very similar to that of the base R graphics functions and other widely used R graphics packages such as Deepayan Sarkar’s lattice. Luckily, ggplot2 is supported by a comprehensive website and book so it is easy to expand on the brief introduction and explanations given here. If you do not have the ggplot2 package on your computer you can get it by rerunning the install.packages() function given earlier but substituting ggplot2 in place of SMPracticals.

Figure 2.1 The height of Darwin’s maize plants (in inches) plotted as a function of the cross- and self-pollinated treatment types. Notice how easy it is with ggplot2 to distinguish treatments with different symbol types, colours (seen as different shades of grey when colour is not available), or both and how a key is automatically generated. [Plot not reproduced: x-axis Type (Cross, Self); y-axis Height.]

Notice that the qplot() function has a data argument, and one restriction when using ggplot2 is that everything we want to use for
the plot must be in a single data frame (in this case everything we need is present in the darwin data frame but if it were not we would have to create a new data frame that contained everything we want to use for our graphic):

> library(ggplot2) # activate package
> qplot(type, height, data = darwin, shape = type, colour = type) + theme_bw()
> ggsave(filename = "Hector_Fig2-1.pdf") # save graph as pdf

Some of the advantages of ggplot2 over the base graphic functions are immediately obvious in how simple it is to use different symbol shapes, colours, and backgrounds (deleting the theme_bw() command for the black and white background will reveal the default theme_grey() setting that is handy when using colours like yellow that do not show up well against white) together with an automatically generated key with a legend. Even better, notice how easy it is to save the file (in various types, sizes, or rescalings) using the handy ggsave() function.

Figure 2.1 suggests that the mean height may be greater for the crossed plants, which would be consistent with a negative effect of inbreeding. But how confident can we be in this apparent signal of inbreeding depression given the level of noise created by the variation within groups? The variability seems reasonably similar in the two groups except that the crossed group has a low value lying apart from the others with a height of 12—a potential outlier. Actually, as we will see in Fig. 2.2, there are two values plotted on top of one another at this point (see the ggplot2 website and online supplementary R script at <http://www.plantecol.org/> for the use of geom = "jitter" as an additional argument to the qplot() function that deals with this issue). This is typical of many data. The outlying low heights could be due to attack by a pest or disease or because somebody dropped the pot, accidentally damaged the plant, or simply took the measurement incorrectly.
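Overplotting of this kind can also be detected numerically, without redrawing the figure, by counting how often each height occurs. The following is a minimal base R sketch using the crossed-plant heights copied from Box 2.2; the object names are hypothetical.

```r
# Heights of the 15 crossed plants, copied from Box 2.2
cross_heights <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125,
                   20.375, 18.250, 21.625, 23.250, 21.000, 22.125, 23.000,
                   12.000)

# Count how often each distinct height occurs; any count above 1 marks
# points that are drawn on top of one another in an unjittered plot.
height_counts <- table(cross_heights)
overplotted <- height_counts[height_counts > 1]
print(overplotted)
```

The same check flags tied values elsewhere in the crossed group too, which is one reason jittering is useful for small data sets recorded to a coarse precision.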
It is hard to say anything more specific using this eyeball test since the difference between treatment groups is not that dramatic and there is a reasonable degree of variability within groups, not to mention