Uploader: 高宏飞
Shared on 2025-12-30

Author: Justin C. Touchon

The statistical analyses that students of the life-sciences are being expected to perform are becoming increasingly advanced. Whether at the undergraduate, graduate, or post-graduate level, this book provides the tools needed to properly analyze your data in an efficient, accessible, plainspoken, frank, and occasionally humorous manner, ensuring that readers come away with the knowledge of which analyses they should use and when they should use them. The book uses the statistical language R, which is the choice of ecologists worldwide and is rapidly becoming the 'go-to' stats program throughout the life-sciences. Furthermore, by using a single, real-world dataset throughout the book, readers are encouraged to become deeply familiar with an imperfect but realistic set of data. Indeed, early chapters are specifically designed to teach basic data manipulation skills and build good habits in preparation for learning more advanced analyses. This approach also demonstrates the importance of viewing data through different lenses, facilitating an easy and natural progression from linear and generalized linear models through to mixed effects versions of those same analyses. Readers will also learn advanced plotting and data-wrangling techniques, and gain an introduction to writing their own functions. Applied Statistics with R is suitable for senior undergraduate and graduate students, professional researchers, and practitioners throughout the life-sciences, whether in the fields of ecology, evolution, environmental studies, or computational biology.

Tags
r
ISBN: 0198869339
Publisher: Oxford University Press
Publish Year: 2021
Language: English
Pages: 336
File Format: PDF
File Size: 8.3 MB
OUP CORRECTED PROOF – FINAL, 11/5/2021, SPi Applied Statistics with R
Applied Statistics with R
A Practical Guide for the Life Sciences

JUSTIN C. TOUCHON
Department of Biology, Vassar College, USA
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.

© Justin C. Touchon 2021

The moral rights of the author have been asserted.

First Edition published in 2021
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.

You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data: Data available
Library of Congress Control Number: 2021934831
ISBN 978-0-19-886997-9 (hbk.)
ISBN 978-0-19-886933-7 (pbk.)
DOI: 10.1093/oso/9780198869979.001.0001

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
For Myra
Preface

Welcome!

The statistical analyses that life scientists are expected to perform are increasingly advanced, and yet most graduate programs in the United States do not even offer a statistics course that teaches beyond Analysis of Variance (ANOVA) and linear regression. Undergraduate and graduate students are thus rarely given the opportunity to learn the types of analyses they need to know in order to publish and compete on the job market, much less simply analyze their data appropriately. Part of the reason for this is that the way statistics is traditionally taught can be frustratingly slow and tedious. When I was a graduate student, I remember excitedly enrolling in a statistics class with the hope of learning how to analyze the data I was collecting each summer in the field. Unfortunately, we spent the entire semester learning how to perform an analysis of variance and a linear regression, by hand. There has to be a better way!

This book is written with the belief that a comprehensive understanding of practical data analyses is not as daunting as it might seem. I have been teaching an annual statistics workshop at the Smithsonian Tropical Research Institute for more than 10 years and I know that my approach works. My teaching perspective is rooted in the idea that instead of spending time mired in statistical theory and learning data analysis by hand, the most important thing to understand is what kind of data you have. Once you know your data, you can then figure out how to analyze them
effectively. Whether at the undergraduate, graduate, or post-graduate level, this book will provide the tools needed to properly analyze your data in an efficient, accessible, plainspoken, frank, and (hopefully) humorous manner, ensuring that readers come away with the knowledge of which analyses they should use and when they should use them.

This book uses the statistical language R, which is the choice of ecologists worldwide and is rapidly becoming the “go-to” stats program throughout the life sciences. The examples in the book are rooted in a single, real dataset (published in the journal Ecology in 2013) and use actual analyses that I have conducted in my professional career as an ecologist. The dataset is admittedly somewhat messy, and early chapters are designed so that students “clean” the raw data as a way of learning basic data manipulation skills and building good habits. Moreover, using a single relatively large dataset (~2500 observations) allows students to get a good understanding of what they are analyzing from chapter to chapter, instead of jumping from one small pre-cleaned dataset to another throughout the book. It also allows readers to see how they can view the same data through different lenses and allows an easy and natural progression from linear and generalized linear models to mixed effects versions of those same analyses, given the hierarchically nested design of the example experiment.

Goals for the book

It is my sincere hope that you find this book useful and instructive. I have tried my hardest to distill everything I know and think about data analysis into these pages. You will undoubtedly find that some of what I suggest differs from what you read elsewhere, either on the web or in other books. Just about everyone these days happens to be rather opinionated, and statisticians and R users are certainly no different.
Wherever possible, I have tried to include the rationale behind my thinking. Since you are reading this book, you evidently want to learn about data analysis. I applaud your initiative and hope to reward you by teaching you how to do just that, efficiently and effectively. Here are the goals of this book.
• I hope to build your familiarity with R from the ground up via the chapters and assignments. Even if you have some experience with R, you will likely learn new ways to approach your data. If you are relatively new to R, I hope the hands-on experience of typing along with the instructions will help you overcome “fear of the R prompt.”
• I want to empower you to not only follow instructions carefully and analyze the data presented in these chapters, but hopefully to be able to analyze your own data and to think critically about data when you see them presented in research and in the public realm. As you may already know, science literacy is seriously lacking in the public sphere, and increasing the number of people who can think critically about data presented in the news or elsewhere is extremely important.
• Lastly, I hope you can become a part of the global R community. R is so big that there is no single repository of information about it, nor is there a single manual that contains all the possible instructions you might need to execute. Thus, in addition to books like this one, you will need to become familiar with using the web to find answers to questions. I will provide examples in the later chapters of how you might seek out information to help yourself when (not if, mind you, but when) you get stuck or encounter an error.

Basic layout of the book

The materials presented in these chapters are set up as follows. There are ten topics, each an explanatory chapter which will allow you to teach yourself the code. I cannot stress enough that you really do want to type things in and you need to think about what the code means and what it is doing if you want to learn this stuff. If you have an electronic copy of the book, avoid any temptation to cut and paste. If you are reading this, you are interested in learning R, right? Trust me, if you cut and paste code you will not learn as well as if you type it in by hand.
The ten topics are:

1 Introduction to R
2 Before You Begin (aka Thoughts on Proper Data Analysis)
3 Exploratory Data Analysis
4 The Basics of Plotting
5 Basic Statistical Analyses using R
6 More Linear Models!
7 Generalized Linear Models
8 Linear and Generalized Linear Mixed Effects Models
9 Data Wrangling and Advanced Plotting with the tidyverse
10 Writing Loops and Functions in R

Just a note about how each of the chapters will be formatted. Bits of code that you can/should type in are displayed in light grey boxes, and the output from that code is generally displayed directly below it. For example, check out the code below. What is shown in the grey box “(2+2)” is what you would type at the R prompt, and the bit of code below it is the output from executing that command.

    2+2
    ## [1] 4

In general, if you type in exactly what is in the grey boxes you will get what is shown after it! Amazing, I know. Your mind is already blown, right? The code that will be presented in this book is often written in a relatively “long” format in order to make it more readable. This might not be exactly how you type it into your computer though, which is perfectly fine.

At the end of each chapter is a short set of assignments to give you the opportunity to practice what you have just learned. You can find solutions to the assignments at the GitHub page for the book (https://github.com/jtouchon/Applied-Statistics-with-R) as well as other important information. Since R is an open source language it is likely that some of the code
needed to run the examples in this book may change over time, and I will post code updates on that site.

A little background about R

R is a statistical programming package and a powerful graphics engine. R is considered to be a dialect of the S and S+ language that was created by AT&T Bell Labs. S is commercially available while R is open source and freely available through the Comprehensive R Archive Network (https://cran.r-project.org). R has many advantages besides being freely available. For example, a user might program loops to conduct many repetitive statistical analyses or simulate thousands of data sets with known parameters. In addition, in the fields of Ecology and Evolutionary Biology at least, R is now by far the most commonly used statistical program (see Touchon and McCoy 2016, Ecosphere). There is substantial evidence that similar shifts are occurring in Psychology and Neuroscience as well.

A little about how R works

Because R creates objects from analyses that are stored in its memory, new users are often surprised that the results of their analyses are not immediately displayed on the screen. When you run something successfully, all you generally see is the prompt, which is denoted by the ‘>’ sign. There are several reasons for this. First, R does exactly what you tell it to do. Thus, if you tell it to run an ANOVA and store that output as an object, it does that, but you have to call a separate function to show you the object you created. Second, printing stuff on the screen takes time and computer power. By not showing everything that is going on, R is being very efficient. For example, if you wanted to do 100 regressions on different data sets, R can do this without opening 100 separate windows. One can store only the regression coefficients and display all of them in a single line for comparison. It is this flexibility that makes R a fantastic statistical program.
Also, it’s free. Did I mention that it is free yet?
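To make the point about stored objects concrete, here is a minimal sketch in R. It is not taken from the book: the data are simulated, and the object names (dat, fit, slopes) are made up purely for illustration. It shows that storing an ANOVA prints nothing until you ask for it, and that many regressions can be run while keeping only one coefficient from each.

```r
# Simulated data, invented for illustration only (not the book's dataset)
set.seed(1)
dat <- data.frame(group = rep(c("a", "b", "c"), each = 10),
                  y     = rnorm(30))

# Run an ANOVA and store it as an object: nothing is printed here
fit <- aov(y ~ group, data = dat)

# You have to ask explicitly to see the result
summary(fit)

# Run 100 regressions on 100 simulated data sets with a known slope of 2,
# keeping only the estimated slope from each fit
slopes <- vapply(seq_len(100), function(i) {
  x <- rnorm(20)
  y <- 2 * x + rnorm(20)
  coef(lm(y ~ x))[["x"]]
}, numeric(1))

# All 100 estimates now live in a single vector
round(head(slopes), 2)
```

Because the slopes are stored in an ordinary numeric vector, they can be summarized or plotted like any other data, for example with mean(slopes) or hist(slopes).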
This book provides an introduction to using R in data analyses with practical examples designed to be readily accessible to all life scientists. Although the example dataset I will use is ecological in nature, the parallels will hopefully be easy to see with other disciplines. A more explicit discussion of this is at the end of Chapter 2. R is also a very powerful graphing tool and I will get you started on your way to making publication quality figures.

This book is not a comprehensive overview of all available statistical approaches and methods or experimental design. No single book could do that. I will of course touch on many different topics, but there are over 16,000 packages available to use in R (as of July 2020), a number which is growing by the day, so such an overview is impossible.

Learning R is like learning any language. At times it will be difficult and frustrating, but it is worth it, and if you stick with it you will have breakthroughs that feel amazing (I call these “R-gasms”). Over time, you may grow to love working in R!

There is a quote I love from the musician, actor, author, poet, and all-around amazing human Henry Rollins, which encapsulates a lot of how I think about doing statistical analyses and using R.

    Numbers are perfect, infallible and everlasting. You aren’t. Numbers are always right in the end. You may see an incorrect figure, but that’s not the fault of the number, the fault lies in the person doing the calculating.
    – Henry Rollins, High Adventure in the Great Outdoors

Why do I like that quote so much? It’s because when you get an error in R, it is almost certainly your fault. R didn’t mess up, you did. Sorry, but that’s the honest truth. So check your code! :)

Why learn R?
You might be thinking to yourself “Why do I need to learn R?” or “Seriously, I have to type everything in by hand?!” or “Can’t I do this easier in another program?” There are many answers to these questions.
• If you are an undergraduate thinking of going to graduate school, it is useful for you to learn R because you will almost certainly use R as a graduate student. Thus, you will have a leg up on everyone else! Get started now and be the best you can be.
• Yes, you have to type everything in, but that also helps you learn what you are doing. It is very easy to click some buttons and get an answer that you don’t really understand. If you have to type in the code for the statistics you are doing, you will have a better understanding of what you are doing.
• Having some basic familiarity with “coding” is increasingly useful across a variety of disciplines. You don’t need to be a pro, but being comfortable with a computer and with typing code to achieve a result is very useful.
• Because it is free and extremely powerful, R is the only statistics program you will ever really need to know. If you go on to graduate school or into consulting or any field that deals with data, you will be able to use R.

This book will teach you many of the basics you will need to know in R, but one of the best things about R is that it can be expanded to accomplish nearly any statistical (or, more generally, data analytic) needs you might have. The same cannot be said of other programs like JMP, SPSS, or SAS, which are very expensive and may not be available to you at another institution. Check out Figure 0.1 for evidence that R has become the program of choice (at least in Ecology, but the same is true in other fields as well).

[Figure 0.1: line plot of the proportion of citing papers (y-axis, 0.00 to 0.35) by year (x-axis, 1990 to 2010) for SAS, R, SPSS, and JMP.]

Figure 0.1 This figure, from Touchon and McCoy (2016), demonstrates the rise in usage of R as compared to SAS, SPSS, and JMP, in the field of ecology. R really is the go-to program, so it is in your best interest to learn it. Touchon, J.C. and McCoy, M.W. (2016). “The mismatch between current statistical practice and doctoral training in ecology.” Ecosphere. 7(8):e01394. Reproduced under Creative Commons Attribution License (CC-BY).

Okay, shall we get started?
Acknowledgments

This book owes a tremendous debt to many people. First and foremost, thank you to Andy Jones and Stuart Dennis. The three of us took a germ of an idea (a desire to teach folks the practical tools they would need to analyze their data in R) and created the initial workshop at the Smithsonian Tropical Research Institute (STRI) that this material evolved from. Thank you to Owen McMillan, Adriana Bilgray, and Paola Gomez at STRI for their continued support of me and my desire to teach people how to use R. More generally, thank you to the amazing community of scientists at STRI for providing such an incredible environment to learn and conduct research. Many thanks to James Vonesh and Mike McCoy, two invaluable mentors, colleagues, and friends over the years. Your knowledge of R certainly eclipses mine, and I hope I’ve done justice to all that you have taught me. Thank you to my doctoral and post-doctoral advisor Karen Warkentin. Karen and James wrote the National Science Foundation grant that generated the data used throughout this book. Many thanks to Tim Thurman for opening my eyes to the world of ggplot2 and dplyr. Thank you to the hundreds of interns, undergraduate and graduate students, postdocs, and professional scientists that I have had the pleasure of teaching over the past decade or so. The lessons in this book have been continually refined and improved based on your feedback, so thank you for making me a better teacher. In particular, thank you to the students in my 2020
Applied Biostatistics class at Vassar College for the countless typos they found in early drafts of these chapters. Lastly, thank you to my wife Myra Hughey for her patience, support, and editorial advice over the years. You are the best partner in research and life I could ever hope for.
Contents

Preface
Acknowledgments

Chapter 1: Introduction to R
  1.1 Overview
  1.2 Getting started
  1.3 Working from the script window
  1.4 Creating well-documented and annotated code
  1.5 Before we get started
  1.6 Creating objects
  1.7 Functions
  1.8 What your data should look like before loading into R
  1.9 Understanding various types of objects in R
  1.10 A litany of useful functions
  1.11 Assignment!

Chapter 2: Before You Begin (aka Thoughts on Proper Data Analysis)
  2.1 Overview
  2.2 Basic principles of experimental design
  2.3 Blocked experimental designs
  2.4 You can (and should) plan your analyses before you have the data!
  2.5 Best practices for data analysis
  2.6 How to decide between competing analyses
  2.7 Data are data are data

Chapter 3: Exploratory Data Analysis and Data Summarization
  3.1 The Resource-by-Predation dataset
  3.2 Reading in the data file
  3.3 Data exploration and error checking
  3.4 Summarizing and manipulating data
  3.5 Assignment!

Chapter 4: Introduction to Plotting
  4.1 Principles of effective figure making
  4.2 Data exploration using ggplot2
  4.3 Plotting your data
  4.4 Assignment!

Chapter 5: Basic Statistical Analyses using R
  5.1 Determining what type of analysis to do
  5.2 Avoiding pseudoreplication
  5.3 Testing for normality in your data
  5.4 Non-parametric tests
  5.5 Introducing linear models
  5.6 One-way analysis of variance (ANOVA)
  5.7 Multiple comparisons
  5.8 Assignment!

Chapter 6: More Linear Models in R!
  6.1 Getting started
  6.2 Multi-way Analysis of Variance (ANOVA)
  6.3 Linear regression
  6.4 Analysis of covariance (ANCOVA)
  6.5 The predict() function
  6.6 Plotting with ggplot() instead of qplot()
  6.7 Assignment!

Chapter 7: Generalized Linear Models (GLM)
  7.1 Understanding non-normal data
  7.2 GLMs
  7.3 Understanding and interpreting the GLM
  7.4 Calculating statistical significance with GLMs
  7.5 Coding the data as a binomial GLM
  7.6 Mixing GLMs and ANCOVAs together
  7.7 Using the predict() function with a GLM
  7.8 Making a much easier GLM/ANCOVA plot using ggplot2
  7.9 Assignment!

Chapter 8: Mixed Effects Models
  8.1 Understanding mixed effects models
  8.2 Assignment!

Chapter 9: Advanced Data Wrangling and Plotting
  9.1 The “tidyverse”
  9.2 Basic data wrangling
  9.3 Advanced data wrangling: Spreading and gathering your data
  9.4 Even more advanced data wrangling! Using the do() function
  9.5 Making better figures with ggplot2
  9.6 Basics of ggplot2
  9.7 Customizing your figure
  9.8 Combining data wrangling with plotting with ggplot2
  9.9 Assignment!

Chapter 10: Writing Loops and Functions in R
  10.1 for loops
  10.2 Understanding functions
  10.3 Writing functions
  10.4 How a function works
  10.5 Writing more complex functions: An example using simulations
  10.6 Assignment!

Chapter 11: Final Thoughts
  11.1 Understanding your data is the most important precursor to analyzing it
  11.2 Knowing how to get help is essential
  11.3 Your data analysis should be clear from the outset and you should avoid questionable techniques
  11.4 Presenting your data in well-constructed figures is key

Index