Population Genetics with R: An Introduction for Life Scientists (Áki Jarl Láruson, Floyd Allan Reed)


OUP CORRECTED PROOF – FINAL, 19/12/2020, SPi Population Genetics with R


Population Genetics with R
An Introduction for Life Scientists
ÁKI J. LÁRUSON & FLOYD A. REED

Great Clarendon Street, Oxford OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© Áki J. Láruson and Floyd A. Reed 2021

The moral rights of the authors have been asserted

First Edition published in 2021
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2020946917
ISBN 978–0–19–882953–9 (hbk.)
ISBN 978–0–19–882954–6 (pbk.)
DOI: 10.1093/oso/9780198829539.001.001

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

Acknowledgements

The authors would like to thank Victoria Sindorf and Vanessa Reed for their patience, support, and help in writing this book, and our families and friends, near and far, whose support means more than they’ll ever know. We would also like to thank Daniel Whitaker, Maria Costantini, Michael Wallstrom, Helen Sung, Justin Walguarnery, Molly Albecker, Sara Schaal, Alan Downey-Wall, Ffion Titmuss, Thais Bittar, and many graduate students at U. H. Mānoa for testing sections, finding errors, and providing feedback. Special thanks to Katie Lotterhos, who had to put up with this book taking up entirely too much time. Thanks also to Ian Sherman and Charles Bath at OUP; their good nature and patience throughout this process has been remarkable, and special acknowledgments to Jolene Sutton, Jarosław Bryk, and Mohamed Noor for their help at various points along the road. Finally, the authors thank Charles Aquadro, Richard Harrison, Alex Kondrashov, Richard Durrett, Rasmus Nielsen, and many others for teaching fundamental aspects of population genetics.


Contents

Acknowledgements

Chapter 1: Learning through Programming
1.1 Introduction
1.2 Organization

Chapter 2: Downloading and Installing R
2.1 Introduction to R
2.2 Working directories and saving

Chapter 3: Basic Commands in R
3.1 Input and calculations
3.2 Assigning objects to variables
3.3 Parts of a function
3.4 Classes of objects
3.5 Matrices

Chapter 4: Allele and Genotype Frequencies
4.1 Introduction to population genetics
4.2 Simulating genotypes
4.3 Calculating allele frequencies from datasets

Chapter 5: Statistical Tests and Algorithms
5.1 Deviation from expectations
5.2 Extending to more than two alleles
5.3 Blood types and allele frequencies
5.4 Expectation Maximization algorithm

Chapter 6: Genetic Variation
6.1 Genetic drift and evolutionary sampling
6.2 Variation over time
6.3 Quantifying variation
6.4 Equilibrium heterozygosity and effective population size
6.5 Overlapping generations

Chapter 7: Adaptation and Natural Selection
7.1 Positive selection
7.2 Adaptation, diploidy and dominance

Chapter 8: Population Differences
8.1 Quantifying divergence
8.2 Relative likelihood of the population of origin
8.3 DNA fingerprinting

Chapter 9: Pointing the Way to Additional Topics
9.1 The coalescent
9.2 Tests of neutrality
9.3 Linkage disequilibrium
9.4 Deleterious alleles
9.5 Fixation probability under selection and drift
9.6 Selfish genes
9.7 Broadening the models
9.8 R packages

References
Index

1 Learning through Programming

Population Genetics with R: An Introduction for Life Scientists. Áki J. Láruson and Floyd A. Reed, Oxford University Press (2021). © Áki J. Láruson & Floyd A. Reed. DOI: 10.1093/oso/9780198829539.003.0001

Term Definitions

Argument: A variable that is input into a function in a computer program.

Command Line Interface (CLI): A computer display window where a user can interface with the computer through text-based commands (sometimes called a terminal).

Function: A section of computer programming code that uses arguments as an input, completes a set of operations with those arguments, then outputs a result.

Graphical User Interface (GUI): A display where the user can interface with graphic icons using a cursor to make selections and issue commands.

Heuristic: A method of learning through direct practical implementation, which is not necessarily the most efficient.

Operating System: The software on a computer that serves as an interface between the computer’s hardware and other software programs. Windows, Mac, and Linux are examples of operating systems.

1.1 Introduction

Population genetics, as a field of study, provides many important tools for scientists working with natural systems. It can be a conceptually challenging discipline, since it does not generally keep track of the individual organisms we might be studying, like a blue whale or a pine tree, or the outcomes of specific breeding events, in contrast to classical genetics. Instead, population genetics seeks to determine the fundamental dynamics of genetic variation across an entire population, or even multiple populations, over the course of many generations. We can talk about changes in the frequency of specific genetic variants over time without keeping track of exactly which specific individuals do or do not have a particular copy of a variant.

The famed quantitative biologist R. A. Fisher made a comparison between population genetics and the theory of gases in physics:

The whole investigation may be compared to the analytical treatment of the Theory of Gases, in which it is possible to make the most varied assumptions as to the accidental circumstances, and even the essential nature of the individual molecules, and yet to develop the general laws as to the behaviour of gases, leaving but a few fundamental constants to be determined by experiment. (Fisher 1922).

Essentially one can predict a relationship between the pressure, volume, and temperature of a gas much like one can predict relationships between group size, inbreeding, and migration between populations, without keeping track of the specific interactions of all the individual molecules. The individual interactions give rise to overall emergent properties of the system.

This approach to biological questions is a bit abstract and can be difficult for new students to visualize. The visualization and study of the dynamics of genetic elements within and between populations requires a quantitative approach.

An unfortunate side effect of the widespread implementation of ready-to-use quantitative software packages is that some facets of analysis can become rote, which at best might be implemented without the full understanding of the executor and at worst are applied inappropriately, leading to misguided conclusions. As quantitative models and methods become established in certain fields, it becomes necessary for people just entering a discipline to understand the thinking that goes into these approaches, so as to correctly interpret the results. In this book we aim to emphasize building an understanding of population genetics starting from fundamental principles. This book is not a guide to current software packages that can be used to carry out data processing in a “canned” way.

Learning anything requires dedication. Whether we are aware of it or not, we are investing an exceptional amount of ourselves when we learn something new. It could be something as willful as studying a new language, or something as innocuous as memorizing movie quotes. What we remember and what we forget from an experience can sometimes surprise us, and it is the perspective of the authors that “learning by doing” is especially true for the quantitative approaches necessary to really understand the field of population genetics.

The challenge facing anyone wishing to learn population genetics methodologies is that it can be prohibitively laborious to calculate by hand the many parameters and summary statistics that lie at the heart of conceptual understanding. Fortunately, there exist a great many computer programs that allow for dataset manipulation and the implementation of these quantitative approaches.

Of particular note is the analytical software R, which has increasingly been the program of choice for exposure to basic statistical programming, as it is easily accessible (it’s free!), has cross-platform compatibility (Windows, Mac, and Linux all support distributions of R), and has the potential of hands-on implementation by the reader as well as using pre-packaged implementation by the educator (such as readily shareable function definitions). R can be used purely as a CLI, but also has the option of well-supported free-to-use GUIs, such as RStudio (more on that later).

In this book we employ a series of heuristic approaches to help the reader develop an understanding of patterns and expectations in population genetics. We make the disclaimer that we are approaching the coding in a way that highlights specific concepts and therefore frequently implement code in a way that may seem cumbersome and inelegant to experienced programmers. There are many ways the code we present could be streamlined, but it is our hope that the structured way in which we present the code underscores the basic functional mechanisms involved as well as the population genetic concepts being addressed, and sets the reader up to move toward “leaner” and more efficient implementation later on in the book, as the expectation for both conceptual understanding and coding proficiency grows.
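To make the contrast between "structured" and "leaner" code concrete, here is a minimal sketch. It is not taken from the book: the vector `alleles` and its values are made up purely for illustration. Both versions compute the same quantity, the proportion of "A" copies in a small sample.

```r
# Hypothetical data: ten allele copies coded "A" or "a" (invented for this sketch).
alleles <- c("A", "a", "A", "A", "a", "A", "a", "A", "A", "A")

# Step-by-step version: visit each copy in turn and count the "A" copies.
count_A <- 0
for (i in seq_along(alleles)) {
  if (alleles[i] == "A") {
    count_A <- count_A + 1
  }
}
freq_loop <- count_A / length(alleles)

# Streamlined version: the mean of a logical vector is the proportion of TRUE values.
freq_short <- mean(alleles == "A")

print(freq_loop)
print(freq_short)
```

The loop spells out every operation, which is closer in spirit to the style described above; the one-line version relies on R's vectorized comparison and gives the same answer with less typing.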

We urge readers to think about learning population genetics as an iterative process, as represented below, with the particular order of steps varying depending on which concept is being developed:

1. Start with logical arguments which can be written down as a model of a process.
2. Codify this model in a simple computer program to simulate the process and build a more intuitive understanding of the dynamics.
3. See how well the model predicts real data.
4. Use data to infer parameter values of the model.
5. Going back to step one, compare the original logical arguments and inference from data to refine the model.

At a higher level, we should start to see re-occurring patterns and interactions between different processes, for example, mutation, migration, drift, selection, etc., which should hopefully feed into a broader understanding of how these processes all contribute to the genetic make up of populations.

This book is not an exhaustive treatment of population genetics, statistics, and programming. These are rich and extensive fields and this book is intended as an introduction to point the way, via ferrata, to begin learning about these subjects so you don’t have to start from scratch. The original literature is the best source of fundamental knowledge; however, it can be very cryptic and hard to follow at times. There are other textbooks that we highly recommend to continue learning and expanding your knowledge of population genetics. The full references for these textbooks are provided at the end of the book, but we want to highlight a select few here:

• Hartl and Clark (2006). This has served as a standard workhorse textbook of population genetics for many years.
• Gillespie (2004). This is a good resource for succinct additional theoretical background details about a range of population genetics topics.

• Slatkin and Nielsen (2013). This is another succinct guide that focuses on population genetics from a unifying perspective of the coalescent.
• Hedrick (2010). Like Hartl and Clark (2006), this goes into extensive details about a wide range of population genetics topics.

Ultimately, however, you will want to spend the time working through the original literature on subjects of interest to gain a fuller understanding.

1.2 Organization

We have tried to be consistent with some visual cues throughout this book. When referring to file names, object names, or names of buttons on the keyboard, we’ll use teletype font to avoid confusion. We will color code different elements of the code; for example, when we talk about functions we will highlight them with a nice orange, while arguments within functions will be colored blue. As an example, the basic structure of a function will be represented like this: function(argument = value). When we want to represent explicit code input and output, we’ll use a box around the commands like this:

    > print("Hello world!")
    [1] "Hello world!"
    > seq(from = 1, to = 7, by = 2) # We'll cover this soon!
    [1] 1 3 5 7

As our code boxes get longer and more complex, we’ll often want to include helpful comments in our code (which we’ll color green) to help orient us and explain what is being done in different sections. To make comments inside your R code, just type “#” at the start of your comment and R will ignore everything after it on that line!

Notice the greater-than signs “>” and the “[1]” symbols in the above box? Whenever you type commands into the R terminal the line of input automatically starts with a “>” symbol. So there’s no need to type “>” at the beginning of your input code; it shows up automatically and signals the beginning of a new line of code. Similarly, the “[1]” shows up automatically whenever your code produces output. It is an index: it tells us that the value printed right after it is the first element of the output. In the examples above each result fits on a single line, so we only see “[1].” If the output were long enough to wrap onto several lines, each new line would begin with the index of its first element.

But we’re getting ahead of ourselves. In the next chapter we’ll go through a brief background of the R language, and then go through actually finding and installing R on our computer.
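The conventions described in this section (comments, the prompt, and the bracketed index in front of output) can be tried out with a short sketch like the one below. The object names `x` and `y` are arbitrary choices for illustration and are not from the book. Type the lines without any “>” prompt; R supplies that itself.

```r
# A line that starts with "#" is a comment; R skips it entirely.
x <- seq(from = 1, to = 7, by = 2)  # a comment can also follow code on the same line
print(x)

# A longer vector usually wraps across several lines in the console.
# Each line of printed output begins with the index of its first element.
y <- 101:140
print(y)
```

How many values fit on one line of output depends on the width of your console, so the bracketed numbers you see in front of the wrapped lines of `y` may differ from someone else's.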

2 Downloading and Installing R

Population Genetics with R: An Introduction for Life Scientists. Áki J. Láruson and Floyd A. Reed, Oxford University Press (2021). © Áki J. Láruson & Floyd A. Reed. DOI: 10.1093/oso/9780198829539.003.0002

Term Definitions

Comprehensive R Archive Network (CRAN): An online repository of the R base system and contributed packages.

General Public License (GPL): A copyright agreement that allows for the free use, modification, and sharing of a software with the caveat that derivative work has to follow the same license.

GNU’s Not Unix (GNU): An operating system and collection of software freely available through a GPL.

Gnu: Bovids native to Africa in the antelope genus Connochaetes sp., also called wildebeest.

Hypertext Markup Language (HTML): A language commonly used to build webpages.

Interpreter: A computer program that executes (that is, implements) commands given in a specific programming language.

Package: A bundle of code, datasets, and documentation that is readily shareable between R users.

Script: A body of text containing pre-written code that can be executed all at once as a program.

2.1 Introduction to R

The R statistical programming language has rapidly become the language of choice not just for introductory statistics courses, but also as a beginning language for people wanting to learn programming. While there are many other languages that may be better suited for general programming (such as Python, Perl, C++, Ruby, and some would even say Java), R has distinguished itself by being generally focused on dataset management and analysis, allowing neophytes to perform basic tasks relatively quickly.

Back in the early nineties, Drs. Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, developed an interpreter (written in C) to support statistical programming work. Since both Dr. Ihaka and Dr. Gentleman were familiar with the Bell Laboratory programming language S, this new statistical language took on a similar syntax to S. In part as an acknowledgment of the influence of S and in part based on the first name initials of the two developers, this new language was dubbed R (Ihaka 1998, Hornik 2017). R was first distributed as an open-source software project with a GPL in 1995, and then in 2000 the stable distribution of R version 1.0.0 was released. Since then, R has become especially popular among researchers and academics.

One of R’s big strengths as a language has been its dedicated and growing user community. Between online message boards where users help each other solve problems and user-developed packages containing unique scripts that are readily available through the CRAN (https://cran.r-project.org/), the R community is undoubtedly a major reason for the rapid adoption of this language by early learners. As of this writing, R is at version 4.0.3, which came out this last October, with millions of users worldwide and thousands of user-developed packages.

So how to get started? Well, another great thing about R is how easily available it is. It does require basic computer hardware and a working internet connection, but if you’re using a Windows, Mac, or Linux operating system, installing R should be quick and painless. The first step is to go to the CRAN website (https://cran.r-project.org/, see Fig. 2.1), where you can download everything you’ll need to use R (for free!).

Figure 2.1 The front page of the Comprehensive R Archive Network.

Having one site where everyone has to go to download all things R can result in some heavy network traffic, so there are actually mirror sites, or duplicates, of the CRAN website hosted at institutions all over the world. These sites are identical to the original CRAN site but allow for users to download the R software and R packages from networks much closer to home and with lower traffic. This can really speed up downloads and prevents the site from being swamped.

Once you’re on the CRAN main site, look for and click on the Mirrors link; it should be the top link of the left-hand menu. Find a country or institution that’s close to you and click on the associated link. This will take you to a CRAN mirror site. Everything should look exactly the same, except you’ll notice that the URL is different. Now you can select one of the three links to download R for the operating system you’re using. If you’re using Windows you’ll want to select Download R for Windows and then click the base link on the subsequent page to find the Download R #.#.# for Windows link. If you’re using Mac OS you can select the Download R for (Mac) OS X link, then select the R-#.#.#.pkg link under the Latest release: label. If you are using an older distribution of Mac OS, there are instructions on this page on how to proceed depending on which version you are using. For Linux you can, hopefully unsurprisingly, select the Download R for Linux link to go to a list of directories. Each directory is specific to the distribution of Linux you are using, so if, for example, you were using Ubuntu, you would select the Ubuntu link and then look for the instructions under “To install the complete R system.”

Each of the download pages should have key information necessary for troubleshooting download and installation issues. If you run into difficulties, a web search describing either your error message or simply “installing R on” followed by the name of the operating system you’re using should get you well on your way to troubleshooting most any issue that comes up.

If everything went well, you now have R installed on your computer. On Windows you can simply double click the R.exe icon (if you didn’t make a desktop shortcut during installation, you should be able to find it in the bin folder in the larger R folder, by default located in Program Files). On Mac or Linux you can open up the command-line terminal and type in

    R

Clicking the icon or typing in the above command should start up a command-line R session (Fig. 2.2). (Make sure it is a capital R, and if you get stuck in the session type quit() to get out.)

Figure 2.2 The basic R terminal screen.

Everything we cover in this book should function perfectly well from this command line. However, there have been some excellent GUIs developed for R that add some bells and whistles that can really help with keeping track of scripts, objects, help pages, and generated figures. A GUI is simply a system of graphics you can click on with a cursor to issue a command or make a selection, instead of typing it out on a terminal. A GUI we’ve used a lot is provided by RStudio, which, like R, is also free and easy to install. RStudio is technically an Integrated Development Environment (IDE) which comes with a GUI, but for now we’re mostly interested in RStudio for GUI-related purposes.

If you go to http://www.rstudio.com, you should be able to navigate to the option to download the RStudio Desktop installer for either a Windows, Mac, or Linux operating system (you have the option of purchasing a commercial license or downloading an RStudio server, but we will only be concerned with the free open-source licensed RStudio Desktop here). Once RStudio is successfully installed, you should open it up and see the four main areas of the GUI (Fig. 2.3).

The main terminal, the same terminal we saw in Fig. 2.2, should be visible in the bottom left pane. Directly above that, you effectively have a

Figure 2.3 The RStudio main screen.
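Once a session is running, whether in the plain terminal or in the RStudio console, a few base R commands can be used to check the installation. The sketch below is one possible check and not a step prescribed by the book; the object name `is_recent` is an arbitrary choice, and the comparison against 4.0.3 simply uses the version number mentioned in this chapter.

```r
# Print the version of R that is currently running.
print(R.version.string)

# getRversion() returns a version object that can be compared with a version string.
is_recent <- getRversion() >= "4.0.3"
print(is_recent)

# Show which repository (for example, a CRAN mirror) this session is set to use.
print(getOption("repos"))

# quit() ends the session. It is commented out here so the lines above can be
# run as a script without closing R.
# quit()
```

If `is_recent` prints FALSE, the installed R is older than the version mentioned in the text, which matters only if something you run later turns out to need a newer release.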
