Beyond Spreadsheets with R
A beginner's guide to R and RStudio

DR. JONATHAN CARROLL

MANNING
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2018 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Jenny Stout
Project editors: Kevin Sullivan, Janet Vail
Copy editor: Corbin Collins
Proofreader: Tiffany Taylor
Technical proofreader: Hilde Van Gysel
Typesetter: Happenstance Type-O-Rama
Cover designer: Marija Tudor

ISBN 9781617294594
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – SP – 23 22 21 20 19 18
preface

Data is everywhere, and it's used in practically every industry in one way or another. One of the most common ways to interact with data, whether numbers or text, is with spreadsheet software. This approach offers several useful features: presenting data in a tabular view, allowing calculations to be performed using those values, and producing summaries of data. What spreadsheets don't tend to provide is a way to do this repeatedly, reproducibly, or programmatically (without clicking or copying and pasting). Spreadsheets can be great for displaying data (including limited data summaries), but when you want to do something truly powerful with data, you need to go beyond them to a programming language.

Data munging (manipulating raw data) is a cornerstone of data science. Munging techniques include cleaning, sorting, parsing, filtering, and pretty much anything else you need to do to make data truly useful. They say 90% of data science is preparing the data, and the other 90% is actually doing something with it. Don't underestimate how important it is to carefully prepare data; the interpretation of an analysis hinges on getting this step right. Using a programming language to perform data munging means the things you do to your data are recorded, can be reproduced from the raw source, and can be inspected later, even changed if necessary. Trying to do this from a spreadsheet means either writing down which button to press when, or accepting a broken link between output and input.

I love using R. It's useful in many ways. I never thought a language could be so flexible that it could calculate a t-test one moment and then request an Uber the next. Every word of this book has been processed by R code; the inline results were generated by actual R code and brought together using a third-party R package (knitr). I use R for the vast majority of my work, both data munging and analysis, which over the years has varied from estimating fish abundances to assessing genetic factors in cancer drug trials. I could not have done any of these things if I were limited to working in a spreadsheet program.

Over the course of reading this book, you'll learn enough of the ins and outs of the R programming language to be able to take the data you're interested in and produce an analysis well beyond what you'd be able to accomplish with a spreadsheet.

NOTE A message to those of you who have obtained a pirated copy of this book. Copyright infringement is commonly justified by those who partake in it with the notion that "no one loses anything." That's true. But only the infringer gains anything. Many, many hours went into the writing and publication of this book, and without a formal sale involved, any gain you receive from reading this book goes unnoticed and unappreciated. If you have an unofficial copy of this book and have found it useful, please consider buying a legitimate copy, either for yourself or for someone else you think might benefit from it.
acknowledgments

I would like to thank Manning Publications for the opportunity to write this book, in particular the large team behind the scenes working to bring it all together, including my editor, Jenny Stout; the production team of Kevin Sullivan, Janet Vail, and Tiffany Taylor; and technical proofreader Hilde Van Gysel.

I also thank the dedicated pool of reviewers who provided invaluable feedback during the book's development, including Anil Venugopal, Carlos Aya Moreno, Chris Heneghan, Daniel Zingaro, Danil Mironov, Dave King, Fabien Tison, Irina Fedko, Jenice Tom, Jobinesh Purushothaman, John D. Lewis, John MacKintosh, Michael Haller, Mohammed Zuhair Al-Taie, Nii Attoh-Okine, Stuart Woodward, Tony M. Dubitsky, and Tulio Albuquerque.

I'd also like to thank the overwhelmingly helpful communities on Stack Overflow and Twitter (under the #rstats hashtag) and give a special mention to the Asciidoctor team, who have made a fantastic publishing toolchain.

I am eternally grateful to the members of the diverse and supportive R community, the majority of whom voluntarily contribute packages to improve and extend the language. The feedback, suggestions, comments, and discussions I've had regarding the contents of this book from reviewers, Twitter followers, and colleagues have helped shape the book into what it is today, and for that I thank each of them.

The maintainers of the R packages mentioned in this book deserve special recognition. The tidyverse of packages has transformed the way I use R and has made working with data much simpler. Producing the code output for this book wouldn't have been possible without the knitr package, and for that I am most thankful.

I would like to thank my wife and children for their support while I wrote this book over the course of around two years, without which I would surely have gone mad.

Last but not least, I owe a great deal to the team behind the R language itself. This is open source software, available at no cost to its users. The team's tireless efforts toward continually maintaining and improving this extensive project are greatly appreciated. Their citation can be found from R via the citation() function, which produces the following:

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
about this book

Who needs this book? You do, of course. Given that you're reading this, I'm guessing that you have some data (stored as a spreadsheet, perhaps) and aren't quite sure what to do with it. That's fine; great, even. Maybe you want to learn something from your data. Maybe you want to find a new way to interact with it. Maybe you want to make a picture out of it. All great goals, but I'm also guessing you want to learn how to do some programming for the first time.

I'm not going to assume you know how to program already, or that you're familiar with the jargon. Perhaps you've already picked up a few programming books and been scared off by how fast they fly through the introductory material, trying to get you up to speed on every nuance of the way that particular language works. Not here. We'll take things slow and work through a lot of examples together, so that by the time we get to the end, you'll be comfortable doing what you want to do with your data.

I'm also not going to mention statistics. That's a topic for someone else to cover. If you don't have a background in statistics, don't worry; it's not a requirement here. We'll be looking at R programming, not statistics (though statistics is something R is very good at).

By the time you've finished reading this book, you should have a broad understanding of programming and how to do it with the R language; how data can be investigated, interrogated, and used to gain insights; and how to set yourself up for a robust, reproducible workflow that uses data to strengthen your conclusions. You'll see how to take a small dataset and transform it into meaningful, publication-quality graphics with far more flexibility than any spreadsheet software can offer. With just a dozen commands, you can turn the data shown in figure 1 (the mtcars dataset already available from within R, as shown in the RStudio data viewer) into the graphic in figure 2.
Figure 1 The mtcars dataset, available from within R, as viewed in the RStudio data viewer. This data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
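To give you a flavor of that dozen-line recipe, here is a minimal sketch using the ggplot2 package. This is my own approximation of such a plot, not necessarily the exact code used to produce figure 2, and the axis labels here are assumptions:

library(ggplot2)

ggplot(mtcars, aes(x = disp, y = mpg)) +
  # color by cylinder group; distinguish transmission by point shape
  geom_point(aes(colour = factor(cyl), shape = factor(am))) +
  # a linear fit to each cylinder group's data
  geom_smooth(aes(colour = factor(cyl)), method = "lm", se = FALSE) +
  labs(x = "Displacement (cu. in.)", y = "Miles per gallon",
       colour = "Cylinders", shape = "Transmission")

Don't worry if none of this means anything to you yet; by the end of the book it will.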
Figure 2 This visualization of the mtcars dataset plots the mileage (mpg, as well as fuel consumption in transformed units) against the engine displacement (disp) of the 32 vehicles, grouped both by the number of cylinders (cyl) and distinguished by their transmission (am), along with a linear fit to each cylinder group's data. This is achieved, formatting and all, in just a dozen lines of R code.

How to read this book

I present each chapter to you in a no-nonsense manner: I cover what's important and what's likely to become an issue if you're not careful. I can't cover every way to approach a problem, and I won't necessarily approach problems the same way other texts do. But I try to show you what I consider to be the best approach first and back that up with some alternatives that you're likely to encounter in other reading. The goal here is to make you a competent and productive R user, which may mean showing you how to do things the slow way (as well as the fast way).
FORMATTING

New terms and definitions are shown in italics when they are first mentioned. Code samples and data values are printed in a monospace font, either inline (for mentions of code) such as str(mtcars) or in code blocks for examples you should try yourself, such as this one:

myData <- head(mtcars, n = 2)

When a code sample produces output, it is shown below the input with the prefix #>, and you should generally expect to see the same if you run the code yourself. The output for the vast majority of examples has been generated by R itself in the course of writing this book. Don't worry if you try to run the lines starting with #>; they will be ignored by R:

myData
#>               mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

Options that are available via a menu appear as a sequence of selections to make, such as File > Save > OK. I tell you plainly which buttons to click and which keys to press.

Examples are sometimes shown as blocks of annotated code, like this one, which reads some data from a .csv file and calculates the average height value:

peopleData <- read.csv(file = "people.csv")  ①
summary(peopleData)                          ②
mean(peopleData$height)                      ③

① Reads the data from the .csv file into a data.frame
② summary() acting on a data.frame returns a column-wise 5-number summary.
③ You can take the mean() of a column of values.

Certain kinds of information are highlighted along the way:

NOTE When a piece of information is particularly critical or important, it will be presented in a block like this one. Such blocks also indicate additional information, historical curiosities, or other notes.

CAUTION R won't always stop you from doing something you didn't intend. In fact, sometimes it will seem to be actively trying to catch fire. Where fires are easily started, they're pointed out like this to help you avoid them.

TIP There are typically many ways to solve a problem using R, and I only discuss the simplest in any detail here. Where a better solution exists (but requires more information), I note it like this and try to give you enough information to go find out more yourself.

In some cases, code blocks are not accompanied by output because the code does not actually run; these code blocks are for illustration purposes only. Where output is shown, you should expect to get similar results when you run the code.

Errors produced by R begin with the word Error. You'll see lots of these in the code in this book. The precise wording of an error may differ slightly between versions. Take care when entering blocks of code containing one of these errors, as that output cannot be parsed by R.
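For example, here is a small error you can safely trigger yourself by asking R to add a number to a piece of text:

"a" + 1
#> Error in "a" + 1 : non-numeric argument to binary operator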
Throughout the book I'll also show you what a spreadsheet-equivalent starting point might look like. I'll use LibreOffice, which looks like figure 3, but the concepts will usually extend to Excel, Google Sheets, or whichever spreadsheet software you usually use.

Figure 3 An example of cells selected in LibreOffice (Linux)

STRUCTURE

As we progress through the book together, there will be lots of examples that I hope you will work through. Don't just read them: run them on your computer yourself and see if you get the same answers. Then try a variation on the example and see if you get the result you expect. If you get something different, that's great! It means you've found something to learn from, and your next task will be to understand why the result is what it is. I will try to progressively build up your knowledge of the relevant programming and R-specific terms, so don't be afraid to go back and revise if something seems unfamiliar.

Getting started

Here's what you will need:

• This book
• A computer
• A desire to learn something

Really, that's about it. R is a free language (free as in speech: openly available; and free as in beer: it costs nothing), and we'll be using more free software to interact with it. You'll probably need an internet connection to download the (free) software, but after that the majority of examples will work offline.

Follow along with the examples as they appear. Try different values and see if you get the result you expect. Break things and try to understand what happened. It's very difficult to end up in a situation that can't be resolved by restarting R, so feel free to experiment.

This book won't necessarily direct you toward how to solve your specific problems, but it should give you enough of a comprehension of the language and its ecosystem for you to begin working out what other tools you might need. If you're working in genomics, there's a good chance you'll need some of the more advanced tools provided by the Bioconductor suite of packages (www.bioconductor.org). Many of the concepts and structures used there extend from those you'll learn about in this book (though I don't cover them here).

Where to find more help

Stack Overflow (https://stackoverflow.com) is an immensely useful source of information under the r tag, but it's frequently overrun with poorly researched questions and thankless responses. Take the time to figure out whether your question has already been answered (which happens regularly, given how many questions have been asked) before insisting that someone else solve your problem. If all else fails, typing the terms you do know, along with r or rstats, into a search engine (such as Google) tends to produce useful results more often than not.

The R Weekly site (https://rweekly.org) provides a weekly summary of the most interesting R posts from around the web. R-bloggers (https://r-bloggers.com) provides a syndication of many popular R-related blogs and has fresh content daily.
Follow along with some of these that align with your interests, and you're bound to come across some useful tips. Finally, reach out to your local community, either in person (try https://meetup.com) or online (Twitter, #rstats).

More about this book

This book was written in the AsciiDoc plain-text markup language using emacs and RStudio. The R code herein was evaluated using a custom package library defined via the switchr R package and interwoven with the source using the knitr R package. The session information describing the environment that defines this custom library is as follows:

#> setting value
#> version R version 3.4.3 (2017-11-30)
#> system x86_64, linux-gnu
#> ui X11
#> language en_AU:en
#> collate en_AU.UTF-8
#> tz Australia/Adelaide
#> date 2018-01-23
#>
#> package * version date source
#> assertthat 0.2.0 2017-04-11 CRAN (R 3.4.3)
#> backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
#> base * 3.4.3 2017-12-01 local
#> bindr 0.1 2016-11-13 CRAN (R 3.4.3)
#> bindrcpp 0.2 2017-06-17 CRAN (R 3.4.3)
#> broom 0.4.3 2017-11-20 CRAN (R 3.4.3)
#> cellranger 1.1.0 2016-07-27 CRAN (R 3.4.3)
#> cli 1.0.0 2017-11-05 CRAN (R 3.4.3)
#> colorspace 1.3-2 2016-12-14 CRAN (R 3.4.3)
#> commonmark 1.4 2017-09-01 CRAN (R 3.4.3)
#> compiler 3.4.3 2017-12-01 local
#> crayon 1.3.4 2017-09-16 CRAN (R 3.4.3)
#> crosstalk 1.0.0 2016-12-21 CRAN (R 3.4.3)
#> curl 3.1 2017-12-12 CRAN (R 3.4.3)
#> data.table 1.10.4-3 2017-10-27 CRAN (R 3.4.3)
#> datasauRus * 0.1.2 2017-05-08 CRAN (R 3.4.3)
#> datasets * 3.4.3 2017-12-01 local
#> devtools * 1.13.4 2017-11-09 CRAN (R 3.4.3)
#> digest 0.6.14 2018-01-14 CRAN (R 3.4.3)
#> dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.3)
#> evaluate 0.10.1 2017-06-24 CRAN (R 3.4.3)
#> forcats * 0.2.0 2017-01-23 CRAN (R 3.4.3)
#> foreign 0.8-67 2016-09-13 CRAN (R 3.3.1)
#> ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.3)
#> glue 1.2.0 2017-10-29 CRAN (R 3.4.3)
#> graphics * 3.4.3 2017-12-01 local
#> grDevices * 3.4.3 2017-12-01 local
#> grid 3.4.3 2017-12-01 local
#> gtable 0.2.0 2016-02-26 CRAN (R 3.4.3)
#> haven 1.1.1 2018-01-18 CRAN (R 3.4.3)
#> here * 0.1 2017-05-28 CRAN (R 3.4.3)
#> hms 0.4.0 2017-11-23 CRAN (R 3.4.3)
#> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.3)
#> htmlwidgets * 1.0 2018-01-20 CRAN (R 3.4.3)
#> httpuv 1.3.5 2017-07-04 CRAN (R 3.4.3)
#> httr * 1.3.1 2017-08-20 CRAN (R 3.4.3)
#> jsonlite 1.5 2017-06-01 CRAN (R 3.4.3)
#> knitr * 1.18 2017-12-27 CRAN (R 3.4.3)
#> lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
#> lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.3)
#> leaflet * 1.1.0 2017-02-21 CRAN (R 3.4.3)
#> lubridate 1.7.1 2017-11-03 CRAN (R 3.4.3)
#> magrittr 1.5 2014-11-22 CRAN (R 3.4.3)
#> mapproj * 1.2-5 2017-06-08 CRAN (R 3.4.3)
#> maps * 3.2.0 2017-06-08 CRAN (R 3.4.3)
#> memoise 1.1.0 2017-04-21 CRAN (R 3.4.3)
#> methods * 3.4.3 2017-12-01 local
#> mime 0.5 2016-07-07 CRAN (R 3.4.3)
#> misc3d 0.8-4 2013-01-25 CRAN (R 3.4.3)
#> mnormt 1.5-5 2016-10-15 CRAN (R 3.4.3)
#> modelr 0.1.1 2017-07-24 CRAN (R 3.4.3)
#> munsell 0.4.3 2016-02-13 CRAN (R 3.4.3)
#> nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
#> openxlsx 4.0.17 2017-03-23 CRAN (R 3.4.3)
#> parallel 3.4.3 2017-12-01 local
#> pillar 1.1.0 2018-01-14 CRAN (R 3.4.3)
#> pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.3)
#> plot3D * 1.1.1 2017-08-28 CRAN (R 3.4.3)
#> plyr 1.8.4 2016-06-08 CRAN (R 3.4.3)
#> psych 1.7.8 2017-09-09 CRAN (R 3.4.3)
#> purrr * 0.2.4 2017-10-18 CRAN (R 3.4.3)
#> R6 2.2.2 2017-06-17 CRAN (R 3.4.3)
#> Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
#> readr * 1.1.1 2017-05-16 CRAN (R 3.4.3)
#> readxl 1.0.0 2017-04-18 CRAN (R 3.4.3)
#> reshape2 * 1.4.3 2017-12-11 CRAN (R 3.4.3)
#> rex * 1.1.2 2017-10-19 CRAN (R 3.4.3)
#> rio * 0.5.5 2017-06-18 CRAN (R 3.4.3)
#> rlang * 0.1.6 2017-12-21 CRAN (R 3.4.3)
#> rmarkdown * 1.8 2017-11-17 CRAN (R 3.4.3)
#> roxygen2 * 6.0.1 2017-02-06 CRAN (R 3.4.3)
#> rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
#> rstudioapi 0.7 2017-09-07 CRAN (R 3.4.3)
#> rvest 0.3.2 2016-06-17 CRAN (R 3.4.3)
#> scales 0.5.0 2017-08-24 CRAN (R 3.4.3)
#> shiny 1.0.5 2017-08-23 CRAN (R 3.4.3)
#> stats * 3.4.3 2017-12-01 local
#> stringi 1.1.6 2017-11-17 CRAN (R 3.4.3)
#> stringr * 1.2.0 2017-02-18 CRAN (R 3.4.3)
#> switchr * 0.12.6 2017-11-07 CRAN (R 3.4.1)
#> testthat * 2.0.0 2017-12-13 CRAN (R 3.4.3)
#> tibble * 1.4.1 2017-12-25 CRAN (R 3.4.3)
#> tidyr * 0.7.2 2017-10-16 CRAN (R 3.4.3)
#> tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.3)
#> tools 3.4.3 2017-12-01 local
#> utils * 3.4.3 2017-12-01 local
#> withr 2.1.1 2017-12-19 CRAN (R 3.4.3)
#> xml2 1.1.1 2017-01-24 CRAN (R 3.4.3)
#> xtable 1.8-2 2016-02-05 CRAN (R 3.4.3)

Details for installing the specific versions of these packages are provided in appendix C.

The code for the examples in the book is located at https://github.com/BeyondSpreadsheetsWithR/Book. There is also an issue tracker where people can link directly to the R code in which they find an issue: https://github.com/BeyondSpreadsheetsWithR/Book/issues. The source code is also available from the publisher's website at www.manning.com/books/beyond-spreadsheets-with-r.
Book forum

Purchase of Beyond Spreadsheets with R includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum, go to https://forums.manning.com/forums/beyond-spreadsheets-with-r. You can also learn more about Manning's forums and the rules of conduct at https://forums.manning.com/forums/about.

Manning's commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher's website as long as the book is in print.

about the author

Author photo by Ewa Jermakowicz

JONATHAN CARROLL holds a PhD in theoretical astrophysics from the University of Adelaide, Australia, and is currently working as an independent contractor providing R programming services in data science. He contributes packages to R, is a frequent contributor of answers on Stack Overflow, and is an avid science communicator.
about the cover illustration

The figure on the cover of Beyond Spreadsheets with R is captioned "Habit of a Turkish Dancer in 1700." The illustration is taken from Thomas Jefferys' A Collection of the Dresses of Different Nations, Ancient and Modern (four volumes), London, published between 1757 and 1772. The title page states that these are hand-colored copperplate engravings, heightened with gum arabic.

Thomas Jefferys (1719–1771) was called "Geographer to King George III." He was an English cartographer who was the leading map supplier of his day. He engraved and printed maps for government and other official bodies and produced a wide range of commercial maps and atlases, especially of North America. His work as a map maker sparked an interest in the local dress customs of the lands he surveyed and mapped, which are brilliantly displayed in this collection.

Fascination with faraway lands and travel for pleasure were relatively new phenomena in the late 18th century, and collections such as this one were popular, introducing both the tourist and the armchair traveler to the inhabitants of other countries. The diversity of the drawings in Jefferys' volumes speaks vividly of the uniqueness and individuality of the world's nations some 200 years ago. Dress codes have changed since then, and the diversity by region and country, so rich at the time, has faded away. It's now often hard to tell the inhabitants of one continent from another. Perhaps, trying to view it optimistically, we've traded a cultural and visual diversity for a more varied personal life, or a more varied and interesting intellectual and technical life.

At a time when it's difficult to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Jefferys' pictures.
1 Introducing data and the R language

This chapter covers
• Why data analysis is important
• How to make your analysis robust
• How and why R works with data
• RStudio: Your interface to R

You have your data, and you want to start doing something awesome with it, right? Brilliant! I promise you, we'll get to that as soon as we can. But first, let's take a step back. Telling you to dive right in now would be like handing you a pile of different timbers, pointing you toward the workshop, and telling you to make some furniture. It's a good idea to first understand both the materials and the tools you're about to use.

We'll go through what data means in general (to you and to those who may eventually inherit your data), because if you don't fully comprehend what you already have, then building on it won't be useful (and at worst will be flat-out wrong). Poorly preparing data merely delays dealing with it properly and grows your technical debt: it makes things easier now, at the cost of paying back that time later when you have difficulties working with poorly formed data. We'll discuss how to set yourself up for a rigorous analysis (one that can be repeated) and then begin working with one of the best data analysis tools available: the R programming language.
For now, let's go through what it means to "have some data."

1.1 Data: What, where, how?

I said you have some data that you want to do something with, which wasn't a very precise statement. That was intentional. I guarantee you have some data, even if you don't realize it. You may be thinking that data is exclusively whatever is stored in your Excel file, but data is much more than that. We all have data, because it's everywhere. Before you go analyzing your own data, it's important to recognize its structure (both as you understand it, and as R will) so that you begin with a solid foundation of what it means to have some data.

1.1.1 WHAT IS DATA?

Data exists in many forms, not just as numbers and letters in a spreadsheet. It may also be stored in a different file type, such as comma-separated values (CSV), as words in a book, or as values in a table on a web page.

NOTE It's common to store comma-separated values in a .csv file. This format is particularly useful because it's plain text: values separated by commas. We'll return to why that's useful in section 1.1.6.
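To make that plain-text nature concrete, here is a minimal sketch; the file name and its values are hypothetical:

# A hypothetical people.csv contains nothing but these plain-text lines:
#   name,age,height
#   Alice,32,168
#   Bob,45,182

peopleData <- read.csv("people.csv")  # base R; no extra packages needed
peopleData
#>    name age height
#> 1 Alice  32    168
#> 2   Bob  45    182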
Data may not be stored at all; streaming data comes as a flow of information, such as the signal your TV picks up and processes, your Twitter feed, or the output from a measuring device. We can store this data if we want to, but often we want to understand the flow as it's happening.

Data isn't always pretty (in fact, most times it's dirty, mundane, and seemingly uninteresting), and it isn't always in the format we want. Having some tools on hand to manage data is a powerful advantage and is critical to achieving a reliable goal, but that's only useful if you know what your data represents before you do anything further with it. "Garbage in, garbage out" warns that you can't perform an analysis on terrible data and expect to get a meaningful result. You may very well have tried to evaluate a calculation in Excel only to have the result show up as #VALUE! because you tried to divide a number by some text, even though that "text" looked like numbers. The types of your values (text, numbers, images, and so on) are themselves pieces of data with possible meanings behind them, and you'll learn how to best make use of them.

So what is "good data"? What do the values you have represent?

1.1.2 SEEING THE WORLD AS DATA SOURCES

We experience the world through our senses: touching, seeing, hearing, tasting, smelling, and generally absorbing life around us. Each of those input channels handles available data, and our brains process them, mixing the signals together to form our picture of the world in a brilliantly complex way that we constantly take for granted. Every time you use any of your senses, you're taking a measurement of the world. How bright is the sun today? Is a car approaching? Is something burning? Is there enough coffee left in the pot for another cup?

We construct measuring tools to make life easier for us and to handle some of the data consistently: thermometers to measure temperatures, scales to measure weights, rulers to measure lengths. We go a step further and create more tools to summarize that data: car instrument panels to simplify the internal measurements of the engine; weather stations to summarize temperature, wind, and pressure.

With the digital age, we now have an overload of data sources at our disposal. The internet provides data on virtually any and all aspects of the world we might be interested in, and we create more tools to manage it: weather, finance, social media, the number of astronauts currently in space (www.howmanypeopleareinspacerightnow.com), lists of episodes of The Simpsons, all available on demand.

The world is truly made up of data. That's not to say the data is in any way finite. We constantly add to the available sources of data, and by asking new questions we can identify new data we want to obtain. Data itself also generates more data. Metadata is the additional data that describes some other data: the number of subjects in a trial, the units of a measurement, the time at which a sample was taken, the website from which the data was collected. All of these are data too and need to be stored, maintained, and updated as they change.

You interact with data in various ways all the time. One of the greatest achievements of the World Wide Web has been to gather, collate, and summarize our data for us in more easily digestible forms. Think about how you would have requested a taxi 20 years ago, before the rise of smartphones and the app ecosystem. You'd look up the phone number of a taxi company, phone them, and tell the dispatcher where you were or would be, where you wanted to go, and what time you wanted to be picked up. The dispatcher would send out the request to all drivers, one of whom would accept it. At the end of your journey, you'd pay with cash or a card transaction and receive a receipt. Now, with the digital connections between devices, continuous internet access, and GPS tracking, that process simplifies to opening a rideshare app, entering your destination, and receiving a fare estimate, because your phone already knows where you are.
The rideshare program receives this data and selects an appropriately close and available driver, exchanges your contact details in case anyone needs them, and routes the driver to you. At the end of your journey, your account is charged the appropriate amount, and a receipt is emailed to you.

In both cases, the same data flowed between all the parties. In the latter, fewer people needed to be involved, because the computer systems have access to the relevant data. Your phone communicates with the rideshare server, your phone communicates with the GPS system to locate itself, and the rideshare server communicates with a payment server to authorize payment and an email server to send the receipt. At every point along the way, various data can be collected (anonymously, where required) and saved for later analysis. How many people requested rides to the airport this month? What was the average distance travelled? What was the average wait time? Do people request more expensive trips from Apple or Android devices? Some of this was available previously, but it has never been easier to aggregate and compare.

Many businesses open up access to third-party developers using an application programming interface (API) so that their data can be accessed more systematically. For example, Uber has an API that allows software to ask for fare estimates or ride histories (with authentication, to approved accounts). This is how your phone app is able to communicate with the Uber servers. Sure enough, someone has written an R package to work with this API, meaning you can include data direct from Uber in your analysis, or (in theory) request a ride directly from R.

NOTE Good software has a documented way to interact with it so that users and the software are able to communicate clearly and effectively. This can describe requests that can be sent to a server (and the expected responses) or just how a function should be used (and the expected return value).
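As a rough illustration only (the endpoint, query parameters, and token below are entirely hypothetical, and this is neither the actual Uber API nor the package just mentioned), talking to a web API from R with the httr package might look something like this:

library(httr)

# Hypothetical endpoint and token, for illustration only
response <- GET(
  "https://api.example.com/v1/estimates/price",
  query = list(start = "Adelaide CBD", end = "Adelaide Airport"),
  add_headers(Authorization = "Bearer YOUR_TOKEN")
)

status_code(response)             # 200 indicates a successful request
content(response, as = "parsed")  # the response body, e.g. parsed JSON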
1.1.3 DATA MUNGING

Data munging refers to the cleaning up and preparation of data. Most data collected isn't ready to be used in an analysis or presentation. Usually there are inputs to validate, summaries to calculate, values to combine or remove, or restructuring to perform. This is a commonly overlooked aspect of using data for science, but it's of vital importance: failing to properly handle data can lead to difficulty working with it and, worse, to incorrect conclusions drawn from it. The terms data munging, data wrangling, data science, data analysis, data hazmat, and many others are all names for more or less the same thing, with different emphases and different trajectories depending on where the data is coming from or going to.

Most analyses (be they elaborate, sophisticated regressions or simple visualizations) begin with some form of data munging. Often that's merely reading the data into software, in which case some of the handling is performed on your behalf with assumptions (these values are treated as dates, these as words, and so on). Having the power to control how that handling is performed can be essential when those assumptions are broken, or when you want to treat your data in a particular way.

Any time you have groups of records in your data, whether years, patients, animals, colors, vehicles, or anything else, and you need to treat them differently (color a line a certain way, include only records of similar things in an average, calculate how a quantity has changed between groups), you'll perform data munging, because you need to allocate records to a particular group somehow. Any other transformation, cleaning, or processing of the data also counts as data munging. It quickly becomes apparent that a large portion of any analysis can (or should) involve a lot of data munging if its conclusions are to be trusted.
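As a small taste of what that group-wise munging can look like, here is a sketch using the dplyr package (one of several ways to do this in R; the printed output is shown approximately). The mtcars records are allocated to groups by cylinder count, and each group's average mileage is calculated:

library(dplyr)

mtcars %>%
  group_by(cyl) %>%               # allocate each record to a cylinder group
  summarise(avg_mpg = mean(mpg))  # one summary value per group
#>   cyl avg_mpg
#> 1   4    26.7
#> 2   6    19.7
#> 3   8    15.1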
1.1.4 WHAT YOU CAN DO WITH WELL-HANDLED DATA

I hope it's clear by this point that data is potentially of great importance. It's routinely more than just numbers in a table. Medical data often represents real human lives and the effect a particular intervention has had, be that lifesavingly positive or tragically negative. These effects aren't always immediately obvious to someone viewing them from a given perspective, so it's the role of the data analyst (professional or incidental) to extract patterns from data in order to make a decision.

Analysis of data is often useful in extracting nonobvious patterns. For example, although you may recognize a pattern to the sequence

#> 2 4 6 8 10 12 14 16 18 20

(counting by twos), it may not be so clear what the pattern is in the following data

#> 0.000 0.841 0.909 0.141 -0.757 -0.959 -0.279 0.657 0.989 0.412

until you visualize the data (which was generated with a sin() function), as shown in figure 1.1. Having the right tools at hand to analyze our data means we can identify hidden patterns, forecast new information, and learn from the data.

Figure 1.1 A pattern emerges. These points were generated with a sin() function at the values 0, 1, …, 9. The smooth sin() function is also plotted here.

A classic example of data analysis is that of John Snow and the 1854 Broad Street cholera outbreak in London. People were dying by the hundreds within a particular district, at a time when sewerage infrastructure was all but nonexistent and the understanding of infectious diseases was highly limited. By carefully examining the locations of the cholera cases, John Snow was able to infer that the common link between them appeared to be that their closest source of water was a particular pump on Broad Street. Once the pump was disabled, cases of cholera diminished significantly. In this case, the data was in plain sight (the locations of cholera cases), but the pattern and connection weren't immediately apparent. See figure 1.2.
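If you'd like to reproduce the sine example above for yourself, here is a minimal sketch using only base R (rounding to three digits is my assumption about how those values were displayed):

x <- 0:9
round(sin(x), 3)   # the ten "mystery" values shown earlier
#>  [1]  0.000  0.841  0.909  0.141 -0.757 -0.959 -0.279  0.657  0.989  0.412

plot(x, sin(x))                            # the individual points
curve(sin, from = 0, to = 9, add = TRUE)   # the smooth sin() curve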