Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling
Author: Javier Luraschi, Kevin Kuo, Edgar Ruiz
The Complete Guide to Large-Scale Analysis and Modeling. Converted from EPUB; the print edition has 293 pages, and the PDF has 388.
Mastering Spark with R The Complete Guide to Large-Scale Analysis and Modeling Javier Luraschi, Kevin Kuo, and Edgar Ruiz
Mastering Spark with R
by Javier Luraschi, Kevin Kuo, and Edgar Ruiz

Copyright © 2020 Javier Luraschi, Kevin Kuo, and Edgar Ruiz. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisition Editor: Jonathan Hassell
Development Editor: Melissa Potter
Production Editor: Elizabeth Kelly
Copyeditor: Octal Publishing, LLC
Proofreader: Rachel Monaghan
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

October 2019: First Edition

Revision History for the First Release
2019-10-04: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492046370 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Mastering Spark with R, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-04637-0

[LSI]
Dedication

To Adrian, Clara, Julian, Max, Mila and Roman.
Foreword

Apache Spark is a distributed computing platform built on extensibility: Spark’s APIs make it easy to combine input from many data sources and process it using diverse programming languages and algorithms to build a data application. R is one of the most powerful languages for data science and statistics, so it makes a lot of sense to connect R to Spark. Fortunately, R’s rich language features enable simple APIs for calling Spark from R that look similar to running R on local data sources. With a bit of background about both systems, you will be able to invoke massive computations in Spark or run your R code in parallel from the comfort of your favorite R programming environment.

This book explores using Spark from R in detail, focusing on the sparklyr package that enables support for dplyr and other packages known to the R community. It covers all of the main use cases in detail, ranging from querying data using the Spark engine to exploratory data analysis, machine learning, parallel execution of R code, and streaming. It also has a self-contained introduction to running Spark and monitoring job execution.

The authors are exactly the right people to write about this topic: Javier, Kevin, and Edgar have been involved in sparklyr development since the project started. I was excited to see how well they’ve assembled this clear and focused guide about using Spark with R.

I hope that you enjoy this book and use it to scale up your R workloads and connect them to the capabilities of the broader Spark ecosystem. And because all of the infrastructure here is open source, don’t hesitate to give the developers feedback about making these tools better.

Matei Zaharia
Assistant Professor at Stanford University, Chief Technologist at Databricks, and original creator of Apache Spark
Preface

In a world where information is growing exponentially, leading tools like Apache Spark provide support to solve many of the relevant problems we face today. From companies looking for ways to improve based on data-driven decisions, to research organizations solving problems in health care, finance, education, and energy, Spark enables analyzing much more information faster and more reliably than ever before.

Various books have been written for learning Apache Spark; for instance, Spark: The Definitive Guide is a comprehensive resource, and Learning Spark is an introductory book meant to help users get up and running (both are from O’Reilly). However, as of this writing, there is neither a book to learn Apache Spark using the R computing language nor a book specifically designed for the R user or the aspiring R user.

There are some resources online to learn Apache Spark with R, most notably the spark.rstudio.com site and the Spark documentation site at spark.apache.org. Both sites are great online resources; however, the content is not intended to be read from start to finish and assumes you, the reader, have some knowledge of Apache Spark, R, and cluster computing.

The goal of this book is to help anyone get started with Apache Spark using R. Additionally, because the R programming language was created to simplify data analysis, it is also our belief that this book provides the easiest path for you to learn the tools used to solve data analysis problems with Spark. The first chapters provide an introduction to help anyone get up to speed with these concepts and present the tools required to work on these problems on your own computer. We then quickly ramp up to relevant data science topics, cluster computing, and advanced topics that should interest even the most experienced users.

Therefore, this book is intended to be a useful resource for a wide range of users, from beginners curious to learn Apache Spark, to experienced readers
seeking to understand why and how to use Apache Spark from R.

This book has the following general outline:

Introduction
In the first two chapters, Chapter 1, Introduction, and Chapter 2, Getting Started, you learn about Apache Spark, R, and the tools to perform data analysis with Spark and R.

Analysis
In Chapter 3, Analysis, you learn how to analyze, explore, transform, and visualize data in Apache Spark with R.

Modeling
In Chapter 4, Modeling, and Chapter 5, Pipelines, you learn how to create statistical models with the purpose of extracting information, predicting outcomes, and automating this process in production-ready workflows.

Scaling
Up to this point, the book has focused on performing operations on your personal computer and with limited data formats. Chapter 6, Clusters, Chapter 7, Connections, Chapter 8, Data, and Chapter 9, Tuning, introduce distributed computing techniques required to perform analysis and modeling across many machines and data formats to tackle the large-scale data and computation problems for which Apache Spark was designed.

Extensions
Chapter 10, Extensions, describes optional components and extended functionality applicable to specific, relevant use cases. You learn about alternative modeling frameworks, graph processing, preprocessing data for deep learning, geospatial analysis, and genomics at scale.

Advanced
The book closes with a set of advanced chapters, Chapter 11, Distributed R, Chapter 12, Streaming, and Chapter 13, Contributing; these will be of greatest interest to advanced users. However, by the time you reach this section, the content won’t seem as intimidating; instead, these chapters will be as relevant, useful, and interesting as the previous ones.

The first group of chapters, 1–5, provides a gentle introduction to performing data science and machine learning at scale. If you are planning to read this book while also following along with code examples, these are great chapters to consider executing the code line by line. Because these chapters teach all of the concepts using your personal computer, you won’t be taking advantage of multiple computers, which Spark was designed to use. But worry not: the next set of chapters will teach this in detail!

The second group of chapters, 6–9, introduces fundamental concepts in the exciting world of cluster computing using Spark. To be honest, they also introduce some of the not-so-fun parts of cluster computing, but believe us, it’s worth learning the concepts we present. Besides, the overview sections in each chapter are especially interesting, informative, and easy to read, and help you develop intuitions as to how cluster computing truly works. For these chapters, we actually don’t recommend executing the code line by line, especially not if you are trying to learn Spark from start to finish. You can always come back and execute code after you have a proper Spark cluster. If you already have a cluster at work or you are really motivated to get one, however, you might want to use Chapter 6 to pick one and then Chapter 7 to connect to it.

The third group of chapters, 10–13, presents tools that should be quite interesting to most readers and will make it easier to follow along. Many advanced topics are presented, and it is natural to be more interested in some topics than others; for instance, you might be interested in analyzing geographic datasets, or perhaps you’re more interested in processing real-time datasets, or maybe you’d like to do both! Based on your personal interests or problems at hand, we encourage you to execute the code examples that are most relevant to you. All of the code in these chapters is
written to be executed on your personal computer, but you are also encouraged to use proper Spark clusters given that you’ll have the tools required to troubleshoot issues and tune large-scale computations.

Formatting

Tables generated from code are formatted as follows:

# A tibble: 3 x 2
  numbers text
    <dbl> <chr>
1       1 one
2       2 two
3       3 three

The dimensions of the table (number of rows and columns) are described in the first row, followed by column names in the second row and column types in the third row. There are also various subtle visual improvements provided by the tibble package that we make use of throughout this book.

Most plots are rendered using the ggplot2 package and a custom theme available in the appendix; however, because this book is not focused on data visualization, we only provide code to render a basic plot that won’t match the formatting we applied. If you are interested in learning more about visualization in R, consider specialized books like R Graphics Cookbook (O’Reilly).

Acknowledgments

We thank the package authors who enabled Spark with R: Javier Luraschi, Kevin Kuo, Kevin Ushey, and JJ Allaire (sparklyr); Romain François and Hadley Wickham (dplyr); Hadley Wickham and Edgar Ruiz (dbplyr); Kirill Müller (DBI); and the authors of the Apache Spark project itself, and its original author Matei Zaharia.
We thank the package authors who released extensions to enrich the Spark and R ecosystem: Akhil Nair (crassy); Harry Zhu (geospark); Kevin Kuo (graphframes, mleap, sparktf, and sparkxgb); Jakub Hava, Navdeep Gill, Erin LeDell, and Michal Malohlava (rsparkling); Jan Wijffels (spark.sas7bdat); Aki Ariga (sparkavro); Martin Studer (sparkbq); Matt Pollock (sparklyr.nested); Nathan Eastwood (sparkts); and Samuel Macêdo (variantspark).

We thank our wonderful editor, Melissa Potter, for providing us with guidance, encouragement, and countless hours of detailed feedback to make this book the best we could have ever written.

To Bradley Boehmke, Bryan Adams, Bryan Jonas, Dusty Turner, and Hossein Falaki, we thank you for your technical reviews, time, and candid feedback, and for sharing your expertise with us. Many readers will have a much more pleasant experience thanks to you.

Thanks to RStudio, JJ Allaire, and Tareef Kawaf for supporting this work, and the R community itself for its continuous support and encouragement.

Max Kuhn, thank you for your invaluable feedback on Chapter 4, in which, with your permission, we adapted examples from your wonderful book Feature Engineering and Selection: A Practical Approach for Predictive Models (CRC Press).

We also thank everyone indirectly involved but not explicitly listed in this section; we are truly standing on the shoulders of giants.

This book itself was written in R using bookdown by Yihui Xie, rmarkdown by JJ Allaire and Yihui Xie, and knitr by Yihui Xie; we drew the visualizations using ggplot2 by Hadley Wickham and Winston Chang; we created the diagrams using nomnoml by Daniel Kallin and Javier Luraschi; and we did the document conversions using pandoc by John MacFarlane.

Conventions Used in This Book
The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/r-spark/the-r-in-spark.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and
documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Mastering Spark with R by Javier Luraschi, Kevin Kuo, and Edgar Ruiz (O’Reilly). Copyright 2020 Javier Luraschi, Kevin Kuo, and Edgar Ruiz, 978-1-492-04637-0.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

NOTE
For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us
Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/SparkwithR.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Chapter 1. Introduction

You know nothing, Jon Snow.
-- Ygritte

With information growing at exponential rates, it’s no surprise that historians are referring to this period of history as the Information Age. The increasing speed at which data is being collected has created new opportunities and is certainly poised to create even more. This chapter presents the tools that have been used to solve large-scale data challenges. First, it introduces Apache Spark as a leading tool that is democratizing our ability to process large datasets. With this as a backdrop, we introduce the R computing language, which was specifically designed to simplify data analysis. Finally, this leads us to introduce sparklyr, a project merging R and Spark into a powerful tool that is easily accessible to all.

Chapter 2, Getting Started, presents the prerequisites, tools, and steps you need to perform to get Spark and R working on your personal computer. You will learn how to install and initialize Spark, get introduced to common operations, and get your very first data processing and modeling task done. It is the goal of that chapter to help anyone grasp the concepts and tools required to start tackling large-scale data challenges which, until recently, were accessible to just a few organizations.

You then move into learning how to analyze large-scale data, followed by building models capable of predicting trends and discovering information hidden in vast amounts of information. At that point, you will have the tools required to perform data analysis and modeling at scale. Subsequent chapters help you move away from your local computer into the computing clusters required to solve many real-world problems. The last chapters present additional topics, like real-time data processing and graph analysis, which you will need to truly master the art of analyzing data at any scale. The
last chapter of this book provides you with tools and inspiration to consider contributing back to the Spark and R communities.

We hope that this is a journey you will enjoy, one that helps you solve problems in your professional career and nudges the world into making better decisions that can benefit us all.

Overview

As humans, we have been storing, retrieving, manipulating, and communicating information since the Sumerians in Mesopotamia developed writing around 3000 BC. Based on the storage and processing technologies employed, it is possible to distinguish four distinct phases of development: premechanical (3000 BC to 1450 AD), mechanical (1450–1840), electromechanical (1840–1940), and electronic (1940–present).

Mathematician George Stibitz used the word digital to describe fast electric pulses back in 1942, and to this day, we describe information stored electronically as digital information. In contrast, analog information represents everything we have stored by any nonelectronic means such as handwritten notes, books, newspapers, and so on.

The World Bank report on digital development provides an estimate of digital and analog information stored over the past decades. This report noted that digital information surpassed analog information around 2003. At that time, there were about 10 million terabytes of digital information, which is roughly equivalent to 10 million storage drives today. However, a more relevant finding from this report was that our footprint of digital information is growing at exponential rates. Figure 1-1 shows the findings of this report; notice that every other year, the world’s information has grown tenfold.

With the ambition to provide tools capable of searching all of this new digital information, many companies attempted to provide such functionality with what we know today as search engines, used when searching the web. Given the vast amount of digital information, managing information at this scale was a challenging problem. Search engines were unable to store all of
the web page information required to support web searches in a single computer. This meant that they had to split information into several files and store them across many machines. This approach became known as the Google File System, and was presented in a research paper published in 2003 by Google.

Figure 1-1. World’s capacity to store information

Hadoop

One year later, Google published a new paper describing how to perform operations across the Google File System, an approach that came to be known as MapReduce. As you would expect, there are two operations in MapReduce: map and reduce. The map operation provides an arbitrary way to transform each file into a new file, whereas the reduce operation combines two files. Both operations require custom computer code, but the MapReduce framework takes care of automatically executing them across many computers at once. These two operations are sufficient to process all the data available on the web, while also providing enough flexibility to extract meaningful information from it.

For example, as illustrated in Figure 1-2, we can use MapReduce to count words in two different text files stored in different machines. The map operation splits each word in the original file and outputs a new word-counting file with a mapping of words and counts. The reduce operation can be defined to take two word-counting files and combine them by aggregating the totals for each word; this last file will contain a list of word counts across all the original files.

Counting words is often the most basic MapReduce example, but we can also use MapReduce for much more sophisticated and interesting applications. For instance, we can use it to rank web pages in Google’s PageRank algorithm, which assigns ranks to web pages based on the count of hyperlinks linking to a web page and the rank of the page linking to it.

Figure 1-2. MapReduce example counting words across files

After these papers were released by Google, a team at Yahoo worked on implementing the Google File System and MapReduce as a single open source project. This project was released in 2006 as Hadoop, with the Google File System implemented as the Hadoop Distributed File System (HDFS). The Hadoop project made distributed file-based computing accessible to a wider range of users and organizations, making MapReduce useful beyond web data processing.
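The word-counting example described for Figure 1-2 can be sketched on a single machine in base R. This is only an illustration of the map and reduce steps, not Hadoop or Spark code: the two text "files" and the function names map_word_count and reduce_word_count are hypothetical, and everything runs locally rather than across many computers.

```r
# Two small "files" standing in for text stored on different machines
# (hypothetical contents, for illustration only).
files <- list(
  file_a = "the quick brown fox",
  file_b = "the lazy dog and the fox"
)

# Map: transform one file into a word-counting "file",
# here a named integer vector of word -> count.
map_word_count <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]          # drop empty strings
  counts <- table(words)
  setNames(as.integer(counts), names(counts))
}

# Reduce: combine two word-counting files by summing the totals per word.
reduce_word_count <- function(left, right) {
  combined <- c(left, right)
  vapply(split(combined, names(combined)), sum, integer(1))
}

# Apply map to every file, then fold the results together with reduce.
mapped <- lapply(files, map_word_count)
word_counts <- Reduce(reduce_word_count, mapped)

print(word_counts)
```

Because the reduce step takes two word-counting files and returns another one of the same shape, it can be applied repeatedly to any number of mapped files, which is what lets a framework spread the work over many machines.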
Although Hadoop provided support to perform MapReduce operations over a distributed file system, it still required MapReduce operations to be written with code every time a data analysis was run. To improve upon this tedious process, the Hive project, released in 2008 by Facebook, brought Structured Query Language (SQL) support to Hadoop. This meant that data analysis could now be performed at large scale without the need to write code for each MapReduce operation; instead, one could write generic data analysis statements in SQL, which are much easier to understand and write.

Spark

In 2009, Apache Spark began as a research project at UC Berkeley’s AMPLab to improve on MapReduce. Specifically, Spark provided a richer set of verbs beyond MapReduce to facilitate optimizing code running across multiple machines. Spark also loaded data in-memory, making operations much faster than Hadoop’s on-disk storage. One of the earliest results showed that Spark ran logistic regression, a data modeling technique that we will introduce in Chapter 4, 10 times faster than Hadoop by making use of in-memory datasets. A chart similar to Figure 1-3 was presented in the original research publication.

Figure 1-3. Logistic regression performance in Hadoop and Spark
Even though Spark is well known for its in-memory performance, it was designed to be a general execution engine that works both in-memory and on-disk. For instance, Spark has set new records in large-scale sorting, for which data was not loaded in-memory; rather, Spark made improvements in network serialization, network shuffling, and efficient use of the CPU’s cache to dramatically enhance performance. If you needed to sort large amounts of data, there was no other system in the world faster than Spark.

To give you a sense of how much faster and more efficient Spark is, it takes 72 minutes and 2,100 computers to sort about 100 terabytes of data using Hadoop, but only 23 minutes and 206 computers using Spark. In addition, Spark holds the cloud sorting record, which makes it the most cost-effective solution for sorting large datasets in the cloud.

                 Hadoop record   Spark record
Data size        102.5 TB        100 TB
Elapsed time     72 mins         23 mins
Nodes            2,100           206
Cores            50,400          6,592
Disk             3,150 GB/s      618 GB/s
Network          10 GB/s         10 GB/s
Sort rate        1.42 TB/min     4.27 TB/min
Sort rate/node   0.67 GB/min     20.7 GB/min

Spark is also easier to use than Hadoop; for instance, the word-counting MapReduce example takes about 50 lines of code in Hadoop, but it takes only 2 lines of code in Spark. As you can see, Spark is much faster, more efficient, and easier to use than Hadoop.

In 2010, Spark was released as an open source project and then donated to the Apache Software Foundation in 2013. Spark is licensed under Apache 2.0, which allows you to freely use, modify, and distribute it. Spark then