Python and R for the Modern Data Scientist: The Best of Both Worlds (Rick J. Scavetta, Boyan Angelov)

Rick J. Scavetta and Boyan Angelov
Python and R for the Modern Data Scientist
The Best of Both Worlds
Boston, Farnham, Sebastopol, Tokyo, Beijing

978-1-492-09340-4 [LSI]

Python and R for the Modern Data Scientist
by Rick J. Scavetta and Boyan Angelov

Copyright © 2021 Boyan Angelov and Rick J. Scavetta. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Angela Rufino
Production Editor: Katherine Tozer
Copyeditor: Tom Sullivan
Proofreader: Piper Editorial Consulting, LLC
Indexer: Sam Arnold-Boyd
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2021: First Edition

Revision History for the First Edition
2021-06-22: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492093404 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Python and R for the Modern Data Scientist, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

To my parents for giving me the best possible start in life. To my wife for being my rock. To my children for having the brightest dreams of the future.
–Boyan Angelov

For all of us ready, willing, and able to perceive the world through a wider lens, eliminating “the other” in our midst.
–Rick Scavetta

Table of Contents

Preface

Part I. Discovery of a New Language

1. In the Beginning
    The Origins of R
    The Origins of Python
    The Language War Begins
    The Battle for Data Science Dominance
    A Convergence on Cooperation and Community-Building
    Final Thoughts

Part II. Bilingualism I: Learning a New Language

2. R for Pythonistas
    Up and Running with R
    Projects and Packages
    The Triumph of Tibbles
    A Word About Types and Exploring
    Naming (Internal) Things
    Lists
    The Facts About Factors
    How to Find…Stuff
    Reiterations Redo
    Final Thoughts

3. Python for UseRs
    Versions and Builds
    Standard Tooling
    Virtual Environments
    Installing Packages
    Notebooks
    How Does Python, the Language, Compare to R?
    Import a Dataset
    Examine the Data
    Data Structures and Descriptive Statistics
    Data Structures: Back to the Basics
    Indexing and Logical Expressions
    Plotting
    Inferential Statistics
    Final Thoughts

Part III. Bilingualism II: The Modern Context

4. Data Format Context
    External Versus Base Packages
    Image Data
    Text Data
    Time Series Data
    Base R
    Prophet
    Spatial Data
    Final Thoughts

5. Workflow Context
    Defining Workflows
    Exploratory Data Analysis
    Static Visualizations
    Interactive Visualizations
    Machine Learning
    Data Engineering
    Reporting
    Static Reporting
    Interactive Reporting
    Final Thoughts

Part IV. Bilingualism III: Becoming Synergistic

6. Using the Two Languages Synergistically
    Faux Operability
    Interoperability
    Going Deeper
    Pass Objects Between R and Python in an R Markdown Document
    Call Python in an R Markdown Document
    Call Python by Sourcing a Python Script
    Call Python Using the REPL
    Call Python with Dynamic Input in an Interactive Document
    Final Thoughts

7. A Case Study in Bilingual Data Science
    24 Years and 1.88 Million Wildfires
    Setup and Importing Data
    EDA and Data Visualization
    Machine Learning
    Setting Up Our Python Environment
    Feature Engineering
    Model Training
    Prediction and UI
    Final Thoughts

A Python:R Bilingual Dictionary

Index

Preface

Why We Wrote This Book

We want to show data scientists why being more aware, informed, and deliberate about their tools is an optimal strategy for increased productivity. With this goal in mind, we didn’t write a bilingual dictionary (well, not only; you’ll find that handy resource in the Appendix). Ongoing discussions about Python versus R (the so-called “language wars”) have long since ceased to be productive. They recall, for us, Maslow’s hammer: “if all you have is a hammer, everything looks like a nail.” It’s a fantasy worldview set in absolutes, where one tool offers an all-encompassing solution. Real-world situations are context-dependent, and a craftsperson knows that tools should be chosen appropriately. We aim to showcase a new way of working by taking advantage of all the great data science tools available, regardless of the language they are written in. Thus we aim to develop both how the modern data scientist thinks and how they work.

We chose the word modern in the title not just to signify novelty in our approach. It allows us to take a more nuanced stance in how we discuss our tools. What do we mean by modern data science? Modern data science is:

Collective
    It does not exist in isolation. It’s integrated into wider networks, such as a team or organization. We avoid jargon when it creates barriers and embrace it when it builds bridges (see “Technical Interactions”).

Simple
    We aim to reduce unnecessary complexity in our methods, code, and communications.

Accessible
    It’s an open design process that can be evaluated, understood, and optimized.

Generalizable
    Its fundamental tools and concepts are applicable to many domains.

Outward looking
    It incorporates, is informed by, and is influenced by developments in other fields.

Ethical and honest
    It’s people-oriented. It takes best practices for ethical work, as well as a broader view of its consequences, for communities and society, into account. We avoid hype, fads, and trends that only serve short-term gains.

However the actual job description of a data scientist evolves in the coming years, we can expect that these timeless principles will provide a strong foundation.

Technical Interactions

Accepting that the world is more extensive, more diverse, and more complex than any single tool can serve presents a challenge that is best addressed directly and early.

This broadened perspective results in an increase in technical interactions. We must consider the programming language, packages, naming conventions, project file architecture, integrated development environments (IDEs), text editors, and on and on that will best suit the situation. Diversity gives rise to complexity and confusion.

The more diverse our ecosystem becomes, the more important it is to consider whether our choices act as bridges or barriers. We must always strive to make choices that build bridges with our colleagues and communities and avoid those that build barriers that isolate us and make us inflexible. There is plenty of room to contain all the diversity of choices we’ll encounter. The challenge in each situation is to make choices that balance personal preference and communal accessibility.

This challenge is found in all technical interactions. Aside from tool choice (a “hard” skill), it also includes communication (a “soft” skill). The content, style, and medium of communication, to name just a few considerations, also act as bridges or barriers to a specific audience.

Becoming bilingual in both Python and R is a step toward building bridges among members of the wider data science community.

Who This Book Is For

This book is aimed at data scientists at the intermediate stage of their careers. As such, it doesn’t attempt to teach data science. Nonetheless, early-career data scientists will also benefit from this book by learning what’s possible in a modern data science context before committing to any topic, tool, or language.

Our goal is to bridge the gap between the Python and R communities. We want to move away from a tribal, “us versus them” mentality and toward a unified, productive community. Thus, this book is for those data scientists who see the benefit of expanding their skill set and thereby their perspectives and the value that their work can add to a wide variety of data science projects.

It’s negligent to ignore the powerful tools available to us. We strive to be open to new, productive ways of achieving our programming goals and encourage our colleagues to get out of their comfort zone.

In addition, Part II and the Appendix also serve as useful references for those moments when you just need to quickly map something familiar in one language onto the other.

Prerequisites

To obtain the best value from this book, we assume the reader is familiar with at least one of the main programming languages in data science, Python and R. A reader with knowledge of a closely related one, such as Julia or Ruby, can also derive good value.

Basic familiarity with general areas of data science work, such as data munging, data visualization, and machine learning is beneficial, but not necessary, to appreciate the examples, workflow scenarios, and case study.

How This Book Is Organized

We’ve organized this book as if we’re learning a second spoken language as an adult.

In Part I we begin by going back in time to the origins of the two languages and show how this has influenced the current state by covering key breakthroughs. In our analogy with spoken languages, this helps provide a bit of context as to why we have quirks such as irregular verbs and plural endings. Etymology is interesting and helps you gain an appreciation of a language, like the seemingly endless forms of plural nouns in German, but it’s certainly not essential for speaking.[1] If you want to get right into the languages, skip straight to Part II.

[1] Etymology is the study of word origins and meanings.

Part II provides a deeper dive into the dialects of both languages by offering a mirrored perspective. First we will cover how a Python user should approach work with R, and then the other way around. This will expand not only your skill set but also your way of thinking as you appreciate how each language operates.

In this part, we’ll treat each language separately as we start to become bilingual. Just like becoming bilingual in a spoken language, we need to resist two defeating urges. The first urge is to point out how much more straightforward, or more elegant, or in some way “better,” something is in our mother tongue. Congratulations to you, but that’s not the point of learning a new language, is it? We’re going to learn each language in its own right. Although we’ll point out comparisons as we go along, they’ll help us deal with our native-language baggage.

The second urge is to constantly try to interpret literally and word for word between two languages. This prevents us from thinking (or even dreaming) in the new language, and sometimes it’s just not possible! Examples I like to use are phrases such as das schmeckt mir in German, or ho fame in Italian, which translate literally very poorly as “that tastes to me” (That tastes good) and “I have hunger” (I’m hungry). The point is, different languages allow for different constructs. This gives us new tools to work with and new ways to think, once we realize that we can’t map everything 1:1 onto our previous knowledge. Think of these chapters as our first step to mapping your knowledge of one language onto the other.

Part III covers the modern context of language applications. This includes a review of the broad ecosystem of open source packages as well as the variety of workflow-specific methods. This part will demonstrate when one language is preferred and why, although they’ll still be separate languages at this point. This will help you to decide which language to use for parts of a large data science project.

In spoken languages, lost in translation is a real thing. Some things just work better in one language. In German, mir ist heiß and ich bin heiß are both “I’m hot” in English, but a German speaker will distinguish hotness from the weather versus physique.
Other words like Schadenfreude, a compound word from “schaden” (damage) and “freude” (pleasure) meaning to take pleasure in someone’s difficulties, or Kummerspeck, a compound word from “kummer” (grief) and “speck” (bacon) referring to the weight gained due to emotional eating, are just so perfect there’s no use in translating them.

Part IV details the modern interfaces that exist between the languages. First, we became bilingual, using each language in isolation. Then, we identified how to choose one language over another. Now, we’ll explore tools that take us from separate and interconnected Python and R scripts to single scripts that weave the two languages together in a single workflow.

The real fun starts when you’re not just bilingual, but working within a bilingual community. Not only can you communicate in each language independently, but you can also combine them in novel ways that only other bilingual speakers will appreciate and understand. Bilingualism doesn’t just provide access to a new community but also creates in itself a new community. For purists, this is pure torture, but I hope we’ve moved beyond that. Bilinguals can appreciate the warning “The Ordnungsamt is monitoring Bergmannkiez today.” Ideally you’re not substituting words because you’ve forgotten them, but because it’s the best choice for the situation. There’s no great translation of Ordnungsamt (regulatory agency?) and Bergmannkiez is a neighborhood in Berlin that shouldn’t be translated anyway. Sometimes words in one language more easily convey a message, like Mundschutzpflicht, the obligatory wearing of face masks during the coronavirus pandemic.

Finally, Chapter 7 consists of a case study that will outline how a modern data science project can be implemented based on the material covered in this book. Here, we’ll see all the previous sections come together in one workflow.

Let’s Talk

The field of data science is continuously evolving, and we hope that this book will help you navigate easily between Python and R. We’re excited to hear what you think, so let us know how your work has changed! You can contact us via the companion website for the book. There you’ll find updated extra content and a handy Python/R bilingual cheat sheet.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/moderndatadesign/PyR4MDS.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Python and R for the Modern Data Scientist by Rick J. Scavetta and Boyan Angelov (O’Reilly). Copyright 2021 Boyan Angelov and Rick J. Scavetta, 978-1-492-09340-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/python-and-r-data-science.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

The authors acknowledge the contribution of many individuals who have helped make this book possible.

At O’Reilly, we thank Michelle Smith, a senior content acquisitions editor with unparalleled passion, breadth of knowledge, and foresight, with whom we were fortunate enough to work. We thank Angela Rufino, our content development editor, for keeping us on track during the writing process and lifting up our spirits with a literal wall of action heroes and kind words of encouragement. We are grateful to Katie Tozer, our production editor, for her patience and fastidious treatment of our manuscript. We are grateful to Robert Romano and the design team at O’Reilly. They not only aided in redrawing figures but also selected, as per our wishes, a vibrant, commanding, and truly impressive colossal squid for the cover! We also thank Chris Stone and the engineering team for technical help.

A special thanks goes out to the countless unseen individuals working behind the scenes at O’Reilly. We appreciate the amount of effort needed to make excellent and relevant content available.

We are also indebted to our technical reviewers, who gave generously of both their time and insightful comments borne from experience: Eric Pite and Ian Flores at RStudio, our fellow O’Reilly authors Noah Gift and George Mount, and the impeccable author Walter R. Paczkowski. Your comments were well received and improved the book immensely.

Rick also thanks all his students, both online and in-person, from the past 10 years. Every chance to pass on knowledge and understanding reaffirmed the value of teaching and allowed, however slight, a contribution to the great scientific endeavor. Rick is also thankful for the dedicated administrative support that allows him to maintain an active relationship with primary scientists around the world.

Finally, we extend a heartfelt thanks to not only Python and R developers but also to the broad, interconnected community of open source developers. Their creativity, dedication, and passion are astounding. It is difficult to consider how the data science landscape would look without the collective efforts of thousands of developers working together, crossing all borders, and spanning decades. Almost nothing in this book would be possible without their contributions!

PART I
Discovery of a New Language

To get things started, we’ll review the history of both Python and R. By comparing and contrasting these origin stories, you’ll better appreciate the current state of each language in the data science landscape. If you want to get started with coding, feel free to skip ahead to Part II.
