Praise for Data Science at the Command Line

Traditional computer and data science curricula all too often mistake the command line for an obsolete relic instead of teaching it as the modern and vital toolset that it is. Only well into my career did I come to grasp the elegance and power of the command line for easily exploring messy datasets and even creating reproducible data pipelines for work. The first edition of Data Science at the Command Line was one of the most comprehensive and clear references when I was a novice in the art, and now with the second edition, I'm again learning new tools and applications from it.
—Dan Nguyen, data scientist, former news application developer at ProPublica, and former Lorry I. Lokey Visiting Professor in Professional Journalism at Stanford University

The Unix philosophy of simple tools, each doing one job well, then cleverly piped together, is embodied by the command line. Jeroen expertly discusses how to bring that philosophy into your work in data science, illustrating how the command line is not only the world of file input/output, but also the world of data manipulation, exploration, and even modeling.
—Chris H. Wiggins, associate professor in the department of applied physics and applied mathematics at Columbia University, and chief data scientist at The New York Times

This book explains how to integrate common data science tasks into a coherent workflow. It's not just about tactics for breaking down problems, it's also about strategies for assembling the pieces of the solution.
—John D. Cook, consultant in applied mathematics, statistics, and technical computing
Despite what you may hear, most practical data science is still focused on interesting visualizations and insights derived from flat files. Jeroen's book leans into this reality, and helps reduce complexity for data practitioners by showing how time-tested command-line tools can be repurposed for data science.
—Paige Bailey, principal product manager (code intelligence) at Microsoft, GitHub

It's amazing how fast so much data work can be performed at the command line before ever pulling the data into R, Python, or a database. Older technologies like sed and awk are still incredibly powerful and versatile. Until I read Data Science at the Command Line, I had only heard of these tools but had never seen their full power. Thanks to Jeroen, it's like I now have a secret weapon for working with large data.
—Jared Lander, chief data scientist at Lander Analytics, organizer of the New York Open Statistical Programming Meetup, and author of R for Everyone

The command line is an essential tool in every data scientist's toolbox, and knowing it well makes it easy to translate questions you have of your data to real-time insights. Jeroen not only explains the basic Unix philosophy of how to chain together single-purpose tools to arrive at simple solutions for complex problems, but also introduces new command-line tools for data cleaning, analysis, visualization, and modeling.
—Jake Hofman, senior principal researcher at Microsoft Research, and adjunct assistant professor in the department of applied mathematics at Columbia University
Data Science at the Command Line
Second Edition
Obtain, Scrub, Explore, and Model Data with Unix Power Tools
Jeroen Janssens
Data Science at the Command Line
by Jeroen Janssens

Copyright © 2021 Jeroen Janssens. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Sarah Grey
Production Editor: Kate Galloway
Copyeditor: Arthur Johnson
Proofreader: Shannon Turlington
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

October 2014: First Edition
August 2021: Second Edition

Revision History for the Second Edition
2021-08-17: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492087915 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Data Science at the Command Line, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Data Science at the Command Line is available under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. The author maintains an online version at https://github.com/jeroenjanssens/data-science-at-the-command-line.

978-1-492-08791-5
[LSI]
Once again to my wife, Esther. Without her continued encouragement, support, and patience, this second edition would surely have ended up in /dev/null.
Foreword

It was love at first sight. It must have been around 1981 or 1982 that I got my first taste of Unix. Its command-line shell, which uses the same language for single commands and complex programs, changed my world, and I never looked back.

I was a writer who had discovered the joys of computing, and regular expressions were my gateway drug. I'd first tried them in the text editor in HP's RTE operating system, but it was only when I came to Unix and its philosophy of small cooperating tools with the command-line shell as the glue that tied them together that I fully understood their power. Regular expressions in ed, ex, vi (now vim), and emacs were powerful, sure, but it wasn't until I saw how ex scripts unbound became sed, the Unix stream editor, and then AWK, which allowed you to bind programmed actions to regular expressions, and how shell scripts let you build pipelines not only out of the existing tools but out of new ones you'd written yourself, that I really got it.

Programming is how you speak with computers, how you tell them what you want them to do, not just once, but in ways that persist, in ways that can be varied like human language, with repeatable structure but different verbs and objects. As a beginner, other forms of programming seemed more like recipes to be followed exactly—careful incantations where you had to get everything right—or like waiting for a teacher to grade an essay you'd written. With shell programming, there was no compilation and waiting. It was more like a conversation with a friend. When the friend didn't understand, you could easily try again. What's more, if you had something simple to say, you could just say it with one word. And there were already words for a whole lot of the things you might want to say. But if there weren't, you could easily make up new words. And you could string together the words you learned and the words you made up into gradually more complex sentences, paragraphs, and eventually get to persuasive essays.

Almost every other programming language is more powerful than the shell and its associated tools, but for me at least, none provides an easier pathway into the programming mindset, and none provides a better environment for a kind of everyday conversation with the machines that we ask to help us with our work. As Brian Kernighan, one of the creators of AWK as well as the coauthor of the marvelous book The Unix Programming Environment, said in an interview with Lex Fridman, "[Unix] was meant to be an environment where it was really easy to write programs." [00:23:10] Kernighan went on to explain why he often still uses AWK rather than writing a Python program when he's exploring data: "It doesn't scale to big programs, but it does pretty darn well on these little things where you just want to see all the somethings in something." [00:37:01]

In Data Science at the Command Line, Jeroen Janssens demonstrates just how powerful the Unix/Linux approach to the command line is even today. If Jeroen hadn't already done so, I'd write an essay here about just why the command line is such a sweet and powerful match with the kinds of tasks so often encountered in data science. But he already starts out this book by explaining that. So I'll just say this: the more you use the command line, the more often you will find yourself coming back to it as the easiest way to do much of your work. And whether you're a shell newbie, or just someone who hasn't thought much about what a great fit shell programming is for data science, this is a book you will come to treasure.

Jeroen is a great teacher, and the material he covers is priceless.

Tim O'Reilly
May 2021
Preface

Data science is an exciting field to work in. It's also still relatively young. Unfortunately, many people, and many companies as well, believe that you need new technology to tackle the problems posed by data science. However, as this book demonstrates, many things can be accomplished by using the command line instead, and sometimes in a much more efficient way.

During my PhD program, I gradually switched from using Microsoft Windows to using Linux. Because this transition was a bit scary at first, I started with having both operating systems installed next to each other (known as a dual-boot). The urge to switch back and forth between Microsoft Windows and Linux eventually faded, and at some point I was even tinkering around with Arch Linux, which allows you to build up your own custom Linux machine from scratch. All you're given is the command line, and it's up to you what to make of it. Out of necessity, I quickly became very comfortable using the command line. Eventually, as spare time got more precious, I settled down with a Linux distribution known as Ubuntu because of its ease of use and large community. However, the command line is still where I'm spending most of my time.

It actually wasn't too long ago that I realized that the command line is not just for installing software, configuring systems, and searching files. I started learning about tools such as cut, sort, and sed. These are examples of command-line tools that take data as input, do something to it, and print the result. Ubuntu comes with quite a few of them. Once I understood the potential of combining these small tools, I was hooked.
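To give a taste of that potential, here's a minimal sketch of such a pipeline (the filename tips.csv is hypothetical; any simple CSV file would do, since cut is oblivious to quoted fields). Each tool reads the output of the previous one: cut extracts the second comma-separated column, sort puts identical values next to each other, uniq -c counts each run of identical values, and the final sort -nr ranks those counts from high to low:

$ cut -d, -f2 tips.csv | sort | uniq -c | sort -nr

Four small tools, each doing one thing well, chained together with pipes: that pattern is the heart of this book.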
After earning my PhD, when I became a data scientist, I wanted to use this approach to do data science as much as possible. Thanks to a couple of new, open source command-line tools, including xml2json, jq, and json2csv, I was even able to use the command line for tasks such as scraping websites and processing lots of JSON data.

In September 2013, I decided to write a blog post titled "7 Command-Line Tools for Data Science". To my surprise, the blog post got quite some attention, and I received a lot of suggestions for other command-line tools. I started wondering whether the blog post could be turned into a book. I was pleased that, some 10 months later, and with the help of many talented people (see the acknowledgments), the answer was yes.

I am sharing this personal story not so much because I think you should know how this book came about, but because I want you to know that I had to learn about the command line as well. Because the command line is so different from using a graphical user interface, it can seem scary at first. But if I could learn it, then you can as well. No matter what your current operating system is and no matter how you currently work with data, after reading this book you will be able to do data science at the command line. If you're already familiar with the command line, or even if you're already dreaming in shell scripts, chances are that you'll still discover a few interesting tricks or command-line tools to use for your next data science project.

What to Expect from This Book

In this book, we're going to obtain, scrub, explore, and model data—a lot of it. This book is not so much about how to become better at those data science tasks. There are already great resources available that discuss, for example, when to apply which statistical test or how data can best be visualized. Instead, this practical book aims to make you more efficient and productive by teaching you how to perform those data science tasks at the command line.

While this book discusses more than 90 command-line tools, it's not the tools themselves that matter most. Some command-line tools have been around for a very long time, while others will be replaced by better ones. New command-line tools are being created even as you're reading this. Over the years, I have discovered many amazing command-line tools. Unfortunately, some of them were discovered too late to be included in the book. In short, command-line tools come and go. But that's OK. What matters most is the underlying idea of working with tools, pipes, and data. Most command-line tools do one thing and do it well. This is part of the Unix philosophy, which makes several appearances throughout the book. Once you have become familiar with the command line, know how to combine command-line tools, and can even create new ones, you have developed an invaluable skill.

Changes for the Second Edition

While the command line as a technology and as a way of working is timeless, some of the tools discussed in the first edition have either been superseded by newer tools (e.g., csvkit has largely been replaced by xsv), been abandoned by their developers (e.g., drake), or been suboptimal choices (e.g., weka). I have learned a lot since the first edition was published in October 2014, either through my own experience or as a result of the useful feedback from my readers. Even though the book is quite niche because it lies at the intersection of two subjects, there remains a steady interest from the data science community, as evidenced by the many positive messages I receive almost every day. By updating the first edition, I hope to keep the book relevant for at least another five years.
Here's a nonexhaustive list of changes I have made:

- I replaced csvkit with xsv as much as possible. xsv is a faster tool for working with CSV files.
- In Chapters 2 and 3, I replaced the VirtualBox image with a Docker image. Docker is a faster and more lightweight way of running an isolated environment.
- I now use pup instead of scrape to work with HTML. scrape is a Python tool I created myself; pup is much faster, has more features, and is easier to install.
- Chapter 6 has been rewritten from scratch. Instead of drake, I now use make for project management. drake is no longer maintained, and make is much more mature and very popular with developers.
- I replaced Rio with rush. Rio is a clunky Bash script I created myself; rush is an R package that offers a much more stable and flexible way of using R from the command line.
- In Chapter 9, I replaced Weka and BigML with Vowpal Wabbit (vw). Weka is old, and the way it is used from the command line is clunky. BigML is a commercial API that I no longer want to rely on. Vowpal Wabbit is a very mature machine learning tool that was developed at Yahoo! and is now developed at Microsoft.
- Chapter 10 is an entirely new chapter about integrating the command line into existing workflows, including Python, R, and Apache Spark. In the first edition I mentioned that the command line can easily be integrated with existing workflows, but I never delved into the topic. This chapter fixes that.
How to Read This Book

In general, I advise you to read this book in a linear fashion. Once a concept or command-line tool has been introduced, chances are that I employ it in a later chapter. For example, in Chapter 9, I make heavy use of parallel, which is discussed extensively in Chapter 8.

Data science is a broad field that intersects many other fields, such as programming, data visualization, and machine learning. As a result, this book touches on many interesting topics that unfortunately cannot be discussed at great length. At the end of each chapter, I provide suggestions for further exploration. It's not required that you read this material in order to follow along with the book, but if you are interested, just know that there's much more to learn.
Who This Book Is For

This book makes just one assumption about you: that you work with data. It doesn't matter which programming language or statistical computing environment you're currently using. The book explains all the necessary concepts from the beginning.

It also doesn't matter whether your operating system is Microsoft Windows, macOS, or some flavor of Linux. The book comes with a Docker image, which is an easy-to-install virtual environment. It allows you to run the command-line tools and follow along with the code examples in the same environment in which this book was written. You don't have to waste time figuring out how to install all the command-line tools and their dependencies.

The book contains some code in Bash, Python, and R, so it's helpful if you have some programming experience, but it's by no means required to follow along with the examples.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, directory names, and filenames.

Constant width
Used for code and commands, as well as within paragraphs to refer to command-line tools and their options.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.

O'Reilly Online Learning

For more than 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O'Reilly's online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this at https://oreil.ly/data-science-at-cl.

Email bookquestions@oreilly.com to comment or ask technical questions about this book. The author also maintains a version of the book online.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia

Acknowledgments for the Second Edition (2021)

Seven years have passed since the first edition came out. During this time, and especially during the last 13 months, many people have helped me. Without them, I would never have been able to write a second edition.

I was once again blessed with three wonderful editors at O'Reilly. I would like to thank Sarah "Embrace the deadline" Grey, Jess "Pedal to the metal" Haberman, and Kate "Let it go" Galloway. Their middle names say it all. With their incredible help, I was able to embrace the deadlines, put the pedal to the metal when it mattered, and eventually let it go.
I'd also like to thank their colleagues Angela Rufino, Arthur Johnson, Cassandra Furtado, David Futato, Helen Monroe, Karen Montgomery, Kate Dullea, Kristen Brown, Marie Beaugureau, Marsee Henon, Nick Adams, Regina Wilkinson, Shannon Cutt, Shannon Turlington, and Yasmina Greco for making the collaboration with O'Reilly such a pleasure.

Despite having an automated process to execute the code and paste back the results (thanks to R Markdown and Docker), the number of mistakes I was able to make is impressive. Thank you, Aaditya Maruthi, Brian Eoff, Caitlin Hudon, Julia Silge, Mike Dewar, and Shane Reustle, for reducing this number immensely. Of course, any mistakes left are my responsibility.

Marc Canaleta deserves a special thank-you. In October 2014, shortly after the first edition came out, Marc invited me to give a one-day workshop about Data Science at the Command Line to his team at Social Point in Barcelona. Little did we both know that many workshops would follow. It eventually led me to start my own company: Data Science Workshops. Every time I teach, I learn something new. They probably don't know it, but each student has had an impact, in one way or another, on this book. To them I say: thank you. I hope I can teach for a very long time.

Captivating conversations, splendid suggestions, and passionate pull requests. I greatly appreciate each and every contribution by the following generous people:
Adam Johnson, Andre Manook, Andrea Borruso, Andres Lowrie, Andrew Berisha, Andrew Gallant, Andrew Sanchez, Anicet Ebou, Anthony Egerton, Ben Isenhart, Chris Wiggins, Chrys Wu, Dan Nguyen, Darryl Amatsetam, Dmitriy Rozhkov, Doug Needham, Edgar Manukyan, Erik Swan, Felienne Hermans, George Kampolis, Giel van Lankveld, Greg Wilson, Hay Kranen, Ioannis Cherouvim, Jake Hofman, Jannes Muenchow, Jared Lander, Jay Roaf, Jeffrey Perkel, Jim Hester, Joachim Hagege, Joel Grus, John Cook, John Sandall, Joost Helberg, Joost van Dijk, Joyce Robbins, Julian Hatwell, Karlo Guidoni, Karthik Ram, Lissa Hyacinth, Longhow Lam, Lui Pillmann, Lukas Schmid, Luke Reding, Maarten van Gompel, Martin Braun, Max Schelker, Max Shron, Nathan Furnal, Noah Chase, Oscar Chic, Paige Bailey, Peter Saalbrink, Rich Pauloo, Richard Groot, Rico Huijbers, Rob Doherty, Robbert van Vlijmen, Russell Scudder, Sylvain Lapoix, TJ Lavelle, Tan Long, Thomas Stone, Tim O'Reilly, Vincent Warmerdam, and Yihui Xie.

Throughout this book, and especially in the footnotes and appendix, you'll find hundreds of names. These names belong to the authors of the many tools, books, and other resources on which this book stands. I'm incredibly grateful for their hard work, regardless of whether that work was done 50 years or 50 days ago.

Above all, I would like to thank my wife Esther, my daughter Florien, and my son Olivier for reminding me daily what truly matters. I promise it'll be a few years before I start writing the third edition.

Acknowledgments for the First Edition (2014)

First of all, I'd like to thank Mike Dewar and Mike Loukides for believing that my blog post, "7 Command-Line Tools for Data Science", which I wrote in September 2013, could be expanded into a book.

Special thanks to my technical reviewers Mike Dewar, Brian Eoff, and Shane Reustle for reading various drafts, meticulously testing all the commands, and providing invaluable feedback. Your efforts have improved the book greatly. Any remaining errors are entirely my own responsibility.

I had the privilege of working with three amazing editors: Ann Spencer, Julie Steele, and Marie Beaugureau. Thank you for your guidance and for being such great liaisons with the many talented people at O'Reilly.
Those people include Laura Baldwin, Huguette Barriere, Sophia DeMartini, Yasmina Greco, Rachel James, Ben Lorica, Mike Loukides, and Christopher Pappas. There are many others whom I haven't met because they are operating behind the scenes. Together they ensured that working with O'Reilly has truly been a pleasure.

This book discusses more than 80 command-line tools. Needless to say, without these tools, this book wouldn't have existed in the first place. I'm therefore extremely grateful to all the authors who created and contributed to these tools. The complete list of authors is unfortunately too long to include here; they are mentioned in the Appendix. Thanks especially to Aaron Crow, Jehiah Czebotar, Christoph Groskopf, Dima Kogan, Sergey Lisitsyn, Francisco J. Martin, and Ole Tange for providing help with their amazing command-line tools.

Eric Postma and Jaap van den Herik, who supervised me during my PhD program, deserve special thanks. Over the course of five years they taught me many lessons. Although writing a technical book is quite different from writing a PhD thesis, many of those lessons proved to be very helpful in the past nine months as well.

Finally, I'd like to thank my colleagues at YPlan, my friends, my family, and especially my wife, Esther, for supporting me and for pulling me away from the command line at just the right times.