Data Wrangling with JavaScript
Ashley Davis
MANNING
Data Wrangling with JavaScript
ASHLEY DAVIS
MANNING
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road, PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2019 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine. ∞

Manning Publications Co.
20 Baldwin Road, PO Box 761
Shelter Island, NY 11964

Development editor: Helen Stergius
Technical development editor: Luis Atencio
Review editor: Ivan Martinović
Project manager: Deirdre Hiam
Copy editor: Katie Petito
Proofreader: Charles Hutchinson
Technical proofreader: Kathleen Estrada
Typesetting: Happenstance Type-O-Rama
Cover designer: Marija Tudor

ISBN 9781617294846
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – SP – 23 22 21 20 19 18
brief contents

1  Getting started: establishing your data pipeline
2  Getting started with Node.js
3  Acquisition, storage, and retrieval
4  Working with unusual data
5  Exploratory coding
6  Clean and prepare
7  Dealing with huge data files
8  Working with a mountain of data
9  Practical data analysis
10 Browser-based visualization
11 Server-side visualization
12 Live data
13 Advanced visualization with D3
14 Getting to production
contents

preface
acknowledgments
about this book
about the author
about the cover illustration

1 Getting started: establishing your data pipeline
  1.1 Why data wrangling?
  1.2 What’s data wrangling?
  1.3 Why a book on JavaScript data wrangling?
  1.4 What will you get out of this book?
  1.5 Why use JavaScript for data wrangling?
  1.6 Is JavaScript appropriate for data analysis?
  1.7 Navigating the JavaScript ecosystem
  1.8 Assembling your toolkit
  1.9 Establishing your data pipeline
      Setting the stage ■ The data-wrangling process ■ Planning ■ Acquisition, storage, and retrieval ■ Exploratory coding ■ Clean and prepare ■ Analysis ■ Visualization ■ Getting to production

2 Getting started with Node.js
  2.1 Starting your toolkit
  2.2 Building a simple reporting system
  2.3 Getting the code and data
      Viewing the code ■ Downloading the code ■ Installing Node.js ■ Installing dependencies ■ Running Node.js code ■ Running a web application ■ Getting the data ■ Getting the code for chapter 2
  2.4 Installing Node.js
      Checking your Node.js version
  2.5 Working with Node.js
      Creating a Node.js project ■ Creating a command-line application ■ Creating a code library ■ Creating a simple web server
  2.6 Asynchronous coding
      Loading a single file ■ Loading multiple files ■ Error handling ■ Asynchronous coding with promises ■ Wrapping asynchronous operations in promises ■ Async coding with “async” and “await”

3 Acquisition, storage, and retrieval
  3.1 Building out your toolkit
  3.2 Getting the code and data
  3.3 The core data representation
      The earthquakes website ■ Data formats covered ■ Power and flexibility
  3.4 Importing data
      Loading data from text files ■ Loading data from a REST API ■ Parsing JSON text data ■ Parsing CSV text data ■ Importing data from databases ■ Importing data from MongoDB ■ Importing data from MySQL
  3.5 Exporting data
      You need data to export! ■ Exporting data to text files ■ Exporting data to JSON text files ■ Exporting data to CSV text files ■ Exporting data to a database ■ Exporting data to MongoDB ■ Exporting data to MySQL
  3.6 Building complete data conversions
  3.7 Expanding the process

4 Working with unusual data
  4.1 Getting the code and data
  4.2 Importing custom data from text files
  4.3 Importing data by scraping web pages
      Identifying the data to scrape ■ Scraping with Cheerio
  4.4 Working with binary data
      Unpacking a custom binary file ■ Packing a custom binary file ■ Replacing JSON with BSON ■ Converting JSON to BSON ■ Deserializing a BSON file

5 Exploratory coding
  5.1 Expanding your toolkit
  5.2 Analyzing car accidents
  5.3 Getting the code and data
  5.4 Iteration and your feedback loop
  5.5 A first pass at understanding your data
  5.6 Working with a reduced data sample
  5.7 Prototyping with Excel
  5.8 Exploratory coding with Node.js
      Using Nodemon ■ Exploring your data ■ Using Data-Forge ■ Computing the trend column ■ Outputting a new CSV file
  5.9 Exploratory coding in the browser
  5.10 Putting it all together

6 Clean and prepare
  6.1 Expanding our toolkit
  6.2 Preparing the reef data
  6.3 Getting the code and data
  6.4 The need for data cleanup and preparation
  6.5 Where does broken data come from?
  6.6 How does data cleanup fit into the pipeline?
  6.7 Identifying bad data
  6.8 Kinds of problems
  6.9 Responses to bad data
  6.10 Techniques for fixing bad data
  6.11 Cleaning our data set
      Rewriting bad rows ■ Filtering rows of data ■ Filtering columns of data
  6.12 Preparing our data for effective use
      Aggregating rows of data ■ Combining data from different files using globby ■ Splitting data into separate files
  6.13 Building a data processing pipeline with Data-Forge

7 Dealing with huge data files
  7.1 Expanding our toolkit
  7.2 Fixing temperature data
  7.3 Getting the code and data
  7.4 When conventional data processing breaks down
  7.5 The limits of Node.js
      Incremental data processing ■ Incremental core data representation ■ Node.js file streams basics primer ■ Transforming huge CSV files ■ Transforming huge JSON files ■ Mix and match

8 Working with a mountain of data
  8.1 Expanding our toolkit
  8.2 Dealing with a mountain of data
  8.3 Getting the code and data
  8.4 Techniques for working with big data
      Start small ■ Go back to small ■ Use a more efficient representation ■ Prepare your data offline
  8.5 More Node.js limitations
  8.6 Divide and conquer
  8.7 Working with large databases
      Database setup ■ Opening a connection to the database ■ Moving large files to your database ■ Incremental processing with a database cursor ■ Incremental processing with data windows ■ Creating an index ■ Filtering using queries ■ Discarding data with projection ■ Sorting large data sets
  8.8 Achieving better data throughput
      Optimize your code ■ Optimize your algorithm ■ Processing data in parallel

9 Practical data analysis
  9.1 Expanding your toolkit
  9.2 Analyzing the weather data
  9.3 Getting the code and data
  9.4 Basic data summarization
      Sum ■ Average ■ Standard deviation
  9.5 Group and summarize
  9.6 The frequency distribution of temperatures
  9.7 Time series
      Yearly average temperature ■ Rolling average ■ Rolling standard deviation ■ Linear regression ■ Comparing time series ■ Stacking time series operations
  9.8 Understanding relationships
      Detecting correlation with a scatter plot ■ Types of correlation ■ Determining the strength of the correlation ■ Computing the correlation coefficient

10 Browser-based visualization
  10.1 Expanding your toolkit
  10.2 Getting the code and data
  10.3 Choosing a chart type
  10.4 Line chart for New York City temperature
      The most basic C3 line chart ■ Adding real data ■ Parsing the static CSV file ■ Adding years as the X axis ■ Creating a custom Node.js web server ■ Adding another series to the chart ■ Adding a second Y axis to the chart ■ Rendering a time series chart
  10.5 Other chart types with C3
      Bar chart ■ Horizontal bar chart ■ Pie chart ■ Stacked bar chart ■ Scatter plot chart
  10.6 Improving the look of our charts
  10.7 Moving forward with your own projects

11 Server-side visualization
  11.1 Expanding your toolkit
  11.2 Getting the code and data
  11.3 The headless browser
  11.4 Using Nightmare for server-side visualization
      Why Nightmare? ■ Nightmare and Electron ■ Our process: capturing visualizations with Nightmare ■ Prepare a visualization to render ■ Starting the web server ■ Procedurally start and stop the web server ■ Rendering the web page to an image ■ Before we move on . . . ■ Capturing the full visualization ■ Feeding the chart with data ■ Multipage reports ■ Debugging code in the headless browser ■ Making it work on a Linux server
  11.5 You can do much more with a headless browser
      Web scraping ■ Other uses

12 Live data
  12.1 We need an early warning system
  12.2 Getting the code and data
  12.3 Dealing with live data
  12.4 Building a system for monitoring air quality
  12.5 Set up for development
  12.6 Live-streaming data
      HTTP POST for infrequent data submission ■ Sockets for high-frequency data submission
  12.7 Refactor for configuration
  12.8 Data capture
  12.9 An event-based architecture
  12.10 Code restructure for event handling
      Triggering SMS alerts ■ Automatically generating a daily report
  12.11 Live data processing
  12.12 Live visualization

13 Advanced visualization with D3
  13.1 Advanced visualization
  13.2 Getting the code and data
  13.3 Visualizing space junk
  13.4 What is D3?
  13.5 The D3 data pipeline
  13.6 Basic setup
  13.7 SVG crash course
      SVG circle ■ Styling ■ SVG text ■ SVG group
  13.8 Building visualizations with D3
      Element state ■ Selecting elements ■ Manually adding elements to our visualization ■ Scaling to fit ■ Procedural generation the D3 way ■ Loading a data file ■ Color-coding the space junk ■ Adding interactivity ■ Adding a year-by-year launch animation

14 Getting to production
  14.1 Production concerns
  14.2 Taking our early warning system to production
  14.3 Deployment
  14.4 Monitoring
  14.5 Reliability
      System longevity ■ Practice defensive programming ■ Data protection ■ Testing and automation ■ Handling unexpected errors ■ Designing for process restart ■ Dealing with an ever-growing database
  14.6 Security
      Authentication and authorization ■ Privacy and confidentiality ■ Secret configuration
  14.7 Scaling
      Measurement before optimization ■ Vertical scaling ■ Horizontal scaling

appendix A JavaScript cheat sheet
appendix B Data-Forge cheat sheet
appendix C Getting started with Vagrant
index
preface

Data is all around us and growing at an ever-increasing rate. It’s more important than ever before for businesses to deal with data quickly and effectively to understand their customers, monitor their processes, and support decision-making.

If Python and R are the kings of the data world, why, then, should you use JavaScript instead? What role does it play in business, and why do you need to read Data Wrangling with JavaScript?

I’ve used JavaScript myself in various situations. I started with it when I was a game developer building our UIs with web technologies. I soon graduated to Node.js backends to manage collection and processing of metrics and telemetry. We also created analytics dashboards to visualize the data we collected. By this stage we were doing full-stack JavaScript to support the company’s products.

My job at the time was creating game-like 3D simulations of construction and engineering projects, so we also dealt with large amounts of data from construction logistics, planning, and project schedules. I naturally veered toward JavaScript for wrangling and analysis of the data that came across my desk. As a sideline, I was also algorithmically analyzing and trading stocks, a pursuit where data analysis is indispensable!

Exploratory coding in JavaScript allowed me to explore, transform, and analyze my data, but at the same time I was producing useful code that could later be rolled out to our production environment. This was a productivity win: rather than using Python and then having to rewrite parts of the work in JavaScript, I did it all in JavaScript. That might seem like the obvious choice to you, but at the time the conventional wisdom said that this kind of work should be done in Python. Because there wasn’t much information or many resources out there, I had to learn this stuff for myself, and I learned it the hard way.
I wanted to write this book to document what I learned, and I hope to make life a bit easier for those who come after me.
In addition, I really like working in JavaScript. I find it to be a practical and capable language with a large ecosystem and an ever-growing maturity. I also like the fact that JavaScript runs almost everywhere these days:

■ Server ✓
■ Browser ✓
■ Mobile ✓
■ Desktop ✓

My dream (and the promise of JavaScript) was to write code once and run it in any kind of app. JavaScript makes this possible to a large extent. Because JavaScript can be used almost anywhere and for anything, my goal in writing this book is to add one more purpose:

■ Data wrangling and analysis ✓
acknowledgments

In Data Wrangling with JavaScript I share my years of hard-won experience with you. Such experience wouldn’t be possible without having worked for and with a broad range of people and companies. I’d especially like to thank one company: the one where I started using JavaScript, began my data-wrangling journey in JavaScript, learned much, and had many growth experiences. Thanks to Real Serious Games for giving me that opportunity.

Thank you to Manning, who have made this book possible. Thanks especially to Helen Stergius, who was very patient with this first-time author and all the mistakes I’ve made. She was instrumental in helping draw this book out of my brain. Also, a thank you to the entire Manning team for all their efforts on the project: Cheryl Weisman, Deirdre Hiam, Katie Petito, Charles Hutchinson, Nichole Beard, Mike Stephens, Mary Piergies, and Marija Tudor.

Thanks also go to my reviewers, especially Artem Kulakov and Sarah Smith, friends of mine in the industry who read the book and gave feedback. Ultimately, their encouragement helped provide the motivation I needed to get it finished.

In addition, I’d like to thank all the reviewers: Ahmed Chicktay, Alex Basile, Alex Jacinto, Andriy Kharchuk, Arun Lakkakula, Bojan Djurkovic, Bryan Miller, David Blubaugh, David Krief, Deepu Joseph, Dwight Wilkins, Erika L. Bricker, Ethan Rivett, Gerald Mack, Harsh Raval, James Wang, Jeff Switzer, Joseph Tingsanchali, Luke Greenleaf, Peter Perlepes, Rebecca Jones, Sai Ram Kota, Sebastian Maier, Sowmya Vajjala, Ubaldo Pescatore, Vlad Navitski, and Zhenyang Hua. Special thanks also to Kathleen Estrada, the technical proofreader.

Big thanks also go to my partner, Antonella, without whose support and encouragement this book wouldn’t have happened.
Finally, I’d like to say thank you to the JavaScript community—to anyone who works for the betterment of the community and ecosystem. It’s your participation that has made JavaScript and its environment such an amazing place to work. Working together, we can move JavaScript forward and continue to build its reputation. We’ll evolve and improve the JavaScript ecosystem for the benefit of all.
about this book

The world of data is big, and it can be difficult to navigate on your own. Let Data Wrangling with JavaScript be your guide to working with data in JavaScript.

Data Wrangling with JavaScript is a practical, hands-on, and extensive guide to working with data in JavaScript. It describes the process of development in detail—you’ll feel like you’re actually doing the work yourself as you read the book.

The book has a broad coverage of the tools, techniques, and design patterns that you need to be effective with data in JavaScript. Through the book you’ll learn how to apply these skills and build a functioning data pipeline that includes all stages of data wrangling, from data acquisition through to visualization.

This book can’t cover everything, because it’s a broad subject in an evolving field, but one of its main aims is to help you build and manage your own toolkit of data-wrangling tools. Not only will you be able to build a data pipeline after reading this book, you’ll also be equipped to navigate this complex and growing ecosystem, to evaluate the many tools and libraries out there that can help bootstrap or extend your system, and to get your own development moving more quickly.

Who should read this book

This book is aimed at intermediate JavaScript developers who want to up-skill in data wrangling. To get the most out of this book, you should already be comfortable working in one of the popular JavaScript development platforms, such as the browser, Node.js, Electron, or Ionic.

How much JavaScript do you need to know? Well, you should already know basic syntax and how to use JavaScript anonymous functions. This book uses the concise arrow function syntax in Node.js code and the traditional syntax (for backward compatibility) in browser-based code.
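If you're unsure of the distinction between those two styles, here is a small illustrative sketch (not taken from the book's own code) showing the same anonymous function written both ways:

```javascript
// Traditional function syntax, as used in this book's browser-based
// code for backward compatibility with older browsers.
var squareTraditional = function (n) {
    return n * n;
};

// Concise arrow function syntax, as used in this book's Node.js code.
const squareArrow = n => n * n;

console.log(squareTraditional(4)); // prints 16
console.log(squareArrow(4));       // prints 16
```

If you can read both forms comfortably, you have all the syntax background this book assumes.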