MINING SOCIAL MEDIA
Finding Stories in Internet Data

by Lam Thuy Vo

San Francisco
MINING SOCIAL MEDIA. Copyright © 2020 by Lam Thuy Vo.

Some rights reserved. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

ISBN-10: 1-59327-916-7
ISBN-13: 978-1-59327-916-5

Publisher: William Pollock
Production Editor: Meg Sneeringer
Cover Illustration: Gina Redman
Developmental Editors: Jan Cash and Alex Freed
Technical Reviewer: Melissa Lewis
Copyeditor: Rachel Monaghan
Compositor: Danielle Foster
Proofreader: Emelie Burnette
Indexer: Beth Nauman-Montana

For information on distribution, translations, or bulk sales, please contact No Starch Press, Inc. directly:

No Starch Press, Inc.
245 8th Street, San Francisco, CA 94103
phone: 1.415.863.9900; info@nostarch.com
www.nostarch.com

Library of Congress Cataloging-in-Publication Data:

Names: Vo, Lam Thuy, author.
Title: Mining social media : finding stories in Internet data / Lam Thuy Vo.
Description: San Francisco : No Starch Press, Inc., 2019. | Includes bibliographical references and index.
Identifiers: LCCN 2019030568 (print) | LCCN 2019030569 (ebook) | ISBN 9781593279165 (paperback) | ISBN 9781593279172 (ebook)
Subjects: LCSH: Social sciences--Research--Methodology. | Internet research. | Data mining. | Social media--Research. | Quantitative research. | Qualitative research.
Classification: LCC H61.95 .V63 2019 (print) | LCC H61.95 (ebook) | DDC 302.23/1072--dc23
LC record available at https://lccn.loc.gov/2019030568
LC ebook record available at https://lccn.loc.gov/2019030569

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked
name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
To má Lua, ba Liem, and anh Luan
About the Author

Lam Thuy Vo is a senior reporter at BuzzFeed News where her area of expertise is the intersection of technology, society, and social media data, and where she covers the spread of misinformation, hatred online, and platform-related accountability. Previously, she led teams and reported for The Wall Street Journal, Al Jazeera America, and NPR’s Planet Money, telling economic stories across the US and throughout Asia. She has also worked as an educator for a decade, developing newsroom-wide training programs, workshops for journalists around the world, and semester-long courses for the Craig Newmark CUNY Graduate School of Journalism. She has also spoken at Pop-Up Magazine, the Tribeca Film Festival’s Interactive Day, and TEDxNYC, among other larger events.

About the Technical Reviewer

Melissa Lewis is a data reporter for Reveal from The Center for Investigative Reporting. Prior to joining Reveal, she was a data editor at The Oregonian, a data engineer at Simple, a data analyst at Periscopic, and a neuroscience research assistant at Oregon Health & Science University. She is an organizer for PyLadies Portland and the Portland chapter of the Asian American Journalists Association.
BRIEF CONTENTS

Acknowledgments
Introduction

Part I: Data Mining
Chapter 1: The Programming Languages You’ll Need to Know
Chapter 2: Where to Get Your Data
Chapter 3: Getting Data with Code
Chapter 4: Scraping Your Own Facebook Data
Chapter 5: Scraping a Live Site

Part II: Data Analysis
Chapter 6: Introduction to Data Analysis
Chapter 7: Visualizing Your Data
Chapter 8: Advanced Tools for Data Analysis
Chapter 9: Finding Trends in Reddit Data
Chapter 10: Measuring the Twitter Activity of Political Actors
Chapter 11: Where to Go from Here

Index
CONTENTS IN DETAIL

Acknowledgments

Introduction
  What Is Data Analysis?
  Who Is This Book For?
  Conventions Used in This Book
  What This Book Covers
  Part I: Data Mining
  Part II: Data Analysis
  Downloading and Installing Python
  Installing on Windows
  Installing on macOS
  Getting Help When You’re Stuck
  Summary

PART I: DATA MINING

1 THE PROGRAMMING LANGUAGES YOU’LL NEED TO KNOW
  Frontend Languages
  How HTML Works
  How CSS Works
  How JavaScript Works
  Backend Languages
  Using Python
  Getting Started with Python
  Working with Numbers
  Working with Strings
  Storing Values in Variables
  Storing Multiple Values in Lists
  Working with Functions
  Creating Your Own Functions
  Using Loops
  Using Conditionals
  Summary

2 WHERE TO GET YOUR DATA
  What Is an API?
  Using an API to Get Data
  Getting a YouTube API Key
  Retrieving JSON Objects Using Your Credentials
  Answering a Research Question Using Data
  Refining the Data That Your API Returns
  Summary

3 GETTING DATA WITH CODE
  Writing Your First Script
  Running a Script
  Planning Out a Script
  Libraries and pip
  Creating a URL-based API Call
  Storing Data in a Spreadsheet
  Converting JSON into a Dictionary
  Going Back to the Script
  Running the Finished Script
  Dealing with API Pagination
  Templates: How to Make Your Code Reusable
  Storing Values That Change in Variables
  Storing Code in a Reusable Function
  Summary

4 SCRAPING YOUR OWN FACEBOOK DATA
  Your Data Sources
  Downloading Your Facebook Data
  Reviewing the Data and Inspecting the Code
  Structuring Information as Data
  Scraping Automatically
  Analyzing HTML Code to Recognize Patterns
  Grabbing the Elements You Need
  Extracting the Contents
  Writing Data into a Spreadsheet
  Building Your Rows List
  Writing to Your .csv File
  Running the Script
  Summary

5 SCRAPING A LIVE SITE
  Messy Data
  Ethical Considerations for Data Scraping
  The Robots Exclusion Protocol
  The Terms of Service
  Technical Considerations for Data Scraping
  Reasons for Scraping Data
  Scraping from a Live Website
  Analyzing the Page’s Contents
  Storing the Page Content in Variables
  Making the Script Reusable
  Practicing Polite Scraping
  Summary

PART II: DATA ANALYSIS

6 INTRODUCTION TO DATA ANALYSIS
  The Process of Data Analysis
  Bot Spotting
  Getting Started with Google Sheets
  Modifying and Formatting the Data
  Aggregating the Data
  Using Pivot Tables to Summarize Data
  Using Formulas to Do Math
  Sorting and Filtering the Data
  Merging Data Sets
  Other Ways to Use Google Sheets
  Summary

7 VISUALIZING YOUR DATA
  Understanding Our Bot Through Charts
  Choosing a Chart
  Specifying a Time Period
  Making a Chart
  Conditional Formatting
  Single-Color Formatting
  Color Scale Formatting
  Summary

8 ADVANCED TOOLS FOR DATA ANALYSIS
  Using Jupyter Notebook
  Setting Up a Virtual Environment
  Organizing the Notebook
  Installing Jupyter and Creating Your First Notebook
  Working with Cells
  What Is pandas?
  Working with Series and Data Frames
  Reading and Exploring Large Data Files
  Looking at the Data
  Viewing Specific Columns and Rows
  Summary

9 FINDING TRENDS IN REDDIT DATA
  Clarifying Our Research Objective
  Outlining a Method
  Narrowing the Data’s Scope
  Selecting Data from Specific Columns
  Handling Null Values
  Classifying the Data
  Summarizing the Data
  Sorting the Data
  Describing the Data
  Summary

10 MEASURING THE TWITTER ACTIVITY OF POLITICAL ACTORS
  Getting Started
  Setting Up Your Environment
  Loading the Data into Your Notebook
  Lambdas
  Filtering the Data Set
  Formatting the Data as datetimes
  Resampling the Data
  Plotting the Data
  Summary

11 WHERE TO GO FROM HERE
  Coding Styles
  Statistical Analysis
  Other Kinds of Analyses
  Conclusion

Index
ACKNOWLEDGMENTS

This is perhaps not a “thank you” but an acknowledgment of the people who once were part of my timeline and in one way or another broke my heart: without the pain there would not have been Quantified Breakup, a Tumblr of data visualizations about emotional resiliency as captured through one’s digital footprint. It was this project that essentially propelled my work into new directions: the exploration of social media data and “quantified selfies.” It was also during a talk about this project that the wonderful Jan Cash, an editor at No Starch Press at the time, approached me to write this book.

More importantly, there are those who remain in my timeline and who have been exceptionally supportive of me. Thanks to má Lua and ba Liem for making me an empathetic and curious tinkerer, to my brother Luan Vo Nguyen Quang and my sister-in-law Tiffany Talsma for their steadfast and years-long support from across all continents, to Cathy Deng and Jamica El for constant encouragement during my early Python days in the Bay Area, to Julia B. Chan, Lo Benichou, Aaron Williams, Ted Han and Andrew Tran for the camaraderie in an industry full of competitors, to John Wingenter, Adrienne Lopes, Vita Ayala, Mariru Kojima and Toyin Ojih Odutola for providing family far away from family, and to my niece Elynna Quynh Vo who’s the future.
INTRODUCTION

We experience the social web in brief moments that flash by, often without ever coming back to them. Whether we are liking a photo on Instagram, sharing a post that someone published on Facebook, or messaging a friend on WhatsApp, we do it once and likely don’t think about it after. But from swipes to clicks to status updates, our online lives are being captured by social media companies and used to fill some of the largest data servers in the world.

We are producing more data than ever before. By looking at these data points as a whole, we can gain tremendous insight into human behavior. We can also investigate the harm done by these systems, from detecting false online actors (for example, automated bot accounts or fake profiles that seed misinformation) to understanding how algorithms surface questionable content to viewers over time. If we look at these data points collectively, we can find patterns, trends, or anomalies and, hopefully, better understand the ways in which we consume and shape the human experience online.

This book aims to help those who want to go from simply observing the social web
one post or tweet at a time to understanding it on a larger, more meaningful scale.

What Is Data Analysis?

The main goal for any data analyst is to gain useful insights from large quantities of information. We can think of data analysis as a way to interview a vast number of records: we may ask about unusual single events, or we may be looking into long-term trends. Interviewing a data set can be a lengthy process with various twists and turns: it might take a few different approaches to find the answers to our questions, the same way it might take a few different meetings to get a good sense of an interviewee.

Even if our questions are simple and focused, getting to the answers can still require us to make several logical and philosophical decisions. What data set may be useful to examine our own behavior, and how would we get that data? If we wanted to determine the popularity of a Facebook post, would we measure that in number of reactions (likes, hahas, wows, and so forth), the number of comments it received, or a combination of both metrics? If we wanted to better understand how people discuss a specific topic on Twitter, what would be the best way to categorize tweets about it?

So while analyzing data takes a certain amount of technical know-how, it’s also a creative process that requires us to use our judgment in an intentional and informed way. In other words, data analysis is both science and art.

Who Is This Book For?

This book is written for people who have little to no previous programming experience. Given the huge role of social media, the internet, and technology in all of our lives, this book aims to explore them in an accessible and straightforward way. Through practical
exercises, you’ll learn the foundational concepts of programming, data analysis, and the social web.

On some level, this book is targeted to someone just like my former self: a person who was fiercely curious about the world but also intimidated by jargon-filled forums, conferences, and online tutorials.

We’ll take a macro and micro approach, looking at the ecosystem of the social web as well as the minutiae of writing code. Coding is more than just a way to build a bot or an app: it’s a way to satisfy your curiosity in a world that is increasingly dependent on technology.

Conventions Used in This Book

To access and understand data from social media, we need to learn where that data is stored, how to access it, and how we can make sense of it. In other words, analyzing data from the web involves multiple steps: gathering the data, researching and exploring it, and analyzing it. In the final step, we’ll also draw conclusions from the data and answer our questions about the human behavior and actions that produced it in the first place.

With all that in mind, it’s important to note that this book is not just a compilation of code snippets, ready to be plugged in and used. While it contains scripts that may help you gather and analyze data from the social web, it was first and foremost designed to teach the fundamental concepts and tools of the data analysis process. Think of the chapters as a step-by-step guide for aspiring researchers who are eager to investigate a specific topic or question.

My hope is that you’ll come out with the basics you need to start learning and exploring on your own in this field. After all, the landscape of social media is in constant flux, which means that you’ll need to be flexible and continually adapt your analytical approach to understanding the data it produces.

Similarly, conventions in this book were chosen and designed to prioritize your learning rather than the elegance of the code. For
instance, this code uses a lot of global variables. (Don’t panic! We’ll cover what variables are in the coming chapters.) While this may not be the most efficient way to code, it’s one that’s friendly to people who might be new to Python.

As for the tools covered, I had two main criteria. I tried to choose tools that are available for free on the web, and that have a relatively low barrier to entry, allowing beginners to get started with simple projects.

What This Book Covers

The chapters of this book are structured to follow the journey of a data sleuth. We’ll begin by covering how and where to find data from the social web. After all, we need data before we can go about analyzing it! Then, in the later chapters, you’ll learn about the tools necessary to process, explore, and analyze the data we’ve mined.

Part I: Data Mining

Chapter 1: The Programming Languages You’ll Need to Know
Introduces frontend languages (HTML, CSS, and JavaScript) and why they’re important within the context of social media data mining. You’ll also learn the basics of Python through practical exercises in the interactive shell.

Chapter 2: Where to Get Your Data
Explains what APIs are and what kind of data you can access through them, and walks you through accessing data in JSON format. This chapter also covers the process of formulating a research question for data analysis.

Chapter 3: Getting Data with Code
Shows you how to gather the data returned from the YouTube API and use Python to restructure it from JSON to a spreadsheet, specifically a .csv file.

Chapter 4: Scraping Your Own Facebook Data
Defines scraping and describes how to inspect HTML to structure content from web pages into data. It also covers data archives that social media
companies provide to users of their own data and shows you how to extract data into .csv files.

Chapter 5: Scraping a Live Site
Explains the ethical considerations of scraping websites and walks you through the process of writing a scraper for a Wikipedia page.

Part II: Data Analysis

Chapter 6: Introduction to Data Analysis
Covers the various processes involved in data analyses and introduces Google Sheets by analyzing data from an automated account, or bot.

Chapter 7: Visualizing Your Data
Explores how visualization tools (like making charts within Google Sheets and using conditional formatting to highlight data variations) can help us better understand our data.

Chapter 8: Advanced Tools for Data Analysis
Transfers concepts you learned from analyzing data in Google Sheets into the realm of programmatic analysis. You’ll see how to set up virtual environments in Python 3, navigate Jupyter Notebooks (a web application that is capable of reading and running Python code), and use the Python library pandas. You’ll also explore the structure and breadth of your data sets.

Chapter 9: Finding Trends in Reddit Data
Builds on the previous chapter to show you how to modify data, filter data, and run basic aggregation using functions in pandas.

Chapter 10: Measuring the Twitter Activity of Political Actors
Explains how to format data as timestamps, modify it more efficiently with lambda functions, and resample it temporally in pandas.

Chapter 11: Where to Go from Here
Lists resources for becoming a better Python coder, learning more about statistical