Data Science from Scratch First Principles with Python (Joel Grus) (Z-Library)

Author: Joel Grus

技术

No Description

📄 File Format: PDF
💾 File Size: 4.2 MB
21
Views
0
Downloads
0.00
Total Donations

📄 Text Preview (First 20 pages)

ℹ️

Registered users can read the full content for free

Register as a Gaohf Library member to read the complete e-book online for free and enjoy a better reading experience.

📄 Page 1
(This page has no text content)
📄 Page 2
Data Science from Scratch SECOND EDITION First Principles with Python Joel Grus Data Science from Scratch by Joel Grus Copyright © 2019 Joel Grus. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles ( http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Editor: Michele Cronin Production Editor: Deborah Baker Copy Editor: Rachel Monaghan
📄 Page 3
Proofreader: Rachel Head Indexer: Judy McConville Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Rebecca Demarest April 2015: First Edition May 2019: Second Edition Revision History for the Second Edition 2019-04-10: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492041139 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Science from Scratch, Second Edition, the cover image of a rock ptarmigan, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
📄 Page 4
978-1-492-04113-9 [LSI] Preface to the Second Edition I am exceptionally proud of the first edition of Data Science from Scratch. It turned out very much the book I wanted it to be. But several years of developments in data science, of progress in the Python ecosystem, and of personal growth as a developer and educator have changed what I think a first book in data science should look like. In life, there are no do-overs. In writing, however, there are second editions. Accordingly, I’ve rewritten all the code and examples using Python 3.6 (and many of its newly introduced features, like type annotations). I’ve woven into the book an emphasis on writing clean code. I’ve replaced some of the first edition’s toy examples with more realistic ones using “real” datasets. I’ve added new material on topics such as deep learning, statistics, and natural language processing, corresponding to things that today’s data scientists are likely to be working with. (I’ve also removed some material that seems less relevant.) And I’ve gone over the book with a fine-toothed comb, fixing bugs, rewriting explanations that are less clear than they could be, and freshening up some of the jokes. The first edition was a great book, and this edition is even better. Enjoy! Joel Grus
📄 Page 5
Seattle, WA 2019 Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values determined by context. TIP This element signifies a tip or suggestion. NOTE This element signifies a general note. WARNING
📄 Page 6
This element indicates a warning or caution. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/joelgrus/data-science-from-scratch. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Data Science from Scratch, Second Edition, by Joel Grus (O’Reilly). Copyright 2019 Joel Grus, 978-1-492-04113-9.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com. O’Reilly Online Learning NOTE
📄 Page 7
For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/data- science-from-scratch-2e.
📄 Page 8
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com. For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Acknowledgments First, I would like to thank Mike Loukides for accepting my proposal for this book (and for insisting that I pare it down to a reasonable size). It would have been very easy for him to say, “Who’s this person who keeps emailing me sample chapters, and how do I get him to go away?” I’m grateful he didn’t. I’d also like to thank my editors, Michele Cronin and Marie Beaugureau, for guiding me through the publishing process and getting the book in a much better state than I ever would have gotten it on my own. I couldn’t have written this book if I’d never learned data science, and I probably wouldn’t have learned data science if not for the influence of Dave Hsu, Igor Tatarinov, John Rauser, and the rest of the Farecast gang. (So long ago that it wasn’t even called data science at the time!) The good folks at Coursera and DataTau deserve a lot of credit, too. I am also grateful to my beta readers and reviewers. Jay Fundling found a ton of mistakes and pointed out many unclear explanations, and the book is
📄 Page 9
much better (and much more correct) thanks to him. Debashis Ghosh is a hero for sanity-checking all of my statistics. Andrew Musselman suggested toning down the “people who prefer R to Python are moral reprobates” aspect of the book, which I think ended up being pretty good advice. Trey Causey, Ryan Matthew Balfanz, Loris Mularoni, Núria Pujol, Rob Jefferson, Mary Pat Campbell, Zach Geary, Denise Mauldin, Jimmy O’Donnell, and Wendy Grus also provided invaluable feedback. Thanks to everyone who read the first edition and helped make this a better book. Any errors remaining are of course my responsibility. I owe a lot to the Twitter #datascience commmunity, for exposing me to a ton of new concepts, introducing me to a lot of great people, and making me feel like enough of an underachiever that I went out and wrote a book to compensate. Special thanks to Trey Causey (again), for (inadvertently) reminding me to include a chapter on linear algebra, and to Sean J. Taylor, for (inadvertently) pointing out a couple of huge gaps in the “Working with Data” chapter. Above all, I owe immense thanks to Ganga and Madeline. The only thing harder than writing a book is living with someone who’s writing a book, and I couldn’t have pulled it off without their support. Preface to the First Edition Data Science Data scientist has been called “the sexiest job of the 21st century,”
📄 Page 10
presumably by someone who has never visited a fire station. Nonetheless, data science is a hot and growing field, and it doesn’t take a great deal of sleuthing to find analysts breathlessly prognosticating that over the next 10 years, we’ll need billions and billions more data scientists than we currently have. But what is data science? After all, we can’t produce data scientists if we don’t know what data science is. According to a Venn diagram that is somewhat famous in the industry, data science lies at the intersection of: Hacking skills Math and statistics knowledge Substantive expertise Although I originally intended to write a book covering all three, I quickly realized that a thorough treatment of “substantive expertise” would require tens of thousands of pages. At that point, I decided to focus on the first two. My goal is to help you develop the hacking skills that you’ll need to get started doing data science. And my goal is to help you get comfortable with the mathematics and statistics that are at the core of data science. This is a somewhat heavy aspiration for a book. The best way to learn hacking skills is by hacking on things. By reading this book, you will get a good understanding of the way I hack on things, which may not necessarily be the best way for you to hack on things. You will get a good understanding of some of the tools I use, which will not necessarily be the best tools for you to use. You will get a good understanding of the way I approach data problems, which may not necessarily be the best way for you
📄 Page 11
to approach data problems. The intent (and the hope) is that my examples will inspire you to try things your own way. All the code and data from the book is available on GitHub to get you started. Similarly, the best way to learn mathematics is by doing mathematics. This is emphatically not a math book, and for the most part, we won’t be “doing mathematics.” However, you can’t really do data science without some understanding of probability and statistics and linear algebra. This means that, where appropriate, we will dive into mathematical equations, mathematical intuition, mathematical axioms, and cartoon versions of big mathematical ideas. I hope that you won’t be afraid to dive in with me. Throughout it all, I also hope to give you a sense that playing with data is fun, because, well, playing with data is fun! (Especially compared to some of the alternatives, like tax preparation or coal mining.) From Scratch There are lots and lots of data science libraries, frameworks, modules, and toolkits that efficiently implement the most common (as well as the least common) data science algorithms and techniques. If you become a data scientist, you will become intimately familiar with NumPy, with scikit- learn, with pandas, and with a panoply of other libraries. They are great for doing data science. But they are also a good way to start doing data science without actually understanding data science. In this book, we will be approaching data science from scratch. That means we’ll be building tools and implementing algorithms by hand in order to
📄 Page 12
better understand them. I put a lot of thought into creating implementations and examples that are clear, well commented, and readable. In most cases, the tools we build will be illuminating but impractical. They will work well on small toy datasets but fall over on “web-scale” ones. Throughout the book, I will point you to libraries you might use to apply these techniques to larger datasets. But we won’t be using them here. There is a healthy debate raging over the best language for learning data science. Many people believe it’s the statistical programming language R. (We call those people wrong.) A few people suggest Java or Scala. However, in my opinion, Python is the obvious choice. Python has several features that make it well suited for learning (and doing) data science: It’s free. It’s relatively simple to code in (and, in particular, to understand). It has lots of useful data science–related libraries. I am hesitant to call Python my favorite programming language. There are other languages I find more pleasant, better designed, or just more fun to code in. And yet pretty much every time I start a new data science project, I end up using Python. Every time I need to quickly prototype something that just works, I end up using Python. And every time I want to demonstrate data science concepts in a clear, easy-to-understand way, I end up using Python. Accordingly, this book uses Python. The goal of this book is not to teach you Python. (Although it is nearly
📄 Page 13
certain that by reading this book you will learn some Python.) I’ll take you through a chapter-long crash course that highlights the features that are most important for our purposes, but if you know nothing about programming in Python (or about programming at all), then you might want to supplement this book with some sort of “Python for Beginners” tutorial. The remainder of our introduction to data science will take this same approach—going into detail where going into detail seems crucial or illuminating, at other times leaving details for you to figure out yourself (or look up on Wikipedia). Over the years, I’ve trained a number of data scientists. While not all of them have gone on to become world-changing data ninja rockstars, I’ve left them all better data scientists than I found them. And I’ve grown to believe that anyone who has some amount of mathematical aptitude and some amount of programming skill has the necessary raw materials to do data science. All she needs is an inquisitive mind, a willingness to work hard, and this book. Hence this book. Chapter 1. Introduction “Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.” —Arthur Conan Doyle The Ascendance of Data We live in a world that’s drowning in data. Websites track every user’s
📄 Page 14
every click. Your smartphone is building up a record of your location and speed every second of every day. “Quantified selfers” wear pedometers-on- steroids that are always recording their heart rates, movement habits, diet, and sleep patterns. Smart cars collect driving habits, smart homes collect living habits, and smart marketers collect purchasing habits. The internet itself represents a huge graph of knowledge that contains (among other things) an enormous cross-referenced encyclopedia; domain-specific databases about movies, music, sports results, pinball machines, memes, and cocktails; and too many government statistics (some of them nearly true!) from too many governments to wrap your head around. Buried in these data are answers to countless questions that no one’s ever thought to ask. In this book, we’ll learn how to find them. What Is Data Science? There’s a joke that says a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician. (I didn’t say it was a good joke.) In fact, some data scientists are —for all practical purposes—statisticians, while others are fairly indistinguishable from software engineers. Some are machine learning experts, while others couldn’t machine-learn their way out of kindergarten. Some are PhDs with impressive publication records, while others have
📄 Page 15
never read an academic paper (shame on them, though). In short, pretty much no matter how you define data science, you’ll find practitioners for whom the definition is totally, absolutely wrong. Nonetheless, we won’t let that stop us from trying. We’ll say that a data scientist is someone who extracts insights from messy data. Today’s world is full of people trying to turn data into insight. For instance, the dating site OkCupid asks its members to answer thousands of questions in order to find the most appropriate matches for them. But it also analyzes these results to figure out innocuous-sounding questions you can ask someone to find out how likely someone is to sleep with you on the first date. Facebook asks you to list your hometown and your current location, ostensibly to make it easier for your friends to find and connect with you. But it also analyzes these locations to identify global migration patterns and where the fanbases of different football teams live. As a large retailer, Target tracks your purchases and interactions, both online and in-store. And it uses the data to predictively model which of its customers are pregnant, to better market baby-related purchases to them. In 2012, the Obama campaign employed dozens of data scientists who data- mined and experimented their way to identifying voters who needed extra attention, choosing optimal donor-specific fundraising appeals and
📄 Page 16
programs, and focusing get-out-the-vote efforts where they were most likely to be useful. And in 2016 the Trump campaign tested a staggering variety of online ads and analyzed the data to find what worked and what didn’t. Now, before you start feeling too jaded: some data scientists also occasionally use their skills for good—using data to make government more effective, to help the homeless, and to improve public health. But it certainly won’t hurt your career if you like figuring out the best way to get people to click on advertisements. Motivating Hypothetical: DataSciencester Congratulations! You’ve just been hired to lead the data science efforts at DataSciencester, the social network for data scientists. NOTE When I wrote the first edition of this book, I thought that “a social network for data scientists” was a fun, silly hypothetical. Since then people have actually created social networks for data scientists, and have raised much more money from venture capitalists than I made from my book. Most likely there is a valuable lesson here about silly data science hypotheticals and/or book publishing. Despite being for data scientists, DataSciencester has never actually invested in building its own data science practice. (In fairness, DataSciencester has never really invested in building its product either.) That will be your job! Throughout the book, we’ll be learning about data science concepts by solving problems that you encounter at work.
📄 Page 17
Sometimes we’ll look at data explicitly supplied by users, sometimes we’ll look at data generated through their interactions with the site, and sometimes we’ll even look at data from experiments that we’ll design. And because DataSciencester has a strong “not-invented-here” mentality, we’ll be building our own tools from scratch. At the end, you’ll have a pretty solid understanding of the fundamentals of data science. And you’ll be ready to apply your skills at a company with a less shaky premise, or to any other problems that happen to interest you. Welcome aboard, and good luck! (You’re allowed to wear jeans on Fridays, and the bathroom is down the hall on the right.) Finding Key Connectors It’s your first day on the job at DataSciencester, and the VP of Networking is full of questions about your users. Until now he’s had no one to ask, so he’s very excited to have you aboard. In particular, he wants you to identify who the “key connectors” are among data scientists. To this end, he gives you a dump of the entire DataSciencester network. (In real life, people don’t typically hand you the data you need. Chapter 9 is devoted to getting data.) What does this data dump look like? It consists of a list of users, each represented by a dict that contains that user’s id (which is a number) and name (which, in one of the great cosmic coincidences, rhymes with the user’s id): users = [
📄 Page 18
{ "id": 0, "name": "Hero" }, { "id": 1, "name": "Dunn" }, { "id": 2, "name": "Sue" }, { "id": 3, "name": "Chi" }, { "id": 4, "name": "Thor" }, { "id": 5, "name": "Clive" }, { "id": 6, "name": "Hicks" }, { "id": 7, "name": "Devin" }, { "id": 8, "name": "Kate" }, { "id": 9, "name": "Klein" } ] He also gives you the “friendship” data, represented as a list of pairs of IDs: friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)] For example, the tuple (0, 1) indicates that the data scientist with id 0 (Hero) and the data scientist with id 1 (Dunn) are friends. The network is illustrated in Figure 1-1.
📄 Page 19
Figure 1-1. The DataSciencester network Having friendships represented as a list of pairs is not the easiest way to work with them. To find all the friendships for user 1, you have to iterate over every pair looking for pairs containing 1. If you had a lot of pairs, this would take a long time. Instead, let’s create a dict where the keys are user ids and the values are lists of friend ids. (Looking things up in a dict is very fast.) NOTE Don’t get too hung up on the details of the code right now. In Chapter 2, I’ll take you through a crash course in Python. For now just try to get the general flavor of what we’re doing. We’ll still have to look at every pair to create the dict, but we only have to do that once, and we’ll get cheap lookups after that: # Initialize the dict with an empty list for each user id: friendships = {user["id"]: [] for user in users} # And loop over the friendship pairs to populate it: for i, j in friendship_pairs: friendships[i].append(j) # Add j as a friend of user i
📄 Page 20
friendships[j].append(i) # Add i as a friend of user j Now that we have the friendships in a dict, we can easily ask questions of our graph, like “What’s the average number of connections?” First we find the total number of connections, by summing up the lengths of all the friends lists: def number_of_friends(user): """How many friends does _user_ have?""" user_id = user["id"] friend_ids = friendships[user_id] return len(friend_ids) total_connections = sum(number_of_friends(user) for user in users) # 24 And then we just divide by the number of users: num_users = len(users) # length of the users list avg_connections = total_connections / num_users # 24 / 10 == 2.4 It’s also easy to find the most connected people—they’re the people who have the largest numbers of friends. Since there aren’t very many users, we can simply sort them from “most friends” to “least friends”: # Create a list (user_id, number_of_friends). num_friends_by_id = [(user["id"], number_of_friends(user))
The above is a preview of the first 20 pages. Register to read the complete e-book.

💝 Support Author

0.00
Total Amount (¥)
0
Donation Count

Login to support the author

Login Now
Back to List