Deep Learning from Scratch
Building with Python from First Principles

Seth Weidman
Deep Learning from Scratch
by Seth Weidman

Copyright © 2019 Seth Weidman. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Development Editor: Melissa Potter
Acquisitions Editors: Jon Hassell and Mike Loukides
Production Editor: Katherine Tozer
Copyeditor: Arthur Johnson
Proofreader: Rachel Monaghan
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

September 2019: First Edition

Revision History for the First Edition
2019-09-06: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492041412 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Deep Learning from Scratch, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-04141-2

[LSI]
Preface

If you've tried to learn about neural networks and deep learning, you've probably encountered an abundance of resources, from blog posts to MOOCs (massive open online courses, such as those offered on Coursera and Udacity) of varying quality and even some books—I know I did when I started exploring the subject a few years ago. However, if you're reading this preface, it's likely that each explanation of neural networks that you've come across is lacking in some way. I found the same thing when I started learning: the various explanations were like blind men describing different parts of an elephant, but none describing the whole thing. That is what led me to write this book.

These existing resources on neural networks mostly fall into two categories. Some are conceptual and mathematical, containing both the drawings one typically finds in explanations of neural networks, of circles connected by lines with arrows on the ends, as well as extensive mathematical explanations of what is going on so you can "understand the theory." A prototypical example of this is the very good book Deep Learning by Ian Goodfellow et al. (MIT Press).

Other resources have dense blocks of code that, if run, appear to show a loss value decreasing over time and thus a neural network "learning." For instance, the following example from the PyTorch documentation does indeed define and train a simple neural network on randomly generated data:¹

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = 64, 1000, 100, 10

    # Create random input and output data
    x = torch.randn(N, D_in, device=device, dtype=dtype)
    y = torch.randn(N, D_out, device=device, dtype=dtype)
    # Randomly initialize weights
    w1 = torch.randn(D_in, H, device=device, dtype=dtype)
    w2 = torch.randn(H, D_out, device=device, dtype=dtype)

    learning_rate = 1e-6
    for t in range(500):
        # Forward pass: compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        y_pred = h_relu.mm(w2)

        # Compute and print loss
        loss = (y_pred - y).pow(2).sum().item()
        print(t, loss)

        # Backprop to compute gradients of w1 and w2 with respect to loss
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h_relu.t().mm(grad_y_pred)
        grad_h_relu = grad_y_pred.mm(w2.t())
        grad_h = grad_h_relu.clone()
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Update weights using gradient descent
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2

Explanations like this, of course, don't give much insight into "what is really going on": the underlying mathematical principles, the individual neural network components contained here and how they work together, and so on.

What would a good explanation of neural networks contain? For an answer, it is instructive to look at how other computer science concepts are explained: if you want to learn about sorting algorithms, for example, there are textbooks that will contain:

- An explanation of the algorithm, in plain English
- A visual explanation of how the algorithm works, of the kind that you would draw on a whiteboard during a coding interview
- Some mathematical explanation of "why the algorithm works"²
- Pseudocode implementing the algorithm

One rarely—or never—finds these elements of an explanation of neural networks side by side, even though it seems obvious to me that a proper explanation of neural networks should be done this way; this book is an attempt to fill that gap.

Understanding Neural Networks Requires Multiple Mental Models

I am not a researcher, and I do not have a Ph.D. I have, however, taught data science professionally: I taught a couple of data science bootcamps with a company called Metis, and then I traveled around the world for a year with Metis doing one- to five-day workshops for companies in many different industries in which I explained machine learning and basic software engineering concepts to their employees.

I've always loved teaching and have always been fascinated by the question of how best to explain technical concepts, most recently focusing on concepts in machine learning and statistics. With neural networks, I've found the most challenging part is conveying the correct "mental model" for what a neural network is, especially since understanding neural networks fully requires not just one but several mental models, all of which illuminate different (but still essential) aspects of how neural networks work. To illustrate this: the following four sentences are all correct answers to the question "What is a neural network?":

- A neural network is a mathematical function that takes in inputs and produces outputs.
- A neural network is a computational graph through which multidimensional arrays flow.
- A neural network is made up of layers, each of which can be thought of as having a number of "neurons."
- A neural network is a universal function approximator that can in theory represent the solution to any supervised learning problem.

Indeed, many of you reading this have probably heard one or more of these before, and may have a reasonable understanding of what they mean and what their implications are for how neural networks work. To fully understand them, however, we'll have to understand all of them and show how they are connected—how is the fact that a neural network can be represented as a computational graph connected to the notion of "layers," for example? Furthermore, to make all of this precise, we'll implement all of these concepts from scratch, in Python, and stitch them together to make working neural networks that you can train on your laptop.

Nevertheless, despite the fact that we'll spend a substantial amount of time on implementation details, the purpose of implementing these models in Python is to solidify and make precise our understanding of the concepts; it is not to write as concise or performant a neural network library as possible.

My goal is that after you've read this book, you'll have such a solid understanding of all of these mental models (and their implications for how neural networks should be implemented) that learning related concepts or doing further projects in the field will be much easier.
Chapter Outlines

The first three chapters are the most important ones and could themselves form a standalone book.

1. In Chapter 1 I'll show how mathematical functions can be represented as a series of operations linked together to form a computational graph, and show how this representation lets us compute the derivatives of these functions' outputs with respect to their inputs using the chain rule from calculus. At the end of this chapter, I'll introduce a very important operation, the matrix multiplication, and show how it can fit into a mathematical function represented in this way while still allowing us to compute the derivatives we'll end up needing for deep learning.

2. In Chapter 2 we'll directly use the building blocks we created in Chapter 1 to build and train models to solve a real-world problem: specifically, we'll use them to build both linear regression and neural network models to predict housing prices on a real-world dataset. I'll show that the neural network performs better than the linear regression and try to give some intuition for why. The "first principles" approach to building the models in this chapter should give you a very good idea of how neural networks work, but will also show the limited capability of the step-by-step, purely first-principles-based approach to defining deep learning models; this will motivate Chapter 3.

3. In Chapter 3 we'll take the building blocks from the first-principles-based approach of the first two chapters and use them to build the "higher level" components that make up all deep learning models: Layers, Models, Optimizers, and so on. We'll end this chapter by training a deep learning model, defined from scratch, on the same dataset from Chapter 2 and showing that it performs better than our simple neural network.
4. As it turns out, there are few theoretical guarantees that a neural network with a given architecture will actually find a good solution on a given dataset when trained using the standard training techniques we'll use in this book. In Chapter 4 we'll cover the most important "training tricks" that generally increase the probability that a neural network will find a good solution, and, wherever possible, give some mathematical intuition as to why they work.

5. In Chapter 5 I cover the fundamental ideas behind convolutional neural networks (CNNs), a kind of neural network architecture specialized for understanding images. There are many explanations of CNNs out there, so I'll focus on explaining the absolute essentials of CNNs and how they differ from regular neural networks: specifically, how CNNs result in each layer of neurons being organized into "feature maps," and how two of these layers (each made up of multiple feature maps) are connected together via convolutional filters. In addition, just as we coded the regular layers in a neural network from scratch, we'll code convolutional layers from scratch to reinforce our understanding of how they work.

6. Throughout the first five chapters, we'll build up a miniature neural network library that defines neural networks as a series of Layers—which are themselves made up of a series of Operations—that send inputs forward and gradients backward. This is not how most neural networks are implemented in practice; instead, they use a technique called automatic differentiation. I'll give a quick illustration of automatic differentiation at the beginning of Chapter 6 and use it to motivate the main subject of the chapter: recurrent neural networks (RNNs), the neural network architecture typically used for understanding data in which the data points appear sequentially, such as time series data or natural language data.
I'll explain the workings of "vanilla RNNs" and of two variants: GRUs and LSTMs (and of course implement all three from scratch); throughout, I'll be careful to distinguish between the elements that are shared across all of these RNN variants and the specific ways in which these variants differ.

7. Finally, in Chapter 7, I'll show how everything we did from scratch in Chapters 1–6 can be implemented using the high-performance, open source neural network library PyTorch. Learning a framework like this is essential for progressing your learning about neural networks; but diving in and learning a framework without first having a solid understanding of how and why neural networks work would severely limit your learning in the long term. The goal of the progression of chapters in this book is to give you the power to write extremely high-performance neural networks (by teaching you PyTorch) while still setting you up for long-term learning and success (by teaching you the fundamentals before you learn PyTorch). We'll conclude with a quick illustration of how neural networks can be used for unsupervised learning.

My goal here was to write the book that I wish had existed when I started to learn the subject a few years ago. I hope you will find this book helpful. Onward!

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Used for text that should be replaced with user-supplied values or by values determined by context and for comments in code examples.

The Pythagorean Theorem is a² + b² = c².

NOTE
This element signifies a general note.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at the book's GitHub repository.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Deep Learning from Scratch by Seth Weidman (O'Reilly). Copyright 2019 Seth Weidman, 978-1-492-04141-2."
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O'Reilly Online Learning

NOTE
For almost 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O'Reilly's online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/dl-from-scratch.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

I'd like to thank my editor, Melissa Potter, along with the team at O'Reilly, who were meticulous with their feedback and responsive to my questions throughout the process.

I'd like to give a special thanks to several people whose work to make technical concepts in machine learning accessible to a wider audience has directly influenced me, and a couple of whom I've been lucky enough to have gotten to know personally: in a randomly generated order, these people are Brandon Rohrer, Joel Grus, Jeremy Watt, and Andrew Trask.

I'd like to thank my boss at Metis and my director at Facebook, who were unreasonably supportive of my carving out time to work on this project.

I'd like to give a special thank you and acknowledgment to Mat Leonard, who was my coauthor for a brief period of time before we decided to go our separate ways. Mat helped organize the code for the minilibrary associated with the book—lincoln—and gave me very helpful feedback on some extremely unpolished versions of the first two chapters, writing his own versions of large sections of these chapters in the process.
Finally, I'd like to thank my friends Eva and John, both of whom directly encouraged and inspired me to take the plunge and actually start writing. I'd also like to thank my many friends in San Francisco who tolerated my general preoccupation and worry about the book as well as my lack of availability to hang out for many months, and who were unwaveringly supportive when I needed them to be.

1. To be fair, this example was intended as an illustration of the PyTorch library for those who already understand neural networks, not as an instructive tutorial. Still, many tutorials follow this style, showing only the code along with some brief explanations.

2. Specifically, in the case of sorting algorithms, why the algorithm terminates with a properly sorted list.
Chapter 1. Foundations

    Don't memorize these formulas. If you understand the concepts, you can invent your own notation.
    —John Cochrane, Investments Notes 2006

The aim of this chapter is to explain some foundational mental models that are essential for understanding how neural networks work. Specifically, we'll cover nested mathematical functions and their derivatives. We'll work our way up from the simplest possible building blocks to show that we can build complicated functions made up of a "chain" of constituent functions and, even when one of these functions is a matrix multiplication that takes in multiple inputs, compute the derivative of the functions' outputs with respect to their inputs. Understanding how this process works will be essential to understanding neural networks, which we technically won't begin to cover until Chapter 2.

As we're getting our bearings around these foundational building blocks of neural networks, we'll systematically describe each concept we introduce from three perspectives:

- Math, in the form of an equation or equations
- Code, with as little extra syntax as possible (making Python an ideal choice)
- A diagram explaining what is going on, of the kind you would draw on a whiteboard during a coding interview

As mentioned in the preface, one of the challenges of understanding neural networks is that it requires multiple mental models. We'll get a sense of that in this chapter: each of these three perspectives excludes certain essential features of the concepts we'll cover, and only when taken together do they provide a full picture of both how and why nested mathematical functions work the way they do. In fact, I take the uniquely strong view that any attempt to explain the building blocks of neural networks that excludes one of these three perspectives is incomplete.

With that out of the way, it's time to take our first steps. We're going to start with some extremely simple building blocks to illustrate how we can understand different concepts in terms of these three perspectives. Our first building block will be a simple but critical concept: the function.

Functions

What is a function, and how do we describe it? As with neural nets, there are several ways to describe functions, none of which individually paints a complete picture. Rather than trying to give a pithy one-sentence description, let's simply walk through the three mental models one by one, playing the role of the blind men feeling different parts of the elephant.

Math

Here are two examples of functions, described in mathematical notation:

    f₁(x) = x²
    f₂(x) = max(x, 0)

This notation says that the functions, which we arbitrarily call f₁ and f₂, take in a number x as input and transform it into either x² (in the first case) or max(x, 0) (in the second case).
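As a quick, concrete illustration (a plain-Python sketch of my own, not the NumPy versions we'll write later in this chapter), these two functions could be coded and evaluated as follows:

    def f1(x: float) -> float:
        # Return the square of the input
        return x ** 2

    def f2(x: float) -> float:
        # Return the input if it is positive, and 0 otherwise
        return max(x, 0)

    print(f1(3))   # 9
    print(f2(-2))  # 0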
Diagrams

One way of depicting functions is to:

1. Draw an x-y plane (where x refers to the horizontal axis and y refers to the vertical axis).
2. Plot a bunch of points, where the x-coordinates of the points are (usually evenly spaced) inputs of the function over some range, and the y-coordinates are the outputs of the function over that range.
3. Connect these plotted points.

This was first done by the French philosopher René Descartes, and it is extremely useful in many areas of mathematics, in particular calculus. Figure 1-1 shows the plot of these two functions.

Figure 1-1. Two continuous, mostly differentiable functions

However, there is another way to depict functions that isn't as useful when learning calculus but that will be very useful for us when thinking about deep learning models. We can think of functions as boxes that take in numbers as input and produce numbers as output, like minifactories that have their own internal rules for what happens to the input. Figure 1-2 shows both these functions described as general rules and how they operate on specific inputs.

Figure 1-2. Another way of looking at these functions

Code

Finally, we can describe these functions using code. Before we do, we should say a bit about the Python library on top of which we'll be writing our functions: NumPy.
Code caveat #1: NumPy

NumPy is a widely used Python library for fast numeric computation, the internals of which are mostly written in C. Simply put: the data we deal with in neural networks will always be held in a multidimensional array that is almost always either one-, two-, three-, or four-dimensional, but especially two- or three-dimensional. The ndarray class from the NumPy library allows us to operate on these arrays in ways that are both (a) intuitive and (b) fast. To take the simplest possible example: if we were storing our data in Python lists (or lists of lists), adding or multiplying the lists elementwise using normal syntax wouldn't work, whereas it does work for ndarrays:

    import numpy as np

    print("Python list operations:")
    a = [1,2,3]
    b = [4,5,6]
    print("a+b:", a+b)
    try:
        print(a*b)
    except TypeError:
        print("a*b has no meaning for Python lists")
    print()
    print("numpy array operations:")
    a = np.array([1,2,3])
    b = np.array([4,5,6])
    print("a+b:", a+b)
    print("a*b:", a*b)

    Python list operations:
    a+b: [1, 2, 3, 4, 5, 6]
    a*b has no meaning for Python lists

    numpy array operations:
    a+b: [5 7 9]
    a*b: [ 4 10 18]

ndarrays also have several features you'd expect from an n-dimensional array; each ndarray has n axes, indexed from 0, so that the first axis is 0, the second is 1, and so on. In particular, since we deal with 2D ndarrays often, we can think of axis = 0 as the rows and axis = 1 as the columns—see Figure 1-3.

Figure 1-3. A 2D NumPy array, with axis = 0 as the rows and axis = 1 as the columns

NumPy's ndarrays also support applying functions along these axes in intuitive ways. For example, summing along axis 0 (the rows for a 2D array) essentially "collapses the array" along that axis, returning an array with one less dimension than the original array; for a 2D array, this is equivalent to summing each column:

    a = np.array([[1,2],
                  [3,4]])

    print('a:')
    print(a)
    print('a.sum(axis=0):', a.sum(axis=0))
    print('a.sum(axis=1):', a.sum(axis=1))
    a:
    [[1 2]
     [3 4]]
    a.sum(axis=0): [4 6]
    a.sum(axis=1): [3 7]

Finally, NumPy ndarrays support adding a 1D array to the last axis; for a 2D array a with R rows and C columns, this means we can add a 1D array b of length C and NumPy will do the addition in the intuitive way, adding the elements to each row of a:

    a = np.array([[1,2,3],
                  [4,5,6]])

    b = np.array([10,20,30])

    print("a+b:\n", a+b)

    a+b:
     [[11 22 33]
     [14 25 36]]

Code caveat #2: Type-checked functions

As I've mentioned, the primary goal of the code we write in this book is to make the concepts I'm explaining precise and clear. This will get more challenging as the book goes on, as we'll be writing functions with many arguments as part of complicated classes. To combat this, we'll use functions with type signatures throughout; for example, in Chapter 3, we'll initialize our neural networks as follows:

    def __init__(self,
                 layers: List[Layer],
                 loss: Loss,
                 learning_rate: float = 0.01) -> None:

This type signature alone gives you some idea of what the class is used for. By contrast, consider the following type signature that we could use to define an operation:

    def operation(x1, x2):

This type signature by itself gives you no hint as to what is going on; only by printing out each object's type, seeing what operations get performed on each object, or guessing based on the names x1 and x2 could we understand what is going on in this function. I can instead define a function with a type signature as follows:

    def operation(x1: ndarray, x2: ndarray) -> ndarray:

You know right away that this is a function that takes in two ndarrays, probably combines them in some way, and outputs the result of that combination. Because of the increased clarity they provide, we'll use type-checked functions throughout this book.

Basic functions in NumPy

With these preliminaries in mind, let's write up the functions we defined earlier in NumPy:

    def square(x: ndarray) -> ndarray:
        '''
        Square each element in the input ndarray.
        '''
        return np.power(x, 2)

    def leaky_relu(x: ndarray) -> ndarray:
        '''
        Apply "Leaky ReLU" function to each element in ndarray.
        '''
        return np.maximum(0.2 * x, x)

NOTE
One of NumPy's quirks is that many functions can be applied to ndarrays either by writing np.function_name(ndarray) or by writing ndarray.function_name. For example, the preceding relu function could be written as: x.clip(min=0). We'll try to be consistent and use the np.function_name(ndarray) convention throughout—in particular, we'll avoid tricks such as ndarray.T for transposing a two-dimensional ndarray, instead writing np.transpose(ndarray, (1, 0)).

If you can wrap your mind around the fact that math, a diagram, and code are three different ways of representing the same underlying concept, then you are well on your way to displaying the kind of flexible thinking you'll need to truly understand deep learning.

Derivatives

Derivatives, like functions, are an extremely important concept for understanding deep learning that many of you are probably familiar with. Also like functions, they can be depicted in multiple ways. We'll start by simply saying at a high level that the derivative of a function at a point is the "rate of change" of the output of the function with respect to its input at that point.

Let's now walk through the same three perspectives on derivatives that we covered for functions to gain a better mental model for how derivatives work.

Math

First, we'll get mathematically precise: we can describe this number—how much the output of f changes as we change its input at a particular value a of the input—as a limit:

    df/du(a) = lim(Δ→0) [f(a + Δ) − f(a − Δ)] / (2 × Δ)

This limit can be approximated numerically by setting a very small value for Δ, such as 0.001, so we can compute the derivative as:

    df/du(a) ≈ [f(a + 0.001) − f(a − 0.001)] / 0.002

While accurate, this is only one part of a full mental model of derivatives. Let's look at them from another perspective: a diagram.

Diagrams

First, the familiar way: if we simply draw a tangent line to the Cartesian representation of the function f, the derivative of f at a point a is just the slope of this line at a. As with the mathematical descriptions in the prior subsection, there are two ways we can actually calculate the slope of this line. The first would be to use calculus to actually calculate the limit. The second would be to just take the slope of the line connecting f at a − 0.001 and a + 0.001. The latter method is depicted in Figure 1-4 and should be familiar to anyone who has taken calculus.
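To connect the math and code perspectives, here is a minimal sketch of that central-difference approximation (the name deriv and the default delta value are illustrative choices for this sketch, not definitions taken from the text above):

    from typing import Callable

    import numpy as np
    from numpy import ndarray

    def square(x: ndarray) -> ndarray:
        '''Square each element in the input ndarray (as defined earlier).'''
        return np.power(x, 2)

    def deriv(func: Callable[[ndarray], ndarray],
              input_: ndarray,
              delta: float = 0.001) -> ndarray:
        '''
        Approximate the derivative of "func" at every element of "input_"
        by taking the slope of the line connecting func(input_ - delta)
        and func(input_ + delta).
        '''
        return (func(input_ + delta) - func(input_ - delta)) / (2 * delta)

    # For example, applying this to the square function:
    print(deriv(square, np.array([1.0, 2.0, 3.0])))  # approximately [2. 4. 6.]

For square, whose derivative is 2x, this central-difference estimate matches the true derivative up to floating-point error.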