Ankur A. Patel & Ajay Uppili Arasanipalai TM Applied Natural Language Processing in the Enterprise Teaching Machines to Read, Write & Understand
Ankur A. Patel and Ajay Uppili Arasanipalai

Applied Natural Language Processing in the Enterprise
Teaching Machines to Read, Write, and Understand

Beijing • Boston • Farnham • Sebastopol • Tokyo
Applied Natural Language Processing in the Enterprise
by Ankur A. Patel and Ajay Uppili Arasanipalai

Copyright © 2021 Human AI Collaboration, Inc. and Taukren, LLC. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jonathan Hassell
Development Editor: Melissa Potter
Production Editor: Deborah Baker
Copyeditor: Kim Cofer
Proofreader: Piper Editorial Consulting, LLC
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

June 2021: First Edition

Revision History for the First Edition
2021-05-11: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492062578 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Applied Natural Language Processing in the Enterprise, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-492-06257-8
[LSI]
Table of Contents

Preface

Part I. Scratching the Surface

1. Introduction to NLP
   What Is NLP?
   Popular Applications
   History
   Inflection Points
   A Final Word
   Basic NLP
   Defining NLP Tasks
   Set Up the Programming Environment
   spaCy, fast.ai, and Hugging Face
   Perform NLP Tasks Using spaCy
   Conclusion

2. Transformers and Transfer Learning
   Training with fastai
   Using the fastai Library
   ULMFiT for Transfer Learning
   Fine-Tuning a Language Model on IMDb
   Training a Text Classifier
   Inference with Hugging Face
   Loading Models
   Generating Predictions
   Conclusion

3. NLP Tasks and Applications
   Pretrained Language Models
   Transfer Learning and Fine-Tuning
   NLP Tasks
   Natural Language Dataset
   Explore the AG Dataset
   NLP Task #1: Named Entity Recognition
   Perform Inference Using the Original spaCy Model
   Custom NER
   Annotate via Prodigy: NER
   Train the Custom NER Model Using spaCy
   Custom NER Model Versus Original NER Model
   NLP Task #2: Text Classification
   Annotate via Prodigy: Text Classification
   Train Text Classification Models Using spaCy
   Conclusion

Part II. The Cogs in the Machine

4. Tokenization
   A Minimal Tokenizer
   Hugging Face Tokenizers
   Subword Tokenization
   Building Your Own Tokenizer
   Conclusion

5. Embeddings: How Machines “Understand” Words
   Understanding Versus Reading Text
   Word Vectors
   Word2Vec
   Embeddings in the Age of Transfer Learning
   Embeddings in Practice
   Preprocessing
   Model
   Training
   Validation
   Embedding Things That Aren’t Words
   Making Vectorized Music
   Some General Tips for Making Custom Embeddings
   Conclusion

6. Recurrent Neural Networks and Other Sequence Models
   Recurrent Neural Networks
   RNNs in PyTorch from Scratch
   Bidirectional RNN
   Sequence to Sequence Using RNNs
   Long Short-Term Memory
   Gated Recurrent Units
   Conclusion

7. Transformers
   Building a Transformer from Scratch
   Attention Mechanisms
   Dot Product Attention
   Scaled Dot Product Attention
   Multi-Head Self-Attention
   Adaptive Attention Span
   Persistent Memory/All-Attention
   Product-Key Memory
   Transformers for Computer Vision
   Conclusion

8. BERTology: Putting It All Together
   ImageNet
   The Power of Pretrained Models
   The Path to NLP’s ImageNet Moment
   Pretrained Word Embeddings
   The Limitations of One-Hot Encoding
   Word2Vec
   GloVe
   fastText
   Context-Aware Pretrained Word Embeddings
   Sequential Models
   Sequential Data and the Importance of Sequential Models
   RNNs
   Vanilla RNNs
   LSTM Networks
   GRUs
   Attention Mechanisms
   Transformers
   Transformer-XL
   NLP’s ImageNet Moment
   Universal Language Model Fine-Tuning
   ELMo
   BERT
   BERTology
   GPT-1, GPT-2, GPT-3
   Conclusion

Part III. Outside the Wall

9. Tools of the Trade
   Deep Learning Frameworks
   PyTorch
   TensorFlow
   Jax
   Julia
   Visualization and Experiment Tracking
   TensorBoard
   Weights & Biases
   Neptune
   Comet
   MLflow
   AutoML
   H2O.ai
   Dataiku
   DataRobot
   ML Infrastructure and Compute
   Paperspace
   FloydHub
   Google Colab
   Kaggle Kernels
   Lambda GPU Cloud
   Edge/On-Device Inference
   ONNX
   Core ML
   Edge Accelerators
   Cloud Inference and Machine Learning as a Service
   AWS
   Microsoft Azure
   Google Cloud Platform
   Continuous Integration and Delivery
   Conclusion

10. Visualization
   Our First Streamlit App
   Build the Streamlit App
   Deploy the Streamlit App
   Explore the Streamlit Web App
   Build and Deploy a Streamlit App for Custom NER
   Build and Deploy a Streamlit App for Text Classification on AG News Dataset
   Build and Deploy a Streamlit App for Text Classification on Custom Text
   Conclusion

11. Productionization
   Data Scientists, Engineers, and Analysts
   Prototyping, Deployment, and Maintenance
   Notebooks and Scripts
   Databricks: Your Unified Data Analytics Platform
   Support for Big Data
   Support for Multiple Programming Languages
   Support for ML Frameworks
   Support for Model Repository, Access Control, Data Lineage, and Versioning
   Databricks Setup
   Set Up Access to S3 Bucket
   Set Up Libraries
   Create Cluster
   Create Notebook
   Enable Init Script and Restart Cluster
   Run Speed Test: Inference on NER Using spaCy
   Machine Learning Jobs
   Production Pipeline Notebook
   Scheduled Machine Learning Jobs
   Event-Driven Machine Learning Pipeline
   MLflow
   Log and Register Model
   MLflow Model Serving
   Alternatives to Databricks
   Amazon SageMaker
   Saturn Cloud
   Conclusion

12. Conclusion
   Ten Final Lessons
   Lesson 1: Start with Simple Approaches First
   Lesson 2: Leverage the Community
   Lesson 3: Do Not Create from Scratch, When Possible
   Lesson 4: Intuition and Experience Trounces Theory
   Lesson 5: Fight Decision Fatigue
   Lesson 6: Data Is King
   Lesson 7: Lean on Humans
   Lesson 8: Pair Yourself with Really Great Engineers
   Lesson 9: Ensemble
   Lesson 10: Have Fun
   Final Word

A. Scaling

B. CUDA

Index
Preface

What Is Natural Language Processing?

Many of you work with numerical data on a daily basis, either in a spreadsheet program like Microsoft Excel or in a programming environment such as Jupyter Notebook. When you work with numbers, you leave the number-crunching up to the computer. There is almost no reason for you not to.

Computers are fast and precise with number-crunching, whereas the human brain gets bogged down easily. If asked to calculate 24 × 36 × 48, humans would not hesitate for a second to pull out a calculator or a computer and let the machines do the heavy lifting.

But, when it comes to analyzing textual data, the mighty number-crunching machines have not been so good, historically speaking. Humans use computers to crunch numbers but rely on the human brain to analyze documents with text. To date, this inability to work with text has limited the scope of work machines could handle.

This is about to change. In many ways, this change is already well underway. Machines are now able to process text and audio in ways that most humans would have considered magical just two decades ago.

Consider just how much you rely on computers to analyze and make sense of textual data in the everyday world around you. Here are several examples:

Google Search
  Search the entire web and surface relevant search results.

Google Gmail
  Auto-complete sentences as you write emails.

Google Translate
  Convert text and audio from one language to another.
Amazon Alexa, Apple Siri, Google Assistant, Microsoft Cortana
  Give voice commands and control your home devices.

Customer Service Chatbots
  Ask account-related questions and get (mostly reasonable) answers.

These technologies have become ingrained in our daily lives so gradually and seamlessly that we almost forget just how much we use them day to day.

The story of machines being able to work with textual data is just getting started. Over the past few years, there have been pretty dramatic advances in this field, and, over time, we will see computers handle more and more of the work that only humans were capable of doing in the past.

Why Should I Read This Book?

Natural language processing (NLP) is one of the hottest topics in AI today. Having lagged behind other deep learning fields such as computer vision for years, NLP only recently gained mainstream popularity. Even though Google, Facebook, and OpenAI have open sourced large pretrained language models to make NLP easier, many organizations today still struggle with developing and productionizing NLP applications. This hands-on guide helps you learn the field quickly.

What Do I Need to Know Already?

This book is not for complete beginners. We are going to assume that you already know a bit about machine learning and that you have used Python and libraries such as NumPy, pandas, and matplotlib before.

For more on Python, visit the official Python website, and for more on Jupyter Notebook, visit the official Jupyter site. For a refresher on college-level calculus, linear algebra, probability, and statistics, read Part I of the textbook Deep Learning (MIT Press) by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. For a refresher on machine learning, read The Elements of Statistical Learning (Springer) by Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman.

What Is This Book All About?

If you have a basic-to-intermediate understanding of machine learning and programming experience with Python, you’ll learn how to build and deploy real-world NLP applications in your organization. We will walk you through the process without bogging you down in theory. After reading this book and practicing on your own, you should be able to do the following:
• Understand how state-of-the-art NLP models work.
• Learn the tools of the trade, including the most popular frameworks today.
• Perform NLP tasks such as text classification, semantic search, and reading comprehension.
• Solve problems using new transformer-based models and techniques such as transfer learning.
• Develop NLP models with performance comparable or superior to out-of-the-box systems.
• Deploy models to production and monitor and maintain their performance.
• Implement a suite of NLP algorithms using Python and PyTorch.

Our book’s goal is to outline the concepts and tools required for you to develop the intuition necessary to apply this technology to everyday problems that you work on. In other words, this is an applied book, one that will allow you to build real-world applications. This book will not have every bit of theory that is relevant to NLP, and you will have to supplement your knowledge in the space using other resources over time, but we will get you started and well underway in this field.

The book will use a hands-on approach, introducing some theory but focusing mostly on applying natural language techniques to solving real-world problems. The datasets and code are available online as Jupyter Notebooks on our GitHub repo.
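To give you a sense of what that code looks like, here is a minimal sketch of one of the first tasks we cover, named entity recognition with spaCy (Chapter 1). It assumes spaCy is installed and the small English pipeline (en_core_web_sm) has been downloaded; the example sentence is just an illustrative placeholder, not one of the book’s datasets.

```python
# A minimal sketch of named entity recognition with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # load a pretrained English pipeline
doc = nlp("Tim Cook flew to London in June to meet engineers from Apple.")

for ent in doc.ents:                # entities the model detected in the text
    print(ent.text, ent.label_)     # e.g., "Tim Cook PERSON", "London GPE"
```

If a handful of lines like these feels approachable, you have all the programming background you need; the rest of the book builds from snippets like this one toward custom-trained models and production deployments.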
How Is This Book Organized?

This book is organized into three parts.

Part I (Chapters 1–3)
  These chapters focus on a high-level overview of NLP, including the history of NLP, the most popular applications in the field, and how to use pretrained models to perform transfer learning and solve real-world problems quickly.

Part II (Chapters 4–8)
  In these chapters, we’ll dive into the low-level details of NLP, including preprocessing text, tokenization, and word embeddings. While not the sexiest topics, these are foundational to the field of NLP. We then explore the most effective modeling approaches in NLP today, such as transformers, attention mechanisms, vanilla recurrent neural networks, long short-term memory (LSTM), and gated recurrent units (GRUs). Finally, we tie everything together to present the watershed year in NLP—the so-called ImageNet moment in 2018, when large, pretrained language models shattered previous performance records and became widely available for use by both researchers and applied engineers.

Part III (Chapters 9–11)
  Here we’ll cover the most important aspect of applied NLP—how to productionize models that have been developed so the models deliver tangible value to organizations. We discuss the landscape of tools available today, and share our opinions on them. We also cover special topics that are, strictly speaking, not related to NLP but may affect how NLP models are productionized.

While we will not be able to cover every NLP topic in this book, including the more advanced topics for seasoned veterans, we will continue to support our community with new and updated material (including code) online via our official book website and GitHub. Please tune in for updates after you finish reading this book!

As a side note, it’s worth mentioning that this book was written entirely in Jupyter Notebooks. You can find the code for this book on our GitHub repository. We encourage you to run the experiments in the notebooks as you read to get familiar with implementing the ideas presented in real code (but also because we have omitted some outputs in this book due to space constraints).

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
  Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
  Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
  Shows commands or other text that should be typed literally by the user.

Constant width italic
  Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.
This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/nlpbook/nlpbook.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Applied Natural Language Processing in the Enterprise by Ankur A. Patel and Ajay Uppili Arasanipalai (O’Reilly). Copyright 2021 Human AI Collaboration, Inc. and Taukren, LLC, 978-1-492-06257-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

You can access the web page for this book, where we list errata and any additional information, at https://oreil.ly/Applied_NLP_in_the_Enterprise.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and more information about our books and courses, visit our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia
Acknowledgments

We would like to thank the entire team at O’Reilly for helping make this project possible, starting with Jonathan Hassell for championing and green-lighting this book in the summer of 2019. We want to send a huge shout-out to our editor, Melissa Potter. She really helped us stay on schedule throughout 2020, despite all the challenges of COVID-19.

Big thanks to Jeremy Howard for providing valuable advice early on and for sharing the source code for FastDoc, an incredible tool for converting Jupyter Notebooks to AsciiDoc that we used throughout the development process. His work with Rachel Thomas, Sylvain Gugger, Zach Mueller, Hamel Husain, and many fastai contributors to make deep learning accessible and practical has been a huge source of inspiration for this book.

Our production editor, Deborah Baker, and the Content Services Manager, Kristen Brown, helped polish this book to its final form with the help of Kim Cofer, David Futato, Karen Montgomery, Kate Dullea, and the teams at Piper Editorial Consulting, LLC, and nSight, Inc. They made the final stretch of the writing process a breeze.

Special thanks to Artiom Tarasiuk, Victor Borges, and Benjamin Muskalla for spending countless hours reading and reviewing the book and providing critical feedback along the way. We are so grateful for their kinship and generosity in making this project what it is today.

Ajay

First and foremost, I would like to thank my parents, Uppili and Hema, who have worked tirelessly to support me through a raging pandemic, and my sister, Anika, who I have the highest hopes for.

There are many others to whom I owe an immeasurable debt of gratitude. Gayathri Srinivasan, who mentored me all those years ago and was the person kind enough to give a random high-schooler access to a supercomputer that first introduced me to the idea that machines can learn. Ganesan Narayanaswamy, for his generosity in providing the computational resources and infrastructure needed to support my research through the OpenPOWER Foundation. Diganta Misra, Trikay Nalamada, Himanshu Arora, and my other collaborators at Landskape, who have spent countless hours running experiments and joining me for the 2 AM discussions about attention mechanisms out of nothing but a shared passion for deep learning and a desire to contribute back to the research community. Their encouragement and enthusiasm for the book and my work in general have been inordinately valuable.
Ankur

I am so happy to be part of an incredibly generous and supportive family, to whom I owe everything. I want to thank my parents, Amrat and Ila, for their sacrifices over the years and for investing in me and my education; I simply would not be here doing what I’m doing today without them. I want to thank my sister, Bhavini, and my brother, Jigar, for championing me, always. And, I am so grateful to my beautiful girlfriend, Maria Koval, and our golden retriever, Brody, both of whom patiently put up with many late nights and weekends of writing and coding. Thank you!

I also want to thank my cofounders at Glean, Howard Katzenberg and Alexander Jia, and my good friend and cofounder at Mellow, Nate Collins, for being incredibly patient and supportive through the entire writing process. I am truly fortunate to have such amazing friends and colleagues—they bring happiness to my life every day.
PART I
Scratching the Surface

This first section of the book covers NLP at a high level. This is a somewhat subjective term, so to be more specific, when we say “high level” we mean little to no math and little to no PyTorch code.