📄 Page
1
Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta & Harshit Surana Practical Natural Language Processing A Comprehensive Guide to Building Real-World NLP Systems
📄 Page
3
Praise for Practical Natural Language Processing

Practical NLP focuses squarely on an overlooked demographic: the practitioners and business leaders in industry! While many great books focus on ML’s algorithmic fundamentals, this book exposes the anatomy of real-world systems: from e-commerce applications to virtual assistants. Painting a realistic picture of modern production systems, the book teaches not only deep learning, but also the heuristics and patchwork pipelines that define the (actual) state of the art for deployed NLP systems. The authors zoom out, teaching problem formulation, and aren’t afraid to zoom in on the grimy details, including handling messy data and sustaining live systems. This book will prove invaluable to industry professionals keen to build and deploy NLP in the wild.
—Zachary Lipton, Assistant Professor, Carnegie Mellon University, Scientist at Amazon AI, Author of Dive into Deep Learning

This book does a great job bridging the gap between natural language processing (NLP) research and practical applications. From healthcare to e-commerce and finance, it covers many of the most sought-after domains where NLP is being put to use and walks through core tasks in a clear and understandable manner. Overall, the book is a great manual on how to get the most out of current NLP in your industry.
—Sebastian Ruder, Research Scientist, Google DeepMind

There are two kinds of computer science books on the market: academic textbooks that give you a deep understanding of a domain but can be difficult to access for a non-academic, and “cookbooks” that outline solutions to very specific problems without providing the technical foundations that would allow the reader to generalize the recipes. This book offers the best of both worlds: it is thorough yet accessible. It provides the reader with a solid foundation in natural-language processing. . . . If you would like to go from zero to one in NLP, this book is for you!
—Marc Najork, Research Engineering Director, Google AI, ACM & IEEE Fellow
📄 Page
4
There are textbooks or research papers or books on programming tips, but not a book that tells us how to build an end-to-end NLP system from scratch. I am happy to see this book on practical NLP, which fills this much-needed gap. The authors have meticulously, thoughtfully and lucidly covered each and every aspect of NLP that one has to be aware of while building large-scale practical systems; at the same time, this book has also managed to cover a large number of examples and varied application areas and verticals. This book is a must for all aspiring NLP engineers, entrepreneurs who want to build companies around language technologies, and also academic researchers who would like to see their inventions reach the real users.
—Monojit, Principal Researcher, Microsoft Research India, Adjunct Faculty at IIIT Hyderabad, Ashoka University, IIT Kharagpur

This book bridges the gap between theory and practice by explaining the underlying concepts while keeping in mind varied real-world deployments across different business verticals. There is much hard-fought practical advice from the trenches, whether it is about tweaking parameters of open source libraries, setting up data pipelines for building models, or optimizing for fast inference. A must-read for engineers building NLP applications.
—Vinayak Hegde, CTO-in-Residence, Microsoft For Startups

This book shows how to put NLP to practice. It bridges the gap between NLP theory and practical engineering. The authors achieved a rare feat by simplifying the esoteric art of design and architecture of production-quality machine learning systems. I wish I had access to this book early on in my professional career and evaded the mistakes I made along the way. . . . I am deeply convinced that this book is an essential read for anybody involved in developing a robust, high-performing NLP system.
—Siddharth Sharma, ML Engineer, Facebook

I feel this is not only an essential book for NLP practitioners, it is also a valuable reference for the research community to understand the problem spaces in real-world applications. I very much appreciate this book and wish this could be a long-term project with up-to-date NLP application trending!
—Mengting Wan, Data Scientist (ML&NLP) at Airbnb, Microsoft Research Fellow
📄 Page
5
Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana
Practical Natural Language Processing
A Comprehensive Guide to Building Real-World NLP Systems
Boston Farnham Sebastopol Tokyo Beijing
📄 Page
6
978-1-492-05405-4
[LSI]
Practical Natural Language Processing
by Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana
Copyright © 2020 Anuj Gupta, Bodhisattwa Prasad Majumder, Sowmya Vajjala, and Harshit Surana. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Acquisitions Editor: Jonathan Hassell
Developmental Editor: Melissa Potter
Production Editor: Beth Kelly
Copyeditor: Holly Forsyth
Proofreader: Charles Roumeliotis
Indexer: nSight Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2020: First Edition
Revision History for the First Edition
2020-06-17: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492054054 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Natural Language Processing, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
📄 Page
7
This book is dedicated to our respective advisors: Detmar Meurers, Julian McAuley, Kannan Srinathan, and Luis von Ahn.
📄 Page
9
Table of Contents

Foreword xv
Preface xvii

Part I. Foundations

1. NLP: A Primer 3
  NLP in the Real World 5
  NLP Tasks 6
  What Is Language? 8
  Building Blocks of Language 9
  Why Is NLP Challenging? 12
  Machine Learning, Deep Learning, and NLP: An Overview 14
  Approaches to NLP 16
  Heuristics-Based NLP 16
  Machine Learning for NLP 19
  Deep Learning for NLP 22
  Why Deep Learning Is Not Yet the Silver Bullet for NLP 28
  An NLP Walkthrough: Conversational Agents 31
  Wrapping Up 33

2. NLP Pipeline 37
  Data Acquisition 39
  Text Extraction and Cleanup 42
  HTML Parsing and Cleanup 44
  Unicode Normalization 45
  Spelling Correction 46
📄 Page
10
  System-Specific Error Correction 47
  Pre-Processing 49
  Preliminaries 50
  Frequent Steps 52
  Other Pre-Processing Steps 55
  Advanced Processing 57
  Feature Engineering 60
  Classical NLP/ML Pipeline 62
  DL Pipeline 62
  Modeling 62
  Start with Simple Heuristics 63
  Building Your Model 64
  Building THE Model 65
  Evaluation 68
  Intrinsic Evaluation 68
  Extrinsic Evaluation 71
  Post-Modeling Phases 72
  Deployment 72
  Monitoring 72
  Model Updating 73
  Working with Other Languages 73
  Case Study 74
  Wrapping Up 76

3. Text Representation 81
  Vector Space Models 84
  Basic Vectorization Approaches 85
  One-Hot Encoding 85
  Bag of Words 87
  Bag of N-Grams 89
  TF-IDF 90
  Distributed Representations 92
  Word Embeddings 94
  Going Beyond Words 103
  Distributed Representations Beyond Words and Characters 105
  Universal Text Representations 107
  Visualizing Embeddings 108
  Handcrafted Feature Representations 112
  Wrapping Up 113
📄 Page
11
Part II. Essentials

4. Text Classification 119
  Applications 121
  A Pipeline for Building Text Classification Systems 123
  A Simple Classifier Without the Text Classification Pipeline 125
  Using Existing Text Classification APIs 126
  One Pipeline, Many Classifiers 126
  Naive Bayes Classifier 127
  Logistic Regression 131
  Support Vector Machine 132
  Using Neural Embeddings in Text Classification 134
  Word Embeddings 134
  Subword Embeddings and fastText 136
  Document Embeddings 138
  Deep Learning for Text Classification 140
  CNNs for Text Classification 143
  LSTMs for Text Classification 144
  Text Classification with Large, Pre-Trained Language Models 145
  Interpreting Text Classification Models 147
  Explaining Classifier Predictions with Lime 148
  Learning with No or Less Data and Adapting to New Domains 149
  No Training Data 149
  Less Training Data: Active Learning and Domain Adaptation 150
  Case Study: Corporate Ticketing 152
  Practical Advice 155
  Wrapping Up 157

5. Information Extraction 161
  IE Applications 162
  IE Tasks 164
  The General Pipeline for IE 165
  Keyphrase Extraction 166
  Implementing KPE 167
  Practical Advice 168
  Named Entity Recognition 169
  Building an NER System 171
  NER Using an Existing Library 175
  NER Using Active Learning 176
  Practical Advice 177
  Named Entity Disambiguation and Linking 178
  NEL Using Azure API 179
📄 Page
12
  Relationship Extraction 181
  Approaches to RE 182
  RE with the Watson API 184
  Other Advanced IE Tasks 185
  Temporal Information Extraction 186
  Event Extraction 187
  Template Filling 189
  Case Study 190
  Wrapping Up 193

6. Chatbots 197
  Applications 198
  A Simple FAQ Bot 199
  A Taxonomy of Chatbots 200
  Goal-Oriented Dialog 202
  Chitchats 202
  A Pipeline for Building Dialog Systems 203
  Dialog Systems in Detail 204
  PizzaStop Chatbot 206
  Deep Dive into Components of a Dialog System 216
  Dialog Act Classification 217
  Identifying Slots 217
  Response Generation 218
  Dialog Examples with Code Walkthrough 219
  Other Dialog Pipelines 224
  End-to-End Approach 225
  Deep Reinforcement Learning for Dialogue Generation 225
  Human-in-the-Loop 226
  Rasa NLU 227
  A Case Study: Recipe Recommendations 230
  Utilizing Existing Frameworks 231
  Open-Ended Generative Chatbots 233
  Wrapping Up 234

7. Topics in Brief 239
  Search and Information Retrieval 241
  Components of a Search Engine 243
  A Typical Enterprise Search Pipeline 246
  Setting Up a Search Engine: An Example 247
  A Case Study: Book Store Search 249
  Topic Modeling 250
  Training a Topic Model: An Example 254
📄 Page
13
  What’s Next? 255
  Text Summarization 256
  Summarization Use Cases 256
  Setting Up a Summarizer: An Example 257
  Practical Advice 258
  Recommender Systems for Textual Data 260
  Creating a Book Recommender System: An Example 261
  Practical Advice 262
  Machine Translation 263
  Using a Machine Translation API: An Example 264
  Practical Advice 265
  Question-Answering Systems 266
  Developing a Custom Question-Answering System 268
  Looking for Deeper Answers 268
  Wrapping Up 269

Part III. Applied

8. Social Media 275
  Applications 277
  Unique Challenges 278
  NLP for Social Data 284
  Word Cloud 284
  Tokenizer for SMTD 286
  Trending Topics 286
  Understanding Twitter Sentiment 288
  Pre-Processing SMTD 290
  Text Representation for SMTD 294
  Customer Support on Social Channels 297
  Memes and Fake News 299
  Identifying Memes 299
  Fake News 300
  Wrapping Up 302

9. E-Commerce and Retail 307
  E-Commerce Catalog 308
  Review Analysis 308
  Product Search 309
  Product Recommendations 309
  Search in E-Commerce 309
  Building an E-Commerce Catalog 312
📄 Page
14
  Attribute Extraction 312
  Product Categorization and Taxonomy 317
  Product Enrichment 321
  Product Deduplication and Matching 323
  Review Analysis 324
  Sentiment Analysis 325
  Aspect-Level Sentiment Analysis 327
  Connecting Overall Ratings to Aspects 329
  Understanding Aspects 330
  Recommendations for E-Commerce 332
  A Case Study: Substitutes and Complements 333
  Wrapping Up 336

10. Healthcare, Finance, and Law 339
  Healthcare 339
  Health and Medical Records 341
  Patient Prioritization and Billing 342
  Pharmacovigilance 342
  Clinical Decision Support Systems 342
  Health Assistants 342
  Electronic Health Records 344
  Mental Healthcare Monitoring 353
  Medical Information Extraction and Analysis 355
  Finance and Law 358
  NLP Applications in Finance 360
  NLP and the Legal Landscape 363
  Wrapping Up 366

Part IV. Bringing It All Together

11. The End-to-End NLP Process 371
  Revisiting the NLP Pipeline: Deploying NLP Software 372
  An Example Scenario 374
  Building and Maintaining a Mature System 376
  Finding Better Features 377
  Iterating Existing Models 378
  Code and Model Reproducibility 379
  Troubleshooting and Interpretability 379
  Monitoring 382
  Minimizing Technical Debt 383
  Automating Machine Learning 384
📄 Page
15
  The Data Science Process 388
  The KDD Process 388
  Microsoft Team Data Science Process 390
  Making AI Succeed at Your Organization 392
  Team 392
  Right Problem and Right Expectations 393
  Data and Timing 394
  A Good Process 395
  Other Aspects 396
  Peeking over the Horizon 398
  Final Words 401

Index 407
📄 Page
17
Foreword

The field of natural language processing (NLP) has undergone a dramatic shift in recent years, both in terms of methodology and in terms of the applications supported. Methodological advances have ranged from new ways of representing documents to new techniques for language synthesis. With these have come new applications ranging from open-ended conversational systems to techniques that use natural language for model interpretability. Finally, these advances have seen NLP gain a foothold in related areas, such as computer vision and recommender systems, some of which my lab is working on with support from Amazon, Samsung, and the National Science Foundation.

As NLP expands into these exciting new areas, so too does the audience of practitioners wanting to make use of NLP techniques. In the Data Science course (CSE 258) that I teach at the University of California–San Diego, which is often the most attended in the computer science department, I see that more and more students are doing their projects on NLP-based topics. NLP is rapidly becoming a necessary skill for engineers, product managers, scientists, students, and enthusiasts wishing to build applications on top of natural language data.

On one hand, new tools and libraries for NLP and machine learning have made natural language modeling more accessible than ever. But on the other hand, resources for learning NLP must target this ever-growing and diverse audience. This is especially true for organizations that have recently adopted NLP or for students working with natural language data for the first time.

It has been my pleasure over the last few years to collaborate with Bodhisattwa Majumder on exciting new applications in NLP and dialog, so I was thrilled to hear about his efforts (along with Sowmya Vajjala, Anuj Gupta, and Harshit Surana) to write a book on NLP. They have wide experience in scaling NLP, including at early-stage startups, the MIT Media Lab, Microsoft Research, and Google AI.

I am excited by the end-to-end approach taken in their book, which will make it useful for a range of scenarios and will help readers to work with the labyrinth of
📄 Page
18
possible options while building NLP applications. I am especially thrilled about the emphasis on modern NLP applications such as chatbots, as well as the focus on interdisciplinary topics such as e-commerce and retail. These topics will be especially useful for industry leaders and researchers, and are critical subjects that have been given only limited coverage in existing textbooks. This book is ideal both as a first resource to discover the field of natural language processing and as a guide for seasoned practitioners looking to discover the latest developments in this exciting area.

— Julian McAuley
Professor of Computer Science and Engineering
University of California, San Diego
📄 Page
19
Preface

Natural language processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It concerns building systems that can process and understand human language. Since its inception in the 1950s and until very recently, NLP has primarily been the domain of academia and research labs, requiring long formal education and training. The past decade’s breakthroughs have resulted in NLP being increasingly used in a range of diverse domains such as retail, healthcare, finance, law, marketing, human resources, and many more. There is a range of driving forces for these developments:

• Widely available and easy-to-use NLP tools, techniques, and APIs are now all-pervading in the industry. There has never been a better time to build quick NLP solutions.
• Development of more interpretable and generalized approaches has improved the baseline performance for even complex NLP tasks, such as open-domain conversational tasks and question answering, which were not practically feasible before.
• More and more organizations, including Google, Microsoft, and Amazon, are investing heavily in more interactive consumer products, where language is used as the primary medium of communication.
• Increased availability of useful open source datasets, along with standard benchmarks on them, has acted as a catalyst in this revolution, as opposed to being impeded by proprietary datasets only available to limited organizations and individuals.
• The viability of NLP has moved beyond English or other major languages. Datasets and language-specific models are being created for the less-frequently digitized languages too. A fruitful product that came out of this effort was a near-perfect automatic machine translation tool available to all individuals with a smartphone.
📄 Page
20
With this rapidly expanding usage, a growing proportion of the workforce that builds these NLP systems is grappling with limited experience and theoretical knowledge about the topic. This book addresses this need from an applied perspective. Our book aims to guide the readers to build, iterate, and scale NLP systems in a business setting, and to tailor them for various industry verticals.

Why We Wrote This Book

There are many popular books on NLP available. While some of these serve as textbooks, focusing on theoretical aspects, others aim to introduce NLP concepts through a lot of code examples. There are a few others that focus on specific NLP or machine learning libraries and provide “how-to” guides on solving different NLP problems using the libraries. So, why do we need another book on NLP?

We have been building and scaling NLP solutions for over a decade at leading universities and technology companies. While mentoring colleagues and other engineers, we noticed a gap between NLP practice in the industry and the NLP skill sets of new engineers, particularly those who are just starting with NLP. We came to understand these gaps even better during NLP workshops we were conducting for industry professionals, where we noticed that business and engineering leaders also have these gaps.

Most online courses and books tackle NLP problems using toy use cases and popular (often large, clean, and well-defined) datasets. While this imparts the general methods of NLP, we believe it does not provide enough of a foundation to tackle new problems and develop specific solutions in the real world. Commonly encountered problems while building real-world applications, such as data collection, working with noisy data and signals, incremental development of solutions, and issues involved in deploying the solutions as a part of a larger application, are not dealt with by existing resources, to the best of our knowledge. We also saw that best practices to develop NLP systems were missing in most scenarios. We felt a book was needed to bridge this gap, and that is how this book was born!

The Philosophy

We want to provide a holistic and practical perspective that enables the reader to successfully build real-world NLP solutions embedded in larger product setups. Thus, most chapters are accompanied by code walkthroughs in the associated Git repository. The book is also supplemented with extensive references for readers who want to delve deeper. Throughout the book, we start with a simple solution and incrementally build more complex solutions by taking a minimum viable product (MVP) approach, as commonly found in industry practice. We also give tips based on our experience and learnings. Where possible, each chapter is accompanied by a