MANNING
Ekaterina Kochmar
Core NLP concepts and techniques covered in this book

Core NLP concept or technique               First introduction
Cosine similarity                           Chapter 1
N-grams                                     Chapter 1
Tokenization                                Chapter 2
Text normalization                          Chapter 2
Stemming                                    Chapter 3
Stopwords removal                           Chapter 3
Term frequency–inverse document frequency   Chapter 3
Part-of-speech tagging                      Chapter 4
Lemmatization                               Chapter 4
Parsing                                     Chapter 4
Linguistic feature engineering              Chapter 6
Word senses                                 Chapter 8
Topic modeling                              Chapter 10
Named entities                              Chapter 11

NLP and ML toolkits and libraries used in this book

Toolkit or library                          Examples of use
Natural Language Toolkit (NLTK)             Chapters 2, 3, 5, 8, 10
spaCy                                       Chapters 4, 6, 7, 8, 11
displaCy                                    Chapters 4, 11
scikit-learn                                Chapters 5, 6, 8, 9, 10
gensim                                      Chapter 10
pyLDAvis                                    Chapter 10
pandas                                      Chapter 11
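As a quick taste of the first concept in the table above, cosine similarity (chapter 1) can be computed over simple word-count vectors in pure Python. This is only an illustrative sketch with naive whitespace tokenization; the function name and representation here are assumptions, not the book's own implementation:

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts represented as word-count vectors."""
    # Naive whitespace tokenization; the book covers proper tokenization in chapter 2
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("natural language processing",
                        "language processing in practice"))
```

The result ranges from 0.0 (no words in common) to 1.0 (identical word distributions), which is what makes it useful for comparing documents in information search.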
Getting Started with Natural Language Processing
Getting Started with Natural Language Processing
EKATERINA KOCHMAR
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2022 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Dustin Archibald
Technical development editor: Michael Lund
Review editor: Adriana Sabo
Production editor: Kathleen Rossland
Copy editor: Carrie Andrews
Proofreader: Jason Everett
Technical proofreader: Al Krinker
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617296765
Printed in the United States of America
To my family, who always supported me and believed in me.
brief contents

 1 ■ Introduction 1
 2 ■ Your first NLP example 31
 3 ■ Introduction to information search 71
 4 ■ Information extraction 114
 5 ■ Author profiling as a machine-learning task 151
 6 ■ Linguistic feature engineering for author profiling 194
 7 ■ Your first sentiment analyzer using sentiment lexicons 229
 8 ■ Sentiment analysis with a data-driven approach 263
 9 ■ Topic analysis 304
10 ■ Topic modeling 346
11 ■ Named-entity recognition 384
contents

preface xiii
acknowledgments xv
about this book xvii
about the author xxii
about the cover illustration xxiii

1 Introduction 1
  1.1 A brief history of NLP 2
  1.2 Typical tasks 5
      Information search 5 ■ Advanced information search: Asking the machine precise questions 16 ■ Conversational agents and intelligent virtual assistants 18 ■ Text prediction and language generation 20 ■ Spam filtering 25 ■ Machine translation 26 ■ Spell- and grammar checking 28

2 Your first NLP example 31
  2.1 Introducing NLP in practice: Spam filtering 31
  2.2 Understanding the task 36
      Step 1: Define the data and classes 37 ■ Step 2: Split the text into words 37 ■ Step 3: Extract and normalize the features 42 ■ Step 4: Train a classifier 43 ■ Step 5: Evaluate the classifier 45
  2.3 Implementing your own spam filter 46
      Step 1: Define the data and classes 46 ■ Step 2: Split the text into words 49 ■ Step 3: Extract and normalize the features 50 ■ Step 4: Train the classifier 53 ■ Step 5: Evaluate your classifier 62
  2.4 Deploying your spam filter in practice 65

3 Introduction to information search 71
  3.1 Understanding the task 72
      Data and data structures 75 ■ Boolean search algorithm 83
  3.2 Processing the data further 87
      Preselecting the words that matter: Stopwords removal 87 ■ Matching forms of the same word: Morphological processing 90
  3.3 Information weighing 96
      Weighing words with term frequency 97 ■ Weighing words with inverse document frequency 100
  3.4 Practical use of the search algorithm 103
      Retrieval of the most similar documents 104 ■ Evaluation of the results 106 ■ Deploying search algorithm in practice 111

4 Information extraction 114
  4.1 Use cases 116
      Case 1 116 ■ Case 2 117 ■ Case 3 119
  4.2 Understanding the task 120
  4.3 Detecting word types with part-of-speech tagging 124
      Understanding word types 124 ■ Part-of-speech tagging with spaCy 128
  4.4 Understanding sentence structure with syntactic parsing 137
      Why sentence structure is important 137 ■ Dependency parsing with spaCy 139
  4.5 Building your own information extraction algorithm 144

5 Author profiling as a machine-learning task 151
  5.1 Understanding the task 153
      Case 1: Authorship attribution 154 ■ Case 2: User profiling 155
  5.2 Machine-learning pipeline at first glance 157
      Original data 157 ■ Testing generalization behavior 163 ■ Setting up the benchmark 169
  5.3 A closer look at the machine-learning pipeline 175
      Decision Trees classifier basics 175 ■ Evaluating which tree is better using node impurity 178 ■ Selection of the best split in Decision Trees 184 ■ Decision Trees on language data 185

6 Linguistic feature engineering for author profiling 194
  6.1 Another close look at the machine-learning pipeline 196
      Evaluating the performance of your classifier 196 ■ Further evaluation measures 197
  6.2 Feature engineering for authorship attribution 200
      Word and sentence length statistics as features 201 ■ Counts of stopwords and proportion of stopwords as features 207 ■ Distributions of parts of speech as features 212 ■ Distribution of word suffixes as features 219 ■ Unique words as features 223
  6.3 Practical use of authorship attribution and user profiling 226

7 Your first sentiment analyzer using sentiment lexicons 229
  7.1 Use cases 231
  7.2 Understanding your task 234
      Aggregating sentiment score with the help of a lexicon 235 ■ Learning to detect sentiment in a data-driven way 237
  7.3 Setting up the pipeline: Data loading and analysis 239
      Data loading and preprocessing 240 ■ A closer look into the data 243
  7.4 Aggregating sentiment scores with a sentiment lexicon 251
      Collecting sentiment scores from a lexicon 252 ■ Applying sentiment scores to detect review polarity 255

8 Sentiment analysis with a data-driven approach 263
  8.1 Addressing multiple senses of a word with SentiWordNet 266
  8.2 Addressing dependence on context with machine learning 277
      Data preparation 278 ■ Extracting features from text 284 ■ Scikit-learn’s machine-learning pipeline 289 ■ Full-scale evaluation with cross-validation 292
  8.3 Varying the length of the sentiment-bearing features 295
  8.4 Negation handling for sentiment analysis 298
  8.5 Further practice 301

9 Topic analysis 304
  9.1 Topic classification as a supervised machine-learning task 307
      Data 308 ■ Topic classification with Naïve Bayes 312 ■ Evaluation of the results 320
  9.2 Topic discovery as an unsupervised machine-learning task 325
      Unsupervised ML approaches 325 ■ Clustering for topic discovery 330 ■ Evaluation of the topic clustering algorithm 338

10 Topic modeling 346
  10.1 Topic modeling with latent Dirichlet allocation 349
      Exercise 10.1: Question 1 solution 349 ■ Exercise 10.1: Question 2 solution 351 ■ Estimating parameters for the LDA 352 ■ LDA as a generative model 356
  10.2 Implementation of the topic modeling algorithm 360
      Loading the data 361 ■ Preprocessing the data 363 ■ Applying the LDA model 371 ■ Exploring the results 375

11 Named-entity recognition 384
  11.1 Named entity recognition: Definitions and challenges 388
      Named entity types 388 ■ Challenges in named entity recognition 390
  11.2 Named-entity recognition as a sequence labeling task 392
      The basics: BIO scheme 393 ■ What does it mean for a task to be sequential? 395 ■ Sequential solution for NER 397
  11.3 Practical applications of NER 403
      Data loading and exploration 403 ■ Named entity types exploration with spaCy 406 ■ Information extraction revisited 410 ■ Named entities visualization 416

appendix Installation instructions 422
index 423
preface

Thank you for choosing Getting Started with Natural Language Processing. I am very excited that you decided to learn about natural language processing (NLP) with the help of this book, and I hope that you’ll enjoy getting started with NLP following this material and the examples.

Natural language processing addresses various types of tasks related to language and processing of information expressed in human language. The field and techniques have been around for quite a long time, and they are well integrated into our everyday lives; in fact, you are probably benefiting from NLP on a daily basis without realizing it. Therefore, I can’t really overemphasize the importance and the impact that this technology has on our lives. The first chapter of this book will give you an overview of the wide scope of NLP applications that you might be using regularly—from internet search engines to spam filters to predictive keyboards (and many more!)—and the rest of the book will help you to implement many of these applications from scratch yourself.

In recent years, the field has been gaining more and more interest and attention. There are several reasons for this: on the one hand, thanks to the internet, we now have access to increasingly larger amounts of data. On the other hand, thanks to the recent developments in computer hardware and software, we have more powerful technology to process this data. The recent advances in machine learning and deep learning have also contributed to the increasing importance of NLP. These days, large tech companies are realizing the potential of using NLP, and businesses in legal tech, finance, insurance, health care, and many other sectors are investing in it. The reason
for that is clear—language is the primary means of communication in all spheres of life, so being able to efficiently process the information expressed in the form of human language is always an advantage.

This makes a book on NLP very timely. My goal with this book is to introduce you to a wide variety of topics related to natural language and its processing, and to show how and why these things matter in practical applications—be that your own small project or a company-level project that could benefit from extracting and using information from texts.

I have been working in NLP for over a decade now, and before switching to NLP, I primarily focused on linguistics and theoretical studies of language. Looking back, what motivated and excited me the most about turning to the more technical field of NLP were the incredible new opportunities opened up to me by technology and the ease of working with data and getting the information you need from texts, whether in the context of academic studies about the language itself or in the context of practical applications in any other domain. This book aims to produce the same effect. It is highly practice oriented, and each language-related concept, each technique, and each task is explained with the help of real-life examples.
acknowledgments

Writing a book is a long process that takes a lot of time and effort. I truly enjoyed working on this book, and I sincerely hope that you will enjoy reading it, too. Nevertheless, it would be impossible to enjoy this process, or even to finish the book, were it not for the tremendous support, inspiration, and encouragement provided to me by my family, my partner Ted, and my dear friends Eugene, Alex, and Natalia. Thank you for believing in me!

I am also extremely grateful to the Manning team and all the people at Manning who took time to review my book with such care and who gave me valuable feedback along the way. I’d like to acknowledge my development editor, Dustin Archibald, who was always there for me with his patience and support, especially when I needed those the most. I am also grateful to Michael Lund, my technical development editor, and Al Krinker, my technical proofreader, for carefully checking the content and the code for this book and providing me with valuable feedback. I would also like to extend my gratitude to Kathleen Rossland, my production editor; Carrie Andrews, my copyeditor; and Susan Honeywell and Azra Dedic, members of the graphics editing team, whose valuable help at the final stages of editing of this book improved it tremendously. Thanks as well to the rest of the Manning team who worked on the production and promotion of this book.

I would also like to thank all the reviewers who took the time out of their busy schedules to read my manuscript at various stages of its development. Thanks to their invaluable feedback and advice, this book kept improving from earlier stages until it went into production. I would like to acknowledge Alessandro Buggin, Cage Slagel,
Christian Bridge-Harrington, Christian Thoudahl, Douglas Sparling, Elmer C. Peramo, Erik Hansson, Francisco Rivas, Ian D. Miller, James Richard Woodruff, Jason Hales, Jérôme Baton, Jonathan Wood, Joseph Perenia, Kelly Hair, Lewis Van Winkle, Luis Fernando Fontoura de Oliveira, Monica Guimaraes, Najeeb Arif, Patrick Regan, Rees Morrison, Robert Diana, Samantha Berk, Sumit K. Singh, Tanya Wilke, Walter Alexander Mata López, and Werner Nindl.
about this book

The primary goal that I have for this book is to help you appreciate how truly exciting the field of NLP is, how limitless the possibilities of working in this area are, and how low the barrier to entry is now. My goal is to help you get started in this field easily and to show what a wide range of different applications you can implement yourself within a matter of days even if you have never worked in this field before. This book can be used both as a comprehensive cover-to-cover guide through a range of practical applications and as a reference book if you are interested in only some of the practical tasks.

By the time you finish reading this book, you will have acquired

■ Knowledge about the essential NLP tasks and the ability to recognize any particular task when you encounter it in a real-life scenario. We will cover such popular tasks as sentiment analysis, text classification, information search, and many more.
■ A whole arsenal of NLP algorithms and techniques, including stemming, lemmatization, part-of-speech tagging, and many more. You will learn how to apply a range of practical approaches to text, such as vectorization, feature extraction, and supervised and unsupervised machine learning, among others.
■ An ability to structure an NLP project and an understanding of which steps need to be involved in a practical project.
■ Comprehensive knowledge of the key NLP, as well as machine-learning, terminology.
■ Comprehensive knowledge of the available resources and tools for NLP.
Who should read this book

I have written this book to be accessible to software developers and beginners in data science and machine learning. If you have done some programming in Python before and are familiar with high school math and algebra (e.g., matrices, vectors, and basic operations involving them), you should be good to go! Most importantly, the book does not assume any prior knowledge of linguistics or NLP, as it will help you learn what you need along the way.

How this book is organized: A road map

The first two chapters of this book introduce you to the field of natural language processing and the variety of NLP applications available. They also show you how to build your own small application with a minimal amount of specialized knowledge and skills in NLP. If you are interested in having a quick start in the field, I would recommend reading these two chapters. Each subsequent chapter looks more closely into a specific NLP application, so if you are interested in any such specific application, you can just focus on a particular chapter. For a comprehensive overview of the field, techniques, and applications, I would suggest reading the book cover to cover:

■ Chapter 1—Introduces the field of NLP with its various tasks and applications. It also briefly overviews the history of the field and shows how NLP applications are used in our everyday lives.
■ Chapter 2—Explains how you can build your own practical NLP application (spam filtering) from scratch, walking you through all the essential steps in the application pipeline. While doing so, it introduces a number of fundamental NLP techniques, including tokenization and text normalization, and shows how to use them in practice via a popular NLP toolkit called NLTK.
■ Chapter 3—Focuses on the task of information retrieval. It introduces several key NLP techniques, such as stemming and stopword removal, and shows how you can implement your own information-retrieval algorithm.
It also explains how such an algorithm can be evaluated.
■ Chapter 4—Looks into information extraction and introduces further fundamental techniques, such as part-of-speech tagging, lemmatization, and dependency parsing. Moreover, it shows how to build an information-extraction application using another popular NLP toolkit called spaCy.
■ Chapter 5—Shows how to implement your own author (or user) profiling algorithm, providing you with further examples and practice in NLTK and spaCy. Moreover, it presents the task as a text classification problem and shows how to implement a machine-learning classifier using a popular machine-learning library called scikit-learn.
■ Chapter 6—Follows up on the topic of author (user) profiling started in chapter 5. It investigates closely the task of linguistic feature engineering, which is an essential step in any NLP project. It shows how to perform linguistic feature