NATURAL LANGUAGE PROCESSING WITH PYTHON AND SPACY
A Practical Introduction
by Yuli Vasiliev
San Francisco
NATURAL LANGUAGE PROCESSING WITH PYTHON AND SPACY. Copyright © 2020 by Yuli Vasiliev.

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

ISBN-10: 1-7185-0052-1
ISBN-13: 978-1-7185-0052-5

Publisher: William Pollock
Production Editors: Kassie Andreadis and Laurel Chun
Cover Illustration: Gina Redman
Photography: Igor Shabalin
Developmental Editor: Frances Saux
Technical Reviewers: Ivan Brigida and Geoff Bacon
Copyeditor: Anne Marie Walker
Compositor: Happenstance Type-O-Rama
Proofreader: James Fraleigh
Indexer: Beth Nauman-Montana

For information on distribution, translations, or bulk sales, please contact No Starch Press, Inc. directly:
No Starch Press, Inc.
245 8th Street, San Francisco, CA 94103
phone: 1.415.863.9900; info@nostarch.com
www.nostarch.com

A catalog record of this book is available from the Library of Congress.

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
About the Author

Yuli Vasiliev is a programmer, freelance writer, and consultant specializing in open source development, Oracle database technologies, and natural language processing (NLP). Currently, he works as a consultant for the bot project Porphyry. The bot implements NLP techniques to give meaningful responses to user questions. A demo can be accessed at @Porphyry_bot in Telegram.
About the Technical Reviewer

Ivan Brigida was born and raised in Krasnodar, Russia. He holds a Computer Science degree from Moscow State University and an MA in Economics from the New Economic School. He worked for several years as a financial analyst, and later moved to Google to become a digital advertising analyst. Currently, he is doing BI analytics and developing machine learning models for the Online Partnerships Group at Google, specializing in mobile app monetization.
BRIEF CONTENTS

Introduction
Chapter 1: How Natural Language Processing Works
Chapter 2: The Text-Processing Pipeline
Chapter 3: Working with Container Objects and Customizing spaCy
Chapter 4: Extracting and Using Linguistic Features
Chapter 5: Working with Word Vectors
Chapter 6: Finding Patterns and Walking Dependency Trees
Chapter 7: Visualizations
Chapter 8: Intent Recognition
Chapter 9: Storing User Input in a Database
Chapter 10: Training Models
Chapter 11: Deploying Your Own Chatbot
Chapter 12: Implementing Web Data and Processing Images
Appendix: Linguistic Primer
Index
CONTENTS IN DETAIL

INTRODUCTION
Using Python for Natural Language Processing
The spaCy Library
Who Should Read This Book?
What’s in the Book?

1 HOW NATURAL LANGUAGE PROCESSING WORKS
How Can Computers Understand Language?
Mapping Words and Numbers with Word Embedding
Using Machine Learning for Natural Language Processing
Why Use Machine Learning for Natural Language Processing?
What Is a Statistical Model in NLP?
Neural Network Models
Convolutional Neural Networks for NLP
What Is Still on You
Keywords
Context
Meaning Transition
Summary

2 THE TEXT-PROCESSING PIPELINE
Setting Up Your Working Environment
Installing Statistical Models for spaCy
Basic NLP Operations with spaCy
Tokenization
Lemmatization
Applying Lemmatization for Meaning Recognition
Part-of-Speech Tagging
Using Part-of-Speech Tags to Find Relevant Verbs
Context Is Important
Syntactic Relations
Try This
Named Entity Recognition
Summary

3 WORKING WITH CONTAINER OBJECTS AND CUSTOMIZING SPACY
spaCy’s Container Objects
Getting the Index of a Token in a Doc Object
Iterating over a Token’s Syntactic Children
The doc.sents Container
The doc.noun_chunks Container
Try This
The Span Object
Try This
Customizing the Text-Processing Pipeline
Disabling Pipeline Components
Loading a Model Step by Step
Customizing the Pipeline Components
Using spaCy’s C-Level Data Structures
How It Works
Preparing Your Working Environment and Getting Text Files
Your Cython Script
Building a Cython Module
Testing the Module
Summary

4 EXTRACTING AND USING LINGUISTIC FEATURES
Extracting and Generating Text with Part-of-Speech Tags
Numeric, Symbolic, and Punctuation Tags
Extracting Descriptions of Money
Try This
Turning Statements into Questions
Try This
Using Syntactic Dependency Labels in Text Processing
Distinguishing Subjects from Objects
Deciding What Question a Chatbot Should Ask
Try This
Summary

5 WORKING WITH WORD VECTORS
Understanding Word Vectors
Defining Meaning with Coordinates
Using Dimensions to Represent Meaning
The Similarity Method
Choosing Keywords for Semantic Similarity Calculations
Installing Word Vectors
Taking Advantage of Word Vectors That Come with spaCy Models
Using Third-Party Word Vectors
Comparing spaCy Objects
Using Semantic Similarity for Categorization Tasks
Extracting Nouns as a Preprocessing Step
Try This
Extracting and Comparing Named Entities
Summary

6 FINDING PATTERNS AND WALKING DEPENDENCY TREES
Word Sequence Patterns
Finding Patterns Based on Linguistic Features
Try This
Checking an Utterance for a Pattern
Using spaCy’s Matcher to Find Word Sequence Patterns
Applying Several Patterns
Creating Patterns Based on Customized Features
Choosing Which Patterns to Apply
Using Word Sequence Patterns in Chatbots to Generate Statements
Try This
Extracting Keywords from Syntactic Dependency Trees
Walking a Dependency Tree for Information Extraction
Iterating over the Heads of Tokens
Condensing a Text Using Dependency Trees
Try This
Using Context to Improve the Ticket-Booking Chatbot
Making a Smarter Chatbot by Finding Proper Modifiers
Summary

7 VISUALIZATIONS
Getting Started with spaCy’s Built-In Visualizers
displaCy Dependency Visualizer
displaCy Named Entity Visualizer
Visualizing from Within spaCy
Visualizing Dependency Parsing
Try This
Sentence-by-Sentence Visualizations
Customizing Your Visualizations with the Options Argument
Using Dependency Visualizer Options
Try This
Using Named Entity Visualizer Options
Exporting a Visualization to a File
Using displaCy to Manually Render Data
Formatting the Data
Try This
Summary

8 INTENT RECOGNITION
Extracting the Transitive Verb and Direct Object for Intent Recognition
Obtaining the Transitive Verb/Direct Object Pair
Extracting Multiple Intents with token.conjuncts
Try This
Using Word Lists to Extract the Intent
Finding the Meanings of Words Using Synonyms and Semantic Similarity
Recognizing Synonyms Using Predefined Lists
Try This
Recognizing Implied Intents Using Semantic Similarity
Try This
Extracting Intent from a Sequence of Sentences
Walking the Dependency Structures of a Discourse
Replacing Proforms with Their Antecedents
Try This
Summary

9 STORING USER INPUT IN A DATABASE
Converting Unstructured Data into Structured Data
Extracting Data into Interchange Formats
Moving Application Logic to the Database
Building a Database-Powered Chatbot
Gathering the Data and Building a JSON Object
Converting Number Words to Numbers
Preparing Your Database Environment
Sending Data to the Underlying Database
When a User’s Request Doesn’t Contain Enough Information
Try This
Summary

10 TRAINING MODELS
Training a Model’s Pipeline Component
Training the Entity Recognizer
Deciding Whether You Need to Train the Entity Recognizer
Creating Training Examples
Automating the Example Creation Process
Disabling the Other Pipeline Components
The Training Process
Evaluating the Updated Recognizer
Creating a New Dependency Parser
Custom Syntactic Parsing to Understand User Input
Deciding on Types of Semantic Relations to Use
Creating Training Examples
Training the Parser
Testing Your Custom Parser
Try This
Summary

11 DEPLOYING YOUR OWN CHATBOT
How Implementing and Deploying a Chatbot Works
Using Telegram as a Platform for Your Bot
Creating a Telegram Account and Authorizing Your Bot
Getting Started with the python-telegram-bot Library
Using the telegram.ext Objects
Creating a Telegram Chatbot That Uses spaCy
Expanding the Chatbot
Holding the State of the Current Chat
Putting All the Pieces Together
Try This
Summary

12 IMPLEMENTING WEB DATA AND PROCESSING IMAGES
How It Works
Making Your Bot Find Answers to Questions from Wikipedia
Determining What the Question Is About
Try This
Using Wikipedia to Answer User Questions
Try This
Reacting to Images Sent in a Chat
Generating Descriptive Tags for Images Using Clarifai
Using Tags to Generate Text Responses to Images
Putting All the Pieces Together in a Telegram Bot
Importing the Libraries
Writing the Helper Functions
Writing the Callback and main() Functions
Testing the Bot
Try This
Summary

LINGUISTIC PRIMER
Dependency Grammars vs. Phrase Structure Grammars
Common Grammar Concepts
Transitive Verbs and Direct Objects
Prepositional Objects
Modal Auxiliary Verbs
INTRODUCTION

Increasingly, when you call the bank or your internet provider, you might hear something like the following on the other end of the line: “Hello, I am your digital assistant. Please ask your question.” Today, robots can talk to humans using natural language, and they’re getting smarter. Even so, very few people understand how these robots work or how they might use these technologies in their own projects.

Natural language processing (NLP), a branch of artificial intelligence that helps machines understand and respond to human language, is the key technology at the heart of any digital assistant product.

This book arms you with the skills you need to start creating your own NLP applications. By the end of this book, you’ll know how to apply NLP approaches to real-world problems, such as analyzing sentences, capturing the meaning of a text, composing original texts, and even building your own chatbot.

USING PYTHON FOR NATURAL LANGUAGE PROCESSING

If you want to develop an NLP application, you can choose among a wide range of tools and technologies. All the examples in this book are implemented with Python code that employs the spaCy NLP library for Python. Here are some compelling reasons why you might want to choose Python and spaCy for your NLP development.

Python is a high-level programming language known for the following features:

Simplicity
If you’re new to programming, Python is a good language with which to start, because it’s easy to learn. Due to its simplicity, Python allows you to write code that others can easily understand. For example, Python’s simplicity helps chatbot developers collaborate with linguists who don’t have much programming experience.

Prevalence
Python is one of the most popular languages. The vast majority of widely used APIs have Python wrappers that you can install using the pip installation tool. Being able to install these wrappers with pip simplifies the process of obtaining third-party tools you might want to use in your NLP applications.

Significant presence in the AI ecosystem
Many Python libraries are available in the AI ecosystem. This availability simplifies the development of your NLP applications, allowing you to choose among a range of libraries to best solve a particular task.

THE SPACY LIBRARY

This book uses spaCy, a popular Python library that contains the linguistic data and algorithms you’ll need to process natural language texts. As you’ll learn in this book, spaCy is easy to use because it provides container objects that represent elements of natural language texts, such as sentences and words. These objects, in turn, have attributes that represent linguistic features, like parts of speech. At the time of this writing, spaCy offered pretrained models for English, German, Greek, Spanish, French, Italian, Lithuanian, Norwegian Bokmål, Dutch, Portuguese, and multiple languages combined.
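The container objects mentioned above can be illustrated with a minimal sketch. It assumes spaCy is already installed (for example, with pip) and uses a blank English pipeline, which contains only a tokenizer, so no pretrained model has to be downloaded. The example sentence and the variable names are illustrative only. Linguistic features such as parts of speech require a pretrained model, so this sketch sticks to attributes that the tokenizer alone can supply.

```python
import spacy

# A blank English pipeline has only a tokenizer; no pretrained model is needed.
nlp = spacy.blank("en")

# Calling the pipeline on a string returns a Doc, the container for the whole text.
doc = nlp("I want a green apple.")

# Iterating over a Doc yields Token objects, one per word or punctuation mark.
tokens = [token.text for token in doc]
print(tokens)

# Each Token exposes attributes describing it: its index in the Doc,
# its text, and simple lexical flags.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)

# Slicing a Doc returns a Span, a container for a run of adjacent tokens.
phrase = doc[3:5]
print(phrase.text)
```

With a pretrained model loaded instead of the blank pipeline, the same Token objects would also carry predicted features such as part-of-speech tags.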
In addition, spaCy offers built-in visualizers that you can invoke programmatically to generate a graphic of the syntactic structure of a sentence or named entities in a document.

The spaCy library also natively supports advanced NLP features that other popular NLP libraries for Python don’t. For example, spaCy natively supports word vectors (discussed in detail in Chapter 5), unlike the Natural Language Toolkit (NLTK). When using the latter, you would need to use a third-party tool like Gensim, a Python library that includes an implementation of the word2vec algorithm.

With spaCy, you can customize existing models or individual model components, and you can train your own models from scratch to meet your application’s requirements (you’ll learn how to do this in Chapter 10). You can also connect statistical models trained with other popular machine learning (ML) libraries, such as TensorFlow, Keras, scikit-learn, and PyTorch. In addition, spaCy can operate seamlessly with other libraries in Python’s AI ecosystem, allowing you to, for example, take advantage of computer vision in your chatbot application, as you’ll do in Chapter 12.

WHO SHOULD READ THIS BOOK?

This book is for those interested in learning how to use NLP in practice. In particular, it might interest people who want to develop chatbots for businesses or just for fun. Regardless of your background or experience with NLP or programming, you’ll be able to follow the code examples provided in this book because they all include detailed explanations of the process involved.

Some working knowledge of Python will be helpful, because the book doesn’t cover the basics of Python syntax. Also, the examples assume a school-level understanding of English grammar and syntax. The Appendix is a reference for some of the less well-known linguistic concepts. If you have a good