Authors: Stefania Cristina, Mehreen Saeed

If you have been around long enough, you will have noticed that search engines understand human language much better than they did a few years ago. The game changer was the attention mechanism. It is not an easy topic to explain, and it is a pity that some people treat it as secret magic. If we know more about attention and understand the problem it solves, we can decide whether it fits our project and be more comfortable using it. If you are interested in natural language processing and want to tap into the most advanced techniques in deep learning for NLP, this new Ebook, written in the friendly Machine Learning Mastery style that you're used to, is all you need. Using clear explanations and step-by-step tutorial lessons, you will learn how attention gets the job done and why we build transformer models to tackle sequence data. You will also create your own transformer model that translates sentences from one language to another.

Publisher: Independently Published
Publication Year: 2022
Language: English
File Format: PDF
File Size: 7.4 MB
Disclaimer

The information contained within this eBook is strictly for educational purposes. If you wish to apply ideas contained in this eBook, you are taking full responsibility for your actions.

The author has made every effort to ensure the accuracy of the information within this book was correct at time of publication. The author does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause.

No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or mechanical, recording or by any information storage and retrieval system, without written permission from the author.

Credits
Authors: Stefania Cristina and Mehreen Saeed
Lead Editor: Adrian Tam
Technical Reviewers: Darci Heikkinen, Devansh Sethi, and Jerry Yiu

Copyright
Building Transformer Models with Attention © 2022 MachineLearningMastery.com. All Rights Reserved.
Edition: v1.00
Brief Contents

Part I: Foundations of Attention 1
1. What Is Attention? 2
2. A Bird's Eye View of Research on Attention 7
3. A Tour of Attention-Based Architectures 15
4. The Bahdanau Attention Mechanism 23
5. The Luong Attention Mechanism 28

Part II: From Recurrent Neural Networks to Transformer 33
6. An Introduction to Recurrent Neural Networks 34
7. Understanding Simple Recurrent Neural Networks in Keras 40
8. The Attention Mechanism from Scratch 49
9. Adding a Custom Attention Layer to Recurrent Neural Network in Keras 55
10. The Transformer Attention Mechanism 66
11. The Transformer Model 72
12. The Vision Transformer Model 79

Part III: Building a Transformer from Scratch 86
13. Positional Encoding in Transformer Models 87
14. Transformer Positional Encoding Layer in Keras 94
15. Implementing Scaled Dot-Product Attention in Keras 104
16. Implementing Multi-Head Attention in Keras 111
17. Implementing the Transformer Encoder in Keras 121
18. Implementing the Transformer Decoder in Keras 131
19. Joining the Transformer Encoder and Decoder with Masking 140
20. Training the Transformer Model 151
21. Plotting the Training and Validation Loss Curves for the Transformer Model 165
22. Inference with the Transformer Model 176

Part IV: Application 184
23. A Brief Introduction to BERT 185
Contents

Preface ix
Introduction x

Part I: Foundations of Attention 1

1. What Is Attention? 2
  Attention 2
  Attention in Machine Learning 4
  Further Reading 6
  Summary 6

2. A Bird's Eye View of Research on Attention 7
  The Concept of Attention 7
  Attention in Machine Learning 8
  Further Reading 13
  Summary 14

3. A Tour of Attention-Based Architectures 15
  The Encoder-Decoder Architecture 15
  The Transformer 17
  Graph Neural Networks 19
  Memory-Augmented Neural Networks 20
  Further Reading 21
  Summary 22

4. The Bahdanau Attention Mechanism 23
  Introduction to the Bahdanau Attention 23
  The Bahdanau Architecture 24
  Further Reading 27
  Summary 27

5. The Luong Attention Mechanism 28
  Introduction to the Luong Attention 28
  The Luong Attention Algorithm 29
  The Global Attentional Model 29
  The Local Attentional Model 30
  Comparison to the Bahdanau Attention 31
  Further Reading 31
  Summary 32

Part II: From Recurrent Neural Networks to Transformer 33

6. An Introduction to Recurrent Neural Networks 34
  What Is a Recurrent Neural Network 34
  Unfolding a Recurrent Neural Network 35
  Training a Recurrent Neural Network 36
  Types of RNNs 37
  Different RNN Architectures 38
  Further Reading 38
  Summary 39

7. Understanding Simple Recurrent Neural Networks in Keras 40
  Keras SimpleRNN Layer 40
  Running the RNN on Sunspots Dataset 43
  Consolidated Code 46
  Further Reading 47
  Summary 48

8. The Attention Mechanism from Scratch 49
  The Attention Mechanism 49
  The General Attention Mechanism 50
  The General Attention Mechanism with NumPy and SciPy 51
  Further Reading 54
  Summary 54

9. Adding a Custom Attention Layer to Recurrent Neural Network in Keras 55
  Preparing Dataset for Time Series Forecasting 55
  The SimpleRNN Network 56
  Adding a Custom Attention Layer to the Network 60
  Consolidated Code 62
  Further Reading 65
  Summary 65

10. The Transformer Attention Mechanism 66
  Introduction to the Transformer Attention 66
  Scaled Dot-Product Attention 68
  Multi-Head Attention 70
  Further Reading 71
  Summary 71

11. The Transformer Model 72
  The Transformer Architecture 72
  Sum Up: The Transformer Model 76
  Comparison to Recurrent and Convolutional Layers 77
  Further Reading 77
  Summary 78

12. The Vision Transformer Model 79
  Introduction to the Vision Transformer (ViT) 80
  The ViT Architecture 80
  Training the ViT 81
  Inductive Bias in Comparison to Convolutional Neural Networks 82
  Comparative Performance of ViT Variants with ResNets 83
  Internal Representation of Data 83
  Further Reading 85
  Summary 85

Part III: Building a Transformer from Scratch 86

13. Positional Encoding in Transformer Models 87
  What is Positional Encoding? 87
  Positional Encoding Layer in Transformers 88
  Coding the Positional Encoding Matrix from Scratch 89
  Understanding the Positional Encoding Matrix 90
  Further Reading 92
  Summary 92

14. Transformer Positional Encoding Layer in Keras 94
  The Text Vectorization Layer 94
  The Embedding Layer 95
  Subclassing the Keras Embedding Layer 97
  Positional Encoding in Transformers 99
  Visualizing the Final Embedding 100
  Further Reading 102
  Summary 103

15. Implementing Scaled Dot-Product Attention in Keras 104
  Recap of the Transformer Architecture 104
  Implementing the Scaled Dot-Product Attention from Scratch 106
  Testing Out the Code 108
  Further Reading 110
  Summary 110

16. Implementing Multi-Head Attention in Keras 111
  Recap of Multi-Head Attention 111
  Implementing Multi-Head Attention from Scratch 113
  Testing Out the Code 118
  Further Reading 119
  Summary 120

17. Implementing the Transformer Encoder in Keras 121
  Recap of the Transformer Encoder 121
  Implementing the Transformer Encoder from Scratch 122
  Testing Out the Code 127
  Further Reading 129
  Summary 130

18. Implementing the Transformer Decoder in Keras 131
  Recap of the Transformer Decoder 131
  Implementing the Transformer Decoder from Scratch 133
  Testing Out the Code 136
  Further Reading 138
  Summary 139

19. Joining the Transformer Encoder and Decoder with Masking 140
  Recap of the Transformer Architecture 140
  Masking 141
  Joining the Transformer Encoder and Decoder 143
  Creating an Instance of the Transformer Model 145
  Further Reading 149
  Summary 150

20. Training the Transformer Model 151
  Preparing the Training Dataset 151
  Applying a Padding Mask 155
  Training the Transformer Model 156
  Further Reading 163
  Summary 163

21. Plotting the Training and Validation Loss Curves for the Transformer Model 165
  Preparing the Training, Validation, and Testing Splits of the Dataset 166
  Training the Transformer Model 168
  Plotting the Training and Validation Loss Curves 173
  Further Reading 174
  Summary 175

22. Inference with the Transformer Model 176
  Inferencing the Transformer Model 176
  Testing Out the Code 181
  Further Reading 183
  Summary 183

Part IV: Application 184

23. A Brief Introduction to BERT 185
  From Transformer Model to BERT 185
  What Can BERT Do? 187
  Using Pre-Trained BERT Model for Summarization 188
  Using Pre-Trained BERT Model for Question-Answering 189
  Further Reading 190
  Summary 190

Part V: Appendix 191

A. How to Setup Python on Your Workstation 192
  Overview 192
  Download Anaconda 192
  Install Anaconda 194
  Start and Update Anaconda 195
  Install Deep Learning Libraries 198
  Install Visual Studio Code for Python 198
  Further Reading 201
  Summary 201

B. How to Setup Amazon EC2 for Deep Learning on GPUs 202
  Overview 202
  Setup Your AWS Account 203
  Launch Your Server Instance 204
  Login, Configure and Run 206
  Build and Run Models on AWS 209
  Close Your EC2 Instance 209
  Tips and Tricks for Using Keras on AWS 210
  Further Reading 211
  Summary 211

How Far You Have Come 212
Preface

It is not an easy task to ask a computer to understand human language. In recent years, we have seen significant progress thanks to advances in machine learning techniques, in particular attention mechanisms and transformers. Take machine translation as an example. In the past, we would consider it a sequence-to-sequence transformation problem that a recurrent neural network would fit. But instead of a simple linear transformation, an attention mechanism was proven to work better with longer sentences. Later, it was discovered that attention without a recurrent neural network is not only possible, but also better in many situations. This book is a guide to help you fully understand attention and the transformer architecture. We start from first principles: building a transformer model in Keras, from scratch. We hope that by the time you finish the book, you will appreciate the idea of using attention to extract context out of a sequence.
Introduction

Welcome to Building Transformer Models with Attention. A Recurrent Neural Network (RNN) has been considered magical, and someone even called it unreasonably effective.1 However, it is not almighty. In machine translation, we have seen that an RNN can give sensible output, but not always correct and accurate output. The problem is not that the network is too simple or that we didn't train it enough. On the contrary, no matter how hard we train an RNN, there is a ceiling it cannot break through. Researchers noticed that when using an RNN for translation from one language to another, the neural network reads one word at a time but never sees the entire sentence. Therefore, the traditional way of using an RNN means it will lose the context. Attention is a way to mitigate this limitation. But it is not as simple as the linear transformations we usually see in neural networks. Furthermore, researchers found that an RNN is not necessary for attention to work. Attention itself can be extended into a neural network as well. If we do that, we translate words one by one into some encoding, or translate the encoding back into words. This is a transformer. However, building an effective transformer for the translation of human languages is not trivial. Partially this is due to the high dimensionality of languages, i.e., any language has thousands of words and can carry a tremendous amount of information. It is also due to the complex architecture of the transformer. But once this hurdle is overcome, you will find that the capability of a neural network to deal with human language reaches a new stage. For example, BERT is an extension of the transformer encoder. We saw that it can be used to build a named entity recognition (NER) system effectively. Another example is GPT-2, which is an extension of the transformer decoder. We saw that it can be used to build a natural language generator that produces realistic-looking paragraphs.
These two examples are much larger than the original transformer and very slow to train, but they undeniably have their roots in attention and the transformer. This book guides you through creating a transformer, step by step. In doing so, you will learn how to transform a word in a language into an embedding vector, how to implement the attention mechanism, how a transformer is constructed, and eventually, how to use it to perform a language translation task.

1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Book Organization

This book is in four parts:

Part 1: Foundations of Attention

In this part you will learn about the theoretical background of the attention mechanism. In particular, you will see how attention is defined mathematically and the algorithm to compute it. You will also see, from a high level, how attention, a concept for understanding a sequence, is incorporated into a larger neural network architecture, as well as its applications. This part of the book includes the following chapters:

⊲ What Is Attention?
⊲ A Bird's Eye View of Research on Attention
⊲ A Tour of Attention-Based Architectures
⊲ The Bahdanau Attention Mechanism
⊲ The Luong Attention Mechanism

Part 2: From Recurrent Neural Networks to Transformer

Since attention was originally designed for recurrent neural networks, we follow the footsteps of its history, starting from a traditional recurrent neural network and adding an attention layer to it. You may have forgotten how a recurrent neural network is structured, so we start with an introduction to the RNN. This part of the book includes the following chapters:

⊲ An Introduction to Recurrent Neural Networks
⊲ Understanding Simple Recurrent Neural Networks in Keras
⊲ The Attention Mechanism from Scratch
⊲ Adding a Custom Attention Layer to Recurrent Neural Network in Keras
⊲ The Transformer Attention Mechanism
⊲ The Transformer Model
⊲ The Vision Transformer Model

At the end of this part, we introduce how an attention mechanism can stand on its own without an RNN. This is how a transformer is created. While the entire story of attention and the transformer is motivated by applying neural networks to natural language processing tasks, the last chapter of this part gives you an angle from computer vision, to show you that the potential of the transformer is not limited to NLP.

Part 3: Building a Transformer from Scratch

Unlike other parts of this book, you are required to read the chapters of this part in their prescribed sequence.
The ten chapters in this part lead you through building a fully working transformer model from scratch. We start from the first step, namely adding positional encoding to the input sequence, and end with using a trained transformer model for inference. This part of the book includes the following chapters:

⊲ Positional Encoding in Transformer Models
⊲ Transformer Positional Encoding Layer in Keras
⊲ Implementing Scaled Dot-Product Attention in Keras
⊲ Implementing Multi-Head Attention in Keras
⊲ Implementing the Transformer Encoder in Keras
⊲ Implementing the Transformer Decoder in Keras
⊲ Joining the Transformer Encoder and Decoder with Masking
⊲ Training the Transformer Model
⊲ Plotting the Training and Validation Loss Curves for the Transformer Model
⊲ Inference with the Transformer Model

There is a lot to cover in these chapters because of the complexity of the transformer architecture. Be patient, and you will find it is not difficult to create your own transformer out of basic Keras functions.

Part 4: Application

By the end of the last part, you will have created your own transformer, and it should be able to do sentence-to-sentence translation between two languages with reasonable quality. However, the story of the transformer does not stop here. Larger transformer-based architectures have been proposed, with pre-trained model weights made public. We will look into one example and see how we can do some amazing projects with a pre-trained model. There is only one chapter in this part:

⊲ A Brief Introduction to BERT

In this chapter you will learn about BERT, which is an extension of the transformer's encoder, and its simplified model, DistilBERT. You will see how you can do summarization and question-answering with a pre-trained DistilBERT model.

Requirements for This Book

Python and TensorFlow 2.x

This book covers some advanced topics. You do not need to be a Python expert, but you need to know how to install and set up Python and TensorFlow. You need to be able to install libraries if required, and you should be able to navigate the Python development environment comfortably.
You may set up your environment on your workstation or laptop. It can be in a VM or a Docker instance that you run, or it may be a server that you can configure in the cloud.
Appendix A and Appendix B of this book give you step-by-step guidance on how to set up a Python environment on your own computer and on the AWS cloud, respectively.

Machine Learning

You do not need to be a machine learning expert, but it would be helpful if you know how to solve a small machine learning problem, especially a natural language processing task. Basic concepts like cross-validation are described briefly, and you are expected to know how to train and use a neural network in TensorFlow and Keras. You may learn about these in another book, Deep Learning with Python. Training a transformer model can take a long time. It is possible to train it using a CPU, but a GPU can speed it up significantly. You can access GPU hardware easily and cheaply in the cloud, and a step-by-step procedure for doing so is given in Appendix B.

Your Outcomes from Reading This Book

This book is a guidebook to help you learn the internals of a transformer model and the attention mechanism. Upon finishing the book, you should be able to explain clearly why attention works and how a transformer can handle a sequence such as a paragraph of words. Specifically, you will know:

⊲ What attention is, especially the Bahdanau attention and Luong attention
⊲ What multi-head attention is and how it is used in transformer models
⊲ How to build the encoder and decoder of a transformer
⊲ How to combine the encoder and decoder to create a fully working transformer, and how to train it
⊲ How to use a transformer for real-world tasks

From here you can go deeper and investigate other transformer models, for natural language or for computer vision. You will also understand how a transformer works and what it does. Therefore, you can download some pre-trained models and use them for various tasks. To get the very most from this book, we recommend following each chapter and building upon them. Attempt to improve the results or the implementation.
Write up what you tried or learned and share it on your blog or social media, or send us an email at jason@MachineLearningMastery.com.

Summary

This book is a bit different from our other books from MachineLearningMastery.com in the sense that there are not a lot of small projects to work with. Instead, this entire book is one big project: to build a transformer model and apply it to NLP. A big project has many small components, and by doing this project, you will learn a lot of ideas. We hope this will be eye-opening for you and bring you to a different level of deep learning. We are excited for you. Take your time, have fun, and we look forward to seeing where you can take this amazing new technology.

Next

Let's dive in. Next up is Part I, where you will learn the foundations of attention.
Part I: Foundations of Attention
Chapter 1: What Is Attention?

Attention is becoming increasingly popular in machine learning, but what makes it such an attractive concept? What is the relationship between attention applied in artificial neural networks and its biological counterpart? What components would one expect to form an attention-based system in machine learning? In this chapter, you will discover an overview of attention and its application in machine learning. After completing this chapter, you will know:

⊲ A brief overview of how attention can manifest itself in the human brain
⊲ The components that make up an attention-based system and how these are inspired by biological attention

Let's get started.

Overview

This chapter is divided into two parts; they are:

⊲ Attention
⊲ Attention in Machine Learning

1.1 Attention

Attention is a widely investigated concept that has often been studied in conjunction with arousal, alertness, and engagement with one's surroundings.

“ In its most generic form, attention could be described as merely an overall level of alertness or ability to engage with surroundings. ” — “Attention in Psychology, Neuroscience, and Machine Learning”, 2020

Visual attention is one of the areas most often studied from both the neuroscientific and psychological perspectives. When a subject is presented with different images, the eye movements that the subject performs can reveal the salient image parts that the subject's attention is most attracted to. In their review of computational models for visual attention,
Itti et al. (2001) mention that such salient image parts are often characterized by visual attributes, including intensity contrast, oriented edges, corners and junctions, and motion. The human brain attends to these salient visual features at different neuronal stages.

“ Neurons at the earliest stages are tuned to simple visual attributes such as intensity contrast, colour opponency, orientation, direction and velocity of motion, or stereo disparity at several spatial scales. Neuronal tuning becomes increasingly more specialized with the progression from low-level to high-level visual areas, such that higher-level visual areas include neurons that respond only to corners or junctions, shape-from-shading cues or views of specific real-world objects. ” — “Computational Modelling of Visual Attention”, 2001

Interestingly, research has also observed that different subjects tend to be attracted to the same salient visual cues.

Figure 1.1: Visual attention is to find the salient image parts. From “Computational Modelling of Visual Attention”

Research has also discovered several forms of interaction between memory and attention. Since the human brain has a limited memory capacity, selecting which information to store becomes crucial in making the best use of the limited resources. The human brain does so by relying on attention, such that it dynamically stores in memory the information that the human subject pays the most attention to.
1.2 Attention in Machine Learning

Implementing the attention mechanism in artificial neural networks does not necessarily track the biological and psychological mechanisms of the human brain. Instead, what makes attention such an attractive concept in machine learning is the ability to dynamically highlight and use the salient parts of the information at hand, in a similar manner as the human brain does. Think of an attention-based system as consisting of three components:

“ 1. A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position.

2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.

3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight). ” — Page 491, Deep Learning, 2016

Let's take the encoder-decoder framework as an example, since it is within such a framework that the attention mechanism was first introduced. If we are processing an input sequence of words, this will first be fed into an encoder, which will output a vector for every element in the sequence. This corresponds to the first component of our attention-based system, as explained above. A list of these vectors (the second component of the attention-based system above), together with the decoder's previous hidden states, will be exploited by the attention mechanism to dynamically highlight which of the input information will be used to generate the output.
Figure 1.2: A sequence of words fed into the encoder, which outputs a vector for every element in the sequence. From “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”

At each time step, the attention mechanism takes the previous hidden state of the decoder and the list of encoded vectors, using them to generate unnormalized score values that indicate how well the elements of the input sequence align with the current output. Since the generated score values need to make relative sense in terms of their importance, they are normalized by passing them through a softmax function to generate the weights. Following the softmax normalization, all the weight values lie in the interval [0, 1] and add up to 1, meaning they can be interpreted as probabilities. Finally, the encoded vectors are scaled by the computed weights to generate a context vector. This attention process forms the third component of the attention-based system above. It is this context vector that is then fed into the decoder to generate a translated output.

“ This type of artificial attention is thus a form of iterative re-weighting. Specifically, it dynamically highlights different components of a pre-processed input as they are needed for output generation. This makes it flexible and context dependent, like biological attention. ” — “Attention in Psychology, Neuroscience, and Machine Learning”, 2020

The process implemented by a system that incorporates an attention mechanism contrasts with one that does not. In the latter, the encoder would generate a fixed-length vector irrespective of the input's length or complexity. In the absence of a mechanism that highlights the salient information across the entirety of the input, the decoder would only have access to the limited information encoded within the fixed-length vector. This could result in the decoder missing important information.
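The score-softmax-context computation just described can be sketched in a few lines of NumPy. This is only an illustrative toy, not the book's implementation (which is developed in Part III): the encoder states, the decoder state, and the choice of a plain dot product as the scoring function are all made-up values for demonstration.

```python
import numpy as np

# Toy values for illustration: 4 encoder output vectors, each 3-dimensional.
encoder_states = np.array([
    [1.0, 0.0, 0.5],
    [0.2, 1.0, 0.1],
    [0.9, 0.3, 0.0],
    [0.0, 0.8, 1.0],
])

# A made-up decoder hidden state from the previous time step.
decoder_state = np.array([0.5, 0.9, 0.2])

# 1. Unnormalized alignment scores; a dot product is one simple choice of
#    scoring function (Bahdanau's additive score is another, seen later).
scores = encoder_states @ decoder_state          # shape (4,)

# 2. Softmax turns the scores into weights in [0, 1] that sum to 1.
#    Subtracting the max first is a standard numerical-stability trick.
exp_scores = np.exp(scores - scores.max())
weights = exp_scores / exp_scores.sum()          # shape (4,)

# 3. The context vector is the weighted sum of the encoder states.
context = weights @ encoder_states               # shape (3,)

print("weights:", weights)
print("context:", context)
```

Because the weights sum to 1, the context vector is a convex combination of the encoder states, leaning toward whichever input positions scored highest against the current decoder state.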
The attention mechanism was initially proposed to process sequences of words in machine translation, which have an implied temporal aspect to them. However, it can be generalized to process information that is static and not necessarily sequential, such as in the context of image processing. You will see how this generalization can be achieved in other chapters.