Statistics: 58 views · 0 downloads · 0 donations
Uploader: 高宏飞
Shared on 2026-01-17

Author: Nicole Koenigstein

Take a deep dive into transformers and large language models: the foundations of generative AI!

Generative AI has set up shop in almost every aspect of business and society. Transformers and large language models (LLMs) now power everything from code-creation tools like Copilot and Cursor to AI agents, live language translators, smart chatbots, text generators, and much more. In Transformers in Action you'll discover:

• How transformers and LLMs work under the hood
• Adapting AI models to new tasks
• Optimizing LLM performance
• Text generation with reinforcement learning
• Multimodal AI models
• Encoder-only, decoder-only, encoder-decoder, and small language models

This practical book gives you the background, mental models, and practical skills you need to put generative AI to work.

What is a transformer?
A "transformer" is a neural network model that finds relationships in sequences of words or other data using a mathematical technique called attention. Because the attention mechanism lets transformers focus on the most relevant parts of a sequence, they can learn context and meaning from even very large bodies of text. LLMs like GPT, Gemini, and Claude are transformer-based models trained on massive datasets, which gives them the uncanny ability to generate natural, coherent responses across a wide range of knowledge domains.

About the book
Transformers in Action guides you through the design and operation of transformers and transformer-based models. You'll dive immediately into LLM architecture, with even the most complex concepts explained clearly through easy-to-understand examples and clever analogies. Because transformers are grounded in mathematics, author Nicole Koenigstein carefully guides you through the foundational formulas and concepts one step at a time. You'll also appreciate the extensive code repository that lets you start exploring LLMs hands-on right away.
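To make the attention idea concrete, here is a minimal, illustrative sketch of scaled dot-product attention in plain NumPy. It is not code from the book's repository; the shapes and random values are placeholders chosen only to show the computation the description refers to (dot products between queries and keys, softmax, weighted sum of values).

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how relevant each token is to every other token
    weights = softmax(scores, axis=-1)  # each row becomes a probability distribution
    return weights @ V                  # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional query vectors (illustrative sizes)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # -> (4, 8)

Stacking many such attention heads together with feed-forward layers and normalization is essentially what chapter 2 of the book unpacks in detail.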

Tags
No tags
ISBN: 1633437884
Publisher: Manning Publications
Publish Year: 2025
Language: English
Pages: 258
File Format: PDF
File Size: 17.0 MB
Text Preview (First 20 pages)

MANNING
Nicole Koenigstein
Foreword by Luis Serrano
EPILOGUE
Transformer architecture variants and their core capabilities and use cases

Architecture type | Primary capabilities | Typical use cases
Encoder-decoder | Sequence-to-sequence modeling | Machine translation, summarization, question answering
Decoder-only | Autoregressive generation, instruction following | Text generation, code synthesis, chat interfaces
Encoder-only | Representation learning, contextual embeddings | Text classification, semantic search, entity recognition
Embedding models | Learning dense or sparse vector representations | Retrieval-augmented generation, similarity search, clustering
MoE | Sparse expert routing, scalable compute efficiency | Efficient large-scale generation, multitask learning

[Figure: abstracted decoder-only and encoder-only architectures. Each shows a single layer repeated n times, with input, h-head attention (causal attention in the decoder-only case), Add & Norm, a position-wise FFN, and a linear output layer producing y.]
[Figure: The Ecosystem of Large Language Model Space]
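As a quick, hedged illustration of how these families show up in practice (this is not code from the book), the Hugging Face transformers library exposes a separate Auto class for each architecture type; the checkpoint names below are common public models used only as examples.

from transformers import (
    AutoModelForSeq2SeqLM,               # encoder-decoder: translation, summarization
    AutoModelForCausalLM,                # decoder-only: autoregressive generation, chat
    AutoModelForSequenceClassification,  # encoder-only: classification heads
    AutoModel,                           # bare encoder: contextual embeddings for retrieval, clustering
)

encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")
encoder_only    = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")  # adds an untrained classification head
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

Mixture-of-experts checkpoints are typically loaded through AutoModelForCausalLM as well, since the sparse expert routing lives inside the decoder layers.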
Transformers in Action
Transformers in Action
NICOLE KOENIGSTEIN
FOREWORD BY LUIS SERRANO
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2026 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Marina Michaels
Technical editors: David Pacheco Aznar and Mike Erlihson
Review editor: Kishor Rit
Production editor: Keri Hales
Copy editor: Kari Lucke
Proofreader: Olga Milanko
Technical proofreader: Karsten Strøbæk
Typesetter and cover designer: Marija Tudor

ISBN 9781633437883
Printed in the United States of America
brief contents

PART 1  FOUNDATIONS OF MODERN TRANSFORMER MODELS  1
1  The need for transformers  3
2  A deeper look into transformers  11

PART 2  GENERATIVE TRANSFORMERS  35
3  Model families and architecture variants  37
4  Text generation strategies and prompting techniques  56
5  Preference alignment and retrieval-augmented generation  80

PART 3  SPECIALIZED MODELS  103
6  Multimodal models  105
7  Efficient and specialized small language models  129
8  Training and evaluating large language models  155
9  Optimizing and scaling large language models  182
10  Ethical and responsible large language models  203

references  227
index  231
contents

foreword  x
preface  xii
acknowledgments  xiv
about this book  xvi
about the author  xix
about the cover illustration  xx

PART 1  FOUNDATIONS OF MODERN TRANSFORMER MODELS  1

1  The need for transformers  3
   1.1  The transformers breakthrough  4
        Translation before transformers 4 ■ How are transformers different? 5 ■ Unveiling the attention mechanism 6 ■ The power of multihead attention 7
   1.2  How to use transformers  8
   1.3  When and why to use transformers  8
   1.4  From transformer to LLM: The lasting blueprint  9

2  A deeper look into transformers  11
   2.1  From seq-2-seq models to transformers  12
        The difficulty of training RNNs 12 ■ Introducing attention mechanisms 13 ■ Vanishing gradients: Transformer to the rescue 14 ■ Exploding gradients: When large gradients disrupt training 15
   2.2  Model architecture  15
        Encoder and decoder stacks 17 ■ Positional encoding 19 ■ Attention 21 ■ Position-wise FFNs 32

PART 2  GENERATIVE TRANSFORMERS  35

3  Model families and architecture variants  37
   3.1  Decoder-only models  38
   3.2  The decoder-only architecture  38
   3.3  Encoder-only models  43
        Masked language modeling as a pretraining strategy 45
   3.4  Embedding models and RAG  46
        What is an embedding? 47
   3.5  MoE in LLMs  50
        How MoE works 50

4  Text generation strategies and prompting techniques  56
   4.1  Decoding and sampling methods for text generation  57
        Greedy search decoding for text generation 57 ■ Beam search decoding for text generation 59 ■ Top-k sampling for text generation 62 ■ Nucleus sampling for text generation 64 ■ Temperature sampling for text generation 66
   4.2  The art of prompting  68
        Zero-shot prompting 69 ■ One- and few-shot prompting 70 ■ CoT prompting 71 ■ Structured CoT with Instructor 72 ■ Contrastive CoT prompting 73 ■ CoVe prompting 74 ■ ToT prompting 75 ■ ThoT prompting 78

5  Preference alignment and retrieval-augmented generation  80
   5.1  Reinforcement learning from human feedback  81
        From MDP to reinforcement learning 81 ■ Improving models with human feedback and reinforcement learning 83
   5.2  Aligning LLMs with direct preference optimization  85
        The SFT step 86 ■ Training the LLM with DPO 89 ■ Running the inference on the trained LLM 92 ■ Optimized versions for DPO 92 ■ Group Relative Policy Optimization 93
   5.3  MixEval: A benchmark for robust and cost-efficient evaluation  95
   5.4  Retrieval-augmented generation  96
        A first look at RAG 96 ■ Why and when to use RAG 97 ■ Core components and design choices 98

PART 3  SPECIALIZED MODELS  103

6  Multimodal models  105
   6.1  Getting started with multimodal models  106
   6.2  Combining modalities from different domains  107
   6.3  Modality-specific tokenization  108
        Images and visual embeddings 109 ■ Image analysis with an MLLM 111 ■ From image patches to video cubes 113 ■ Video information extraction 114 ■ Audio embeddings 116 ■ Audio-only pipeline: Extraction and inference 117
   6.4  Multimodal RAG: From PDF to images, tables, and cross-model comparison  121

7  Efficient and specialized small language models  129
   7.1  The power of small  130
   7.2  Small models as agents in a system of specialists  131
   7.3  Classification with SLMs  133
        Evaluating classification performance 133 ■ Accuracy and the F1-score 135 ■ Fine-tuning SLMs on the Financial PhraseBank dataset 135
   7.4  Adapting Gemma 3 270M for empathy and prosocial tone  140
   7.5  Adapting Gemma 3 270M for English–Spanish translation  147
   7.6  Broader use cases and complementary models  151

8  Training and evaluating large language models  155
   8.1  Deep dive into hyperparameters  156
        How parameters and hyperparameters factor into gradient descent 156
   8.2  Model tuning and hyperparameter optimization  158
        Tracking experiments 163
   8.3  Parameter-efficient fine-tuning LLMs  167
        Low-rank adaptation 168 ■ Weight-decomposed low-rank adaptation 171 ■ Quantization 172 ■ Efficient fine-tuning of quantized LLMs with QLoRA 175 ■ Quantization-aware low-rank adaptation 178 ■ Low-rank plus quantized matrix decomposition 179 ■ Bringing it all together: Choosing the right PEFT strategy 180

9  Optimizing and scaling large language models  182
   9.1  Model optimization  183
        Model pruning 183 ■ Model distillation 184
   9.2  Sharding for memory optimization  186
   9.3  Inference optimization  188
   9.4  GPU-level optimization: Tiling, threads, and memory  192
        FlashAttention: Tiled attention at scale 195
   9.5  Extending long-context windows  197
        Rotary embeddings and refinements 198 ■ Refinements: YaRN, positional interpolation, and iRoPE 199

10  Ethical and responsible large language models  203
   10.1  Understanding biases in LLMs  204
        Identifying bias 204 ■ Model interpretability and bias in AI 206
   10.2  Transparency and explainability of LLMs  207
        Using Captum to analyze the behavior of generative language models 207 ■ Using local interpretable model-agnostic explanations to explain a model prediction 212
   10.3  Responsible use of LLMs  214
        The foundation model transparency index 216
   10.4  Safeguarding your language model  216
        Jailbreaks and lifecycle vulnerabilities 222 ■ Shielding your model against hazardous abuse 223

references  227
index  231
foreword

Transformers, and the large language models they made possible, sit at the center of modern AI. They mark one of those rare moments when an elegant theoretical idea meets enormous real-world effects. If specialized hardware is the body of modern computation, transformers are the mind. They are the part that learns, reasons, and creates. Almost every major AI breakthrough we see today—from smart code generation to instant translation and conversational assistants—traces back to a single idea: attention, and the incredible parallelism it unlocked. If you work with AI today, fluency in the language of transformers is no longer optional. It is essential.

But keeping up with this field is no small task. Every few weeks, a new architecture, prompting method, or scaling technique seems to appear. Even experts can find it hard to keep track of what really matters. That is why a book like Transformers in Action feels so timely and valuable. It does not just explain how transformers work. It helps you understand them. It builds the kind of intuition that lets you see these models not as mysterious black boxes but as systems you can reason about, adapt, and improve. That is exactly the kind of understanding Nicole Königstein brings to life.

Nicole has a rare combination of deep theoretical knowledge and real-world experience. She has led AI teams, designed quantitative systems in finance, and worked on large-scale deployments where precision and reliability are everything. As a PhD researcher in AI with leadership roles as chief data scientist, head of Quant AI Research, and consultant, she bridges two worlds that do not always meet easily: the clarity of theory and the pragmatism of production. She knows that building successful AI systems means more than mastering the math. It is about balancing innovation with responsibility and technical excellence with good judgment.

The book reflects that balance beautifully. It starts with the foundations, explaining why attention was such a breakthrough and how transformers changed the way we think about sequence modeling. From there, it moves into the generative era, exploring advanced prompting, preference alignment for safety, and techniques like retrieval-augmented generation that keep models grounded in truth. Later chapters take on the challenges of production, from multimodal systems and efficient small language models to optimization methods such as PEFT and LoRA. It all comes together in a final discussion on ethics and responsibility, an essential topic for anyone shaping the future of AI.

Transformers in Action manages something special. It is rigorous but never dry, deep but always clear. Nicole makes complex ideas feel intuitive and gives readers the confidence to move from simply using AI tools to truly building with them. In a field that moves faster than ever, this book is a calm, reliable guide that will leave you not just informed but inspired.

—LUIS SERRANO, PHD
FOUNDER AND CEO OF SERRANO ACADEMY AND AUTHOR OF GROKKING MACHINE LEARNING
preface

When I first started using transformers in 2019, I was immediately hooked. Two years later, I built my own deep learning architecture using attention. That work was later published in a Springer Nature journal, and the experience convinced me that transformers would be transformative, literally speaking. What struck me most was not their complexity but their simplicity. The mechanism that unlocked the transformer revolution is not complex mathematics. It's built on linear algebra fundamentals: multiplying matrices, normalizing with softmax, and combining vectors with weighted sums. It's remarkable that from a foundation of dot products and probabilities we arrived at systems with billions of parameters that can reason across text, images, audio, and video. That's the story of transformers: one elegant mechanism, applied at scale, reshaping the landscape of AI. This book focuses on that story—from the origins of transformers to how we can now use large language models (LLMs) and multimodal systems in practice.

The elegance lies in how those simple steps are arranged and combined. Each token is projected into queries, keys, and values. The model computes dot products between queries and keys to decide relevance, applies softmax to turn those scores into probabilities, and uses them to form weighted sums over the values.

If you think about it, this is not so different from what happens during text generation itself. When a model predicts the next token, it once again applies softmax to produce probabilities and then samples from them to decide what comes next. Both mechanisms rely on basic probability. That's why you don't need to be a mathematician to understand transformers. Their foundations are accessible, and the real wonder comes from how much power emerges from such simple operations.
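As a small, illustrative sketch of that second mechanism (this is not code from the book's notebooks), the snippet below turns a made-up vector of next-token scores into probabilities with softmax and samples from them; the tiny vocabulary and logits are placeholders.

import numpy as np

rng = np.random.default_rng(42)
vocab  = ["the", "cat", "sat", "on", "mat"]      # hypothetical vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])    # hypothetical model scores for the next token

def sample_next_token(logits, temperature=1.0):
    scaled = logits / temperature                # lower temperature -> sharper distribution
    probs = np.exp(scaled - scaled.max())        # softmax, shifted for numerical stability
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)      # sample an index according to the probabilities

print(vocab[sample_next_token(logits, temperature=0.7)])

Greedy decoding would instead always pick np.argmax(logits); chapter 4 compares these decoding strategies in detail.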
The pace of innovation with this architecture is breathtaking. "Attention Is All You Need" in 2017 first applied transformers to translation tasks. BERT showed the power of pretraining and fine-tuning. What started with translation has now scaled into billion-parameter LLMs, with ChatGPT bringing transformers into everyday awareness and models like DeepSeek pushing efficiency and scale to new frontiers. With continuous innovations like FlashAttention, all those matrix multiplications have become faster and more efficient.

So why did I decide to write this book? When I first began studying machine learning and deep learning, most of the books I encountered relied on toy examples. They were fine for illustrating concepts, but those same examples often broke down when applied to real-life data. I wanted to approach this differently, and I wanted to bring my passion for teaching onto paper. To help the next generation of data scientists and machine learning engineers, I build on my knowledge by giving them not only a solid foundation but also the hands-on guidance needed to make transformers work in practice.

Throughout this book, you'll follow both the evolution of transformers and my personal journey with them through LLMs, while building your own path and understanding how to move forward in this field. The book begins with the foundations of attention and then traces how transformers evolved into the generative and multimodal systems we know today. Along the way, it explores efficiency, scaling strategies, and the responsibilities that come with deploying such powerful models.

I hope that as you read through the book, you'll see both the beauty of the underlying simplicity and the extraordinary possibility that grows from it.
acknowledgments

Writing a book or having a career, especially in a field such as AI, is never a solitary endeavor, even when much of the work happens in quiet hours of research, coding, and drafting. I want to take a moment to thank the people who supported me at key points along this journey.

First, I would like to thank Markus Oehmann, who encouraged and supported me at the very beginning, when I first started out in AI. Although our lives ultimately took different paths, his support in those early years gave me the confidence to follow research as my true fulfillment. For that, he deserves his place here.

I am deeply grateful to Prof. Dr. Christoph Denzler, who believed in me early on, bent the rules when needed, and supported my first thesis. That thesis ultimately set me on the path that led to this career. His support provided the foundation for my first published paper and for everything that followed.

I also want to thank Luis Serrano for generously writing the foreword and for being such a remarkable educator whose work continues to make AI accessible and inspiring for a wide audience.

Thanks to all the reviewers: Al Pezewski, Aleksandar Babic, Ali Shakiba, Animikh Aich, Ankit Virmani, Anton Petrov, Anup Parikh, Arturo Geigel, Bruno Couriol, Chunxu Tang, David Curran, Dhirendra Choudhary, Fernando Bayon, George Gaines, Hobson Lane, Jakub Langr, Jakub Morawski, James Liu, Jeremy Chen, Jeremy Zeidner, John Williams, Mark Liu, Martin Hediger, Matthew Sharp, Maureen Metzger, Naveen Achyuta, Olena Sokol, Paul Silisteanu, Philipp Dittrich, Pradeep Saraswati, Priyanka Neelakrishnan, Raj Kumar, Ravesh Sharma, Richard Meinsen, Ross Turner, Rui Liu, Sameet Sonawane, Sidharth Somanathan, Simon Tschöke, Simone De Bonis, Simone Sguazza, Sri Ram Macharla, Subhankar Ray, Sukanya Konatam, Tony Holdroyd, Vahid Mirjalili, Vidhya Vinay, Vinoth Nageshwaran, Vybhavreddy Kammireddy Changalreddy, Walter Alexander Mata López, Wei-Meng Lee. Your suggestions helped make this a better book.

I would like to thank all the staff at Manning who helped me with this book, especially Marina Michaels for her attention to detail during the development process, and the entire behind-the-scenes production team as well. Thanks also to the technical editors, David Pacheco Aznar, computational mathematician and data scientist, and Mike Erlihson, math PhD from the Technion, and to Karsten Strøbæk, technical proofreader at Manning, who reviewed and tested all the code.

Finally, I want to thank all the colleagues, students, and peers who have inspired me through discussions, collaborations, and shared passion for AI. Each of you has contributed, in ways big and small, to shaping the ideas and perspectives reflected in this book.
about this book

Transformers in Action is a comprehensive guide to understanding and applying transformer models in the language and multimodal space. These models are foundational to modern AI systems such as ChatGPT and Gemini. The book aims to provide you with a solid foundation to use these models for your own projects, starting with the core concepts of transformers and then moving to practical and more advanced applications such as multimodal retrieval systems.

You will learn why transformers are designed the way they are and how they work, giving you both the theoretical understanding and the hands-on skills to use them effectively. Along the way, you'll see when to use small language models (SLMs) and when architectural choices such as encoder-only or decoder-only designs make more sense.

Who should read this book
This book is for data scientists and machine learning engineers who want to learn how to build and apply transformer-based models for language and multimodal tasks. The goal is to equip you with the essential knowledge to establish a strong foundation, so you can confidently move on to advanced models and approaches.

How this book is organized: A road map
The book is divided into three parts covering 10 chapters. Part 1 explains the foundations of transformer models:

■ Chapter 1 introduces the need for transformers, explains why earlier sequence models struggled, and shows how the attention mechanism overcomes those limitations.
■ Chapter 2 explores the full architecture, including encoder and decoder stacks, positional encoding, attention layers, and feed-forward networks.

Part 2 covers generative transformers:

■ Chapter 3 surveys major architectural variants, including decoder-only, encoder-only, embedding models, and mixture-of-experts.
■ Chapter 4 presents text generation strategies and prompting techniques, including greedy and beam search, top-k and nucleus sampling, temperature sampling, and prompting patterns ranging from zero-shot to tree-of-thought.
■ Chapter 5 focuses on preference alignment and retrieval-augmented generation (RAG). It introduces reinforcement learning from human feedback, direct preference optimization, and robust evaluation methods, and shows how to build grounded systems with RAG.

Part 3 explores specialized and advanced models:

■ Chapter 6 introduces multimodal models that combine text with images, audio, and video. It explains modality-specific tokenization, visual and audio embeddings, and multimodal RAG for complex documents.
■ Chapter 7 discusses SLMs. You will see how SLMs can act as efficient specialists and walk through case studies on classification, translation, and fine-tuning for empathy and prosocial tone. This chapter also shows how SLMs can serve as agents in larger workflows.
■ Chapter 8 covers training and evaluating LLMs, including hyperparameters, experiment tracking, parameter-efficient fine-tuning, and quantization techniques such as QLoRA.
■ Chapter 9 focuses on optimization and scaling. It explains pruning, distillation, sharding, inference optimization, GPU-level efficiency, FlashAttention, and long-context extensions.
■ Chapter 10 addresses ethical and responsible AI. It covers bias detection, transparency and explainability tools, responsible deployment, and safeguards against jailbreaks and misuse.

You can read the book cover to cover or begin with part 1 for the foundations and then jump to the topics most relevant to your work in parts 2 and 3.

About the code
This book is designed to provide both a strong theoretical foundation and practical skills. For that reason, it contains many examples of source code, both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text. Sometimes code is also in bold to highlight code that has changed from previous steps in the chapter, such as when a new feature is added to an existing line of code.

In many cases, the original source code has been reformatted; we've added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings include line-continuation markers (➥). Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts. I recommend you use the Jupyter notebooks directly rather than copying code from the printed listings, since the original source code has been reformatted. This way you can more easily build on them as blueprints for your own applications.

All source code is available in a dedicated GitHub repository at https://github.com/Nicolepcx/Transformers-in-Action. The repository is organized by chapters, with Jupyter notebooks that make the examples interactive and easy to extend. Each notebook includes an "Open in Colab" button so you can run the code directly. Some examples may require Colab Pro or a comparable GPU due to memory needs. You can get executable snippets of code from the liveBook (online) version of this book at https://livebook.manning.com/book/transformers-in-action. The complete code for the examples in the book is available for download from the Manning website at https://livebook.manning.com/books/transformers-in-action.

liveBook discussion forum
Purchase of Transformers in Action includes free access to liveBook, Manning's online reading platform. Using liveBook's exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It's a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/transformers-in-action/discussion. You can also learn more about Manning's forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning's commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest her interest stray! The forum and the archives of previous discussions will be accessible from the publisher's website as long as the book is in print.