Author: Drew Farris, Edward Raff, Stella Biderman for Booz Allen Hamilton

How GPT Works is an introduction to LLMs that explores OpenAI's GPT models. The book takes you inside ChatGPT, showing how a prompt becomes text output. In clear, plain language, this illuminating book shows you when and why LLMs make errors, and how you can account for inaccuracies in your AI solutions. Once you know how LLMs work, you'll be ready to start exploring the bigger questions of AI, such as how LLMs "think" differently than humans, how to best design LLM-powered systems that work well with human operators, and what ethical, legal, and security issues can—and will—arise from AI automation.

Learn how large language models like GPT and Gemini work under the hood, in plain English. How GPT Works translates years of expert research on Large Language Models into a readable, focused introduction to working with these amazing systems. It explains clearly how LLMs function, introduces the optimization techniques used to fine-tune them, and shows how to create pipelines and processes to ensure your AI applications are efficient and error-free.

In How GPT Works you will learn how to:

- Test and evaluate LLMs
- Use human feedback, supervised fine-tuning, and retrieval-augmented generation (RAG)
- Reduce the risk of bad outputs, high-stakes errors, and automation bias
- Design human-computer interaction systems
- Combine LLMs with traditional ML

ISBN: 1633437086
Publisher: Manning Publications
Publish Year: 2024
Language: English
Pages: 60
File Format: PDF
File Size: 2.5 MB
How GPT Works

Contents

welcome
1 Big Picture, What is GPT?
2 Tokenizers: How Large Language Models See The World
3 Transformers: How Inputs Become Outputs
welcome

Thank you for purchasing the MEAP for How GPT Works. We are well aware that the landscape of Large Language Models (LLMs) and Generative AI is changing at a rapid pace; we've contributed to the problem ourselves with our research! That's why we wanted to write this book on understanding what GPTs and LLMs are. Once you are done, the errors and successes you see with GPTs will make sense, and you'll be able to reason about them and design better systems using LLMs. Looking at the bigger picture, you'll also have some insight into human biases that might impact your LLMs, as well as the ethics of designing systems, "Artificial General Intelligence", and what makes GPT different from humans.

Although there will not be any coding in the book itself, having some coding experience will help you understand the examples we use. Do not worry about knowing any particular math, artificial intelligence, machine learning, or deep learning. We've strived to make the content self-contained from that perspective and to help you develop a useful understanding and intuition for what's happening.

The book is organized into two themes. The first half of the book focuses on "how do LLMs work?" It goes step by step from the text you type in, through the algorithms and how they are built, to how new text comes out the other end. It is designed to be timeless compared to the code and libraries in use today: APIs change rapidly, but the core fundamentals are more stable. This is because the technology behind LLMs is actually several years old now, steadily improving each year. The second half of the book focuses on the bigger questions. How is GPT different from a human? How do you design a system that will interact well with humans in an automated fashion, or augment them? Is it ethical to do these automations, and are we going to destroy the world via "AGI"? Some of these issues indeed have no one true answer, but by the end, you'll have a strong mental framework to think through and argue these issues.
These are all complex topics; we've spent years on them and still get tripped up ourselves. So, throughout each chapter, we'll also introduce a bit of dry humor. It is a book, after all; wet humor would result in soggy pages and printing issues. With that, please be sure to let us know any questions or suggestions that come up as you read the book in the liveBook discussion forum. It's challenging to make a book with this scope, and we want to earnestly incorporate as much feedback as possible.

—Drew Farris, Edward Raff, and Stella Biderman for Booz Allen Hamilton
1 Big Picture, What is GPT?

This chapter covers

- Introducing how GPT works, in plain language
- How humans and machines represent languages differently
- What Generative Pretrained Transformers and Large Language Models are
- Why ChatGPT performs so well
- The limitations and concerns when using ChatGPT

The hype around terms like machine learning (ML), deep learning (DL), and artificial intelligence (AI) has reached record levels. Much of the recent public exposure to these terms has been driven by a product called ChatGPT, built by a company called OpenAI. Seemingly overnight, the ability of computers to talk, learn, and perform complex tasks has taken a dramatic leap forward. Companies are forming, and existing firms are publicly investing billions of dollars. The technology in this space is evolving at a maddening pace.

This book aims to help you make sense of this new world by dispelling the mystery behind what makes ChatGPT and related technologies work. We will cover the knowledge necessary to understand the inner workings of ChatGPT and how the components (data and algorithms) stack together to create the tools we use. We'll also discuss the variety of cases where this technology can form the cornerstone of a broader system and others where systems like ChatGPT may be a poor choice. After reading this book, you'll come away with an understanding of what ChatGPT really is, what it can and can't do, and, importantly, the "why" behind its limitations. With this knowledge, you'll be a more effective consumer of this family of technology, either as a user or a software developer. This foundation will also serve as a launchpad for deeper study into the field by providing knowledge that will allow you to understand in-depth research and other works.
First, we need to get more specific about what we are discussing. People have called ChatGPT a form of generative AI. Broadly, generative AI is software capable of creating various media (e.g., text, images, audio, and video) based on data it has observed in the past, influenced by information about what people consider pleasing. For example, if ChatGPT is prompted with "write a haiku about snow falling on pines," ChatGPT will use all of the data it was trained with about haikus, snow, pines, and other forms of poetry to generate a novel haiku, as shown in Figure 1.1.

Figure 1.1 A simple haiku generated by ChatGPT

Fundamentally, ChatGPT is an ML model that generates new output, so generative AI is an appropriate description. Some examples of this are demonstrated in Figure 1.2. While ChatGPT deals primarily with text as input and output, it also has more experimental support for different data types. However, from our definition, you can imagine that many different kinds of algorithms and tasks fall under the description of generative AI.

Figure 1.2 Generative AI/ML is about taking some input (numbers, text, images) and producing a new output (usually text or images). Any combination of input/output options is possible, and the nature of the output depends on what the algorithm was trained for. It could be to add detail, rewrite something to be shorter, extrapolate missing portions, and more.
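To make this flow from prompt to output concrete, here is what asking an LLM for that same haiku looks like programmatically. This sketch is ours, not the book's (the authors deliberately include no code): it assumes the official openai Python package and an OPENAI_API_KEY in your environment, and the model name is a placeholder you would swap for whatever is current.

```python
# Illustrative sketch only: send the chapter's haiku prompt to a hosted LLM.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; substitute any current chat model
    messages=[
        {"role": "user", "content": "Write a haiku about snow falling on pines."},
    ],
)

# The reply is ordinary text, like the haiku shown in Figure 1.1.
print(response.choices[0].message.content)
```

Running the same prompt twice will usually yield two different haikus; generation is probabilistic, a point later chapters return to.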
Going a level deeper, ChatGPT is dealing with human text, and so it would also be fair to call it a model of human language — or a language model if you are a cool person who does work in the field known as natural language processing (NLP). The field of NLP intersects both computer science and linguistics and explores the technology that helps computers understand, manipulate, and create human language. Some of the first efforts in the field emerged in the 1940s, when researchers hoped to build machines that could automatically translate between languages. As a result, NLP and language models have been around for a very long time. So, what makes ChatGPT different? The most salient difference is that ChatGPT and similar algorithms are much larger than what people have historically built, and are trained on similarly larger amounts of data. For this reason, the name Large Language Models (LLMs)[1] has become quite popular[2]. A diagram of these relationships can be seen in Figure 1.3. ChatGPT is one of many products that operate via text and are built using LLMs. LLMs use techniques from AI and NLP, and the primary component of an LLM is a Transformer, which we will explain in Chapter 3.

Figure 1.3 A high-level map of the various terms you'll become familiar with and how they relate. Generative AI is a description of functionality: the function of generating content, using techniques from AI to accomplish that goal.
NOTE

Vision and language are not the only options for generative AI. Audio generation (think text-to-speech when your GPS speaks out the street names), playing board games like chess, and even protein folding have used generative AI. This book will stick mostly to text and language since we all know how to read (otherwise, why did you buy this?).

As the name large implies, these models are not small. ChatGPT specifically is rumored[3] to contain 1.76 trillion parameters. Each parameter is typically stored as a floating-point value of four bytes in size. That means the model itself takes seven terabytes to hold in memory (a short calculation at the end of this section works through the arithmetic). This is larger than what most people's computers could fit in RAM, let alone inside a Graphics Processing Unit (GPU) with 80 gigabytes of memory (for high-end units). GPUs are special-purpose hardware components that excel at performing the mathematical operations that make LLMs possible. Currently, many GPUs are required when making LLMs, so we are already discussing a lot of computational infrastructure and complexity over multiple machines to build an LLM. In contrast, a more run-of-the-mill language model would be 2 GB or less in most cases, over 3,500× smaller, a much more reasonable number.

What you will learn

Throughout this book, we will explain how LLMs work and equip you with the vocabulary needed to understand them. Once you've finished, you will have a conversational understanding of what an LLM is and the critical steps involved in its operation. Additionally, you will have some perspective on what an LLM reasonably can do, especially as you consider deploying or using one. We will discuss salient points about the fundamental limitations of
LLMs and provide tips on how to design around them, or when LLMs and, more broadly, Generative AI should be avoided entirely.

Who this book is for

Due to the broad impact that ChatGPT and LLMs will have on the world, we're purposely writing for a wide audience. Programmers of all backgrounds, executives, managers, sales staff, artists, writers, publishers, and many more will have to interact with, or have their jobs impacted by, LLMs over the coming years. So we are going to assume you, dear reader, have a minimal coding background but are familiar with the basic constructs of coding: logic, functions, and maybe even some data structures. You also do not need to be a mathematician; we will show you a bit of math where it is helpful, but it will be optional. This means that very little code will be presented in this book, and if your goal is to dive directly into using an LLM yourself, this is not the book you are looking for. But suppose you want to understand why the LLM you are using has unusual outputs, how your team might be able to use an LLM, where to avoid using an LLM, or you have a colleague with little machine learning background who needs to become conversationally competent. In that case, this is the book you need.

In particular, the first half of this book will focus on what LLMs do: their inputs and outputs, the process of converting inputs to outputs, and how we constrain the nature of those outputs. The second half of this book will focus on what humans do: how people interact with technology and what kinds of risks this creates for using generative AI. Similarly, we'll consider some ethical considerations in using and building these LLMs.

Training LLMs is expensive

Training an LLM is not realistically possible for most people; it is a $100,000 investment at a minimum and would be a hundred-million-dollar effort to try and compete with OpenAI. At the same time, many resources are evolving every month. So we have decided to focus on content with a shelf life: helpful knowledge that (we think) will be valid years from now, whereas a code example could be out of date in just a few months.
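As flagged above, here is the back-of-the-envelope arithmetic behind the chapter's memory figures, written out as a short script. It simply restates the chapter's own rumored and rounded numbers, so treat it as a sketch rather than an authoritative sizing: real deployments quantize weights, add optimizer and activation memory, and so on.

```python
# Back-of-the-envelope sizing using the chapter's own figures.
PARAMS = 1.76e12        # rumored GPT-4 parameter count
BYTES_PER_PARAM = 4     # one 32-bit floating-point value per parameter
GPU_MEMORY_GB = 80      # a high-end GPU
SMALL_MODEL_GB = 2      # a "run-of-the-mill" language model

model_gb = PARAMS * BYTES_PER_PARAM / 1e9   # decimal gigabytes

print(f"Model size: {model_gb / 1e3:.2f} TB")                                   # ~7.04 TB
print(f"80 GB GPUs just to hold the weights: {model_gb / GPU_MEMORY_GB:.0f}")   # ~88
print(f"Times larger than a 2 GB model: {model_gb / SMALL_MODEL_GB:,.0f}")      # ~3,520
```

Needing dozens of GPUs just to hold the weights is why the text speaks of infrastructure spanning multiple machines, and the final line is the "over 3,500× smaller" comparison in reverse.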
1.1 Introducing How ChatGPT Works

Generative Artificial Intelligence is poised to change the way that we produce and interact with information. The introduction of ChatGPT in November 2022 highlighted the capabilities of modern AI and fascinated a significant portion of the world. Indeed, you can sign up for free at https://chat.openai.com/ to try it out. If you enter the text prompt "Summarize the following text in two sentences," followed by all of the introductory text from this chapter, you will get something similar to the following:

"The recent surge in attention towards artificial intelligence, particularly large language models (LLMs) like ChatGPT from OpenAI, has highlighted their vast capabilities in natural language processing. This book aims to provide readers with a conversational understanding of LLMs, their operational intricacies, potential applications, limitations, and the ethical considerations surrounding their use while assuming only a basic familiarity with coding concepts and minimal mathematical background."

That's pretty impressive, and to a casual audience, it may seem like this capability has come out of nowhere. When you sign up, you may notice the option shown in Figure 1.4.

Figure 1.4 When you sign up for OpenAI's ChatGPT, you have two options: the GPT-3.5 model that you can use for free or the GPT-4 model that costs money.
As GPT-4 implies, OpenAI is currently working on its fourth generation of GPT models. LLMs like GPT-4 are a well-established area of ML research into creating algorithms that can synthesize and react to information and produce outputs that now seem to match human fidelity. This unlocks several areas of interaction between people and machines that previously existed only in science fiction. The strength of the language representation encoded into ChatGPT enables convincing dialog, instruction following, summary generation, question answering, content creation, and many more applications. Indeed, it is likely that many possible applications of this technology do not yet exist, because our gut reaction is to think of our current problems rather than new capabilities or products that could exist.

The critical factor for you, the reader, is that this did not come out of nowhere but has been a steady (yet rapid) march over the past decade of dramatic year-over-year improvements in machine learning. This means we already know quite a lot about how LLMs work and the ways that they can fail. Since we are assuming a minimal background so that you can give this book to your friends and family[4], there is a potentially large gap in the background that we need to cover before we dive in. This first chapter aims to give you that background, so the next chapter can begin the process of answering: "How on earth did a computer summarize the introduction of this book?"

1.1.1 What is "intelligence" anyway?
The name "Artificial Intelligence" is an excellent name from a marketing perspective, though it is the original name for a whole academic field of research. It has led to a subtle problem that gives people a false mental model of how AI/ML works. We are going to carefully try to avoid reinforcing this model. To explain why, we will discuss why "Artificial Intelligence" is not such a great name. This is easy to show with the simple question, "What is intelligence?" You might think that an Intelligence Quotient (IQ) test is a good answer to that question. IQ tests have a strong correlation with numerous outcomes like school performance, but that does not give us a definition of intelligence that is objective. Studies show that some amount of nature (heredity) and nurture (environment) impact a person's IQ. It should also seem suspicious that we can boil down intelligence into something as simple as just one number — after all, we often scold people for being only "book smart" but not "street smart". Even if we knew what intelligence was, what would make it artificial? Does intelligence have artificial flavorings and food colorings? The bottom line is that IQ tests measure your ability to perform a finite set of capabilities, mostly some specific types of logic puzzles under time constraints.

The purpose of this short discussion on definitions is to highlight that the field of AI has long been trying to get computers (rigid, deterministic, rule-following machines) to perform specific tasks that humans can do but simultaneously can't give precise definitions or instructions for. For example, if we want a computer to count to 1000 and print out every number divisible by 5, we can write detailed instructions that almost any programmer can convert to code. But if we ask you to write a program that detects if an arbitrary picture has a cat in it, that's quite a different challenge. You need to somehow precisely define what a cat is and then all the minutiae of how to detect one. How exactly do we write code to find and differentiate between cat whiskers and dog whiskers? How do we successfully recognize a cat when it does not have whiskers? When it comes down to it, it isn't easy to do.

However, because AI and ML have focused on these hard-to-specify capabilities that humans have, describing AI and ML algorithms using analogies has become especially common. To get a computer to detect cats,
we provide thousands upon thousands of examples of images that are cats and images that are not cats. We then run one of many algorithms with a specific, detailed, mathematical process for differentiating cats from the rest of the world. But in the technical vocabulary, we call this process learning. When the model fails to detect a cat in a new image because it is a lion, and lions were not in the original list of cats, we often say that the model didn't understand lions. Indeed, whenever we try to explain anything to friends, we often use analogies to shared concepts that both people are familiar with. Because AI/ML is broadly focused on replicating human abilities to perform tasks, the analogies often use language that implies the literal cognitive functions of a human.

As LLMs demonstrate capabilities at a level close to what humans can do, these analogies become more troublesome than helpful, because people read too deeply into them and begin to believe that they mean more than they do. For this reason, we will be very careful with our analogies and caution the reader about following any analogies too far. Some terms, like learning, are technical jargon worth understanding, but we want you to be on your guard about what they might imply. There will be cases where analogies are still helpful in this book, but we will try to be explicit about the boundaries of how far to interpret such analogies.

1.2 How humans and machines represent language differently

What does it mean to "represent language"? We humans learn this implicitly, shortly after birth, through our interaction with others and the world around us. We proceed through formal education to develop an understanding of the components, underlying structures, and rules that govern its use. Our internal representation of language has been studied extensively; while some laws of language have been uncovered, many are still up for debate. ChatGPT's internal representation of language is based on portions of this knowledge. It is enabled using the concepts of artificial neural networks[5], which are combinations of data structures and algorithms that are patterned loosely after
structures of the human brain. However, our understanding of the ways the mind works is incomplete. While the neural networks that power ChatGPT are a mere simplification of human brain function, their ability to capture and encode language in a way useful for generating language and interacting with people is where their power lies.

NOTE

Abstractions of the brain's structure have proven useful across many domains. Neural networks have demonstrated incredible progress not only in the area of language but also in vision, learning, and pattern recognition.

The convergence of advancements in neural machine learning algorithms, the extreme proliferation of digital data, and an explosion of computer hardware, such as GPUs, have led to the advancements that make ChatGPT possible today. The critical detail to take from this is that you, as a human, have some innate form of language you have learned over time. Your learning and use of language are interactive. Through evolution, we all seem to have relatively consistent ways of learning and communicating with each other[6]. Unlike people, LLMs have a representation of language that is learned via a static process. When you have a conversation with ChatGPT, it mechanically participates in a dialog with you despite having never been in a conversation before.

The representation of language an LLM learns can be high quality, but it is not error-free. It is also manipulable, in that we can alter the behavior of LLMs in specific ways to limit what they are aware of or what they produce. Understanding that LLMs represent language inferred from other examples of language helps keep our expectations within a realistic realm. If you are going to use an LLM, how dangerous is it if it is wrong? How can you work with the representation of language to build a product or avoid a bad outcome? These are some of the high-level items we will discuss throughout this book.

1.3 Generative Pretrained Transformers and friends
GPT stands for Generative Pretrained Transformer. GPT is a term invented by OpenAI for a new type of model, introduced in 2018, that incorporates a neural network component known as a Transformer. While the original GPT model (GPT-1) is no longer used, the core underlying ideas, like Transformers, have become a core pillar of the recent revolution in "generative artificial intelligence." It is also essential to recognize that ChatGPT is just one highly visible example of an expansive domain of algorithmic research and application of LLMs. Outside of the release of ChatGPT, we have observed an incredible proliferation of LLMs. Some LLMs, like those released by EleutherAI and the BigScience Research Workshop, are freely available to the public to advance research and explore applications. Corporations like Meta, Microsoft, and Google have released other LLMs with more restrictive licensing terms. This has created a vibrant community of researchers, hobbyists, and companies exploring the applications, limitations, and opportunities created by LLMs and Generative AI.

The concepts we teach in this book apply nearly uniformly to ChatGPT, its myriad existing cousins, and new models emerging almost daily. Each of these can produce output using structures like those found in ChatGPT. It may seem impossible that one book can contain a general summary applicable to many different models. This is possible for a few reasons, one of the most important being that we will not go to the level of depth necessary to code an LLM yourself from scratch. Our scope and descriptions are intentionally generalized to the most common and easiest-to-explain forms of LLMs in use today. The second reason we can give such a broadly applicable summary is the nature of LLMs today: many tweaks can be made to their language representations, how they are constructed, and the process used to build a representation from data. Despite all of these knobs to tune, it is a frustratingly consistent finding that the two details that matter the most are:

1. How large is the model, and can you make it larger?
2. How much data was used to build the model, and can you get more?

Researchers like to think they have vital insights or designs that meaningfully
improve how these LLMs work and operate. Still, in many cases, the same improvement could have been obtained just as easily by "making it bigger" instead. This will be a crucial component of many ethical concerns around using and building LLMs, which we will discuss in the book's second half.

1.3.1 Why GPTs and LLMs perform so well

We will discuss details about how LLMs work in the first half of this book, but it is also worth sharing a key lesson learned from researching ML algorithms. For many years, getting better performance from your algorithm, for whatever task you were trying to do, often meant getting clever about designing your algorithm. You would study your problem, the data, and the math and attempt to derive valuable truths about the world that you could encode into your algorithm. If you did a good job, your performance improved, you required less data, and all was good in the world. Many classic deep learning algorithms you may hear of, like Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, are, at a high level, results of people thinking hard and getting clever. Even "shallow" ML algorithms like XGBoost were born from this fruitful approach to algorithm design.

Figure 1.5 If the cleverness of an algorithm is based on how much information you encode into the design, older techniques often increase performance by being more clever than their predecessors. As reflected by the size of the circles, LLMs have mostly chosen a "dumber" approach of just using more data and parameters, imposing minimal constraints on what the algorithm can learn.

LLMs demonstrate a more recent trend. Instead of getting clever about the
algorithm, we keep it simple and implement a "naive" algorithm about how the world works. In many ways, LLMs have fewer beliefs about the world forcibly baked into the algorithm. This provides more flexibility. How could this be a good idea if we just told you the opposite approach was how people improved algorithms? The difference is that LLMs and similar techniques are just bigger, massively so, trained on far more data and with far more computation; this brute-force approach appears to have outpaced classic ML methods in performance. This idea is illustrated in Figure 1.5.

As we have already stated, bigger is not better by every metric. These models are currently a logistical and computational challenge to deploy. Many real-world constraints, including response time, power draw, battery drain, and maintainability, are all negatively impacted. So, it is only by a narrow definition of "performance" that LLMs have improved. Still, the lesson on the value of "going bigger" over "getting clever" is worth considering. Sometimes, in your design of a machine learning solution, even if you are using an LLM, the best answer may be "let's just go get a lot more data."

1.4 GPT in action: the good, bad, and scary

Throughout this book, we will give examples of how GPT can fail, often in hilarious or silly ways. The point of these illustrations isn't to say that GPT is incapable of performing a task. With changes to the input, setup, or random luck, you can often get GPT to work better. The point of such illustrations is to show you how GPT can fail, often on things so simple that a child can do them better. As you read through this book and interact with LLMs yourself, this should give you pause and lead you to the thought: "If I use GPT for a hard task, but it fails on easy ones, am I setting myself up for failure?" The answer may often be an emphatic yes!

Using LLMs safely requires a degree of skepticism or doubt about the outputs, your own work to verify and validate correctness, and the ability to adapt accordingly. If you use GPT for a task you cannot do yourself, you risk exposing yourself to errant results you can't verify personally. We will continually weave this point, and how to deal with it, into the conversation as
we discuss how to use GPT more throughout the book.

NOTE

ChatGPT can often be quite verbose, to put it politely. Our prompts will often include instructions to help GPT get right to the point so that you don't have to read a wall of text.

It is easy to imagine many ways that something like GPT could potentially make our lives easier when it does work: answering all your emails, summarizing long documents, and explaining new concepts. What does not come naturally to many is how things can go wrong and quickly become dangerous. This kind of adversarial thinking can often be prompted with an initial example: say you want to learn how to make a bomb. If you ask ChatGPT that question, you get the sanitized answer "Sorry, I can't assist with that request. If you're in crisis or need help, please reach out to local authorities or professionals who can help." However, researchers have recently shown how to get ChatGPT and many other commercial LLMs to answer the question without hesitation, amongst many other dangerous requests for information[7].

One might argue that if someone is so clever as to figure out how to trick the LLM, they could probably get whatever dangerous information they want from another source. This is likely true, but it fails to account for the scale of automation. No AI/ML algorithm is perfect: if millions of people are asking questions and an LLM produces a dangerous response just 0.01% of the time, then with ChatGPT's more than 100 million users[8], that is 10,000 dangerous responses. The problem worsens when you consider what a malicious actor might begin to automate, another issue we will discuss further in the second half of this book.

1.5 Summary

- ChatGPT is one type of Large Language Model, which is itself in the larger family of Generative AI/ML. Generative models produce new
output, and LLMs are unique in the quality of their output but are extremely costly to make and use.
- ChatGPT is loosely patterned after an incomplete understanding of human brain function and language learning. This is used as inspiration in design, but it does not mean ChatGPT has the same abilities or weaknesses as humans.
- Intelligence is a multi-faceted and hard-to-quantify concept, making it difficult to say if LLMs are intelligent. It is easier to think about LLMs and their potential use in terms of capabilities and reliability.
- Human language must be converted to and from an LLM's internal representation. How this representation is formed will change what an LLM learns and influence how you can build solutions using LLMs.

[1] Another, less popular name is "foundation models", though we are not keen on this name ourselves.

[2] Computer scientists are famous for their very blunt, adjective-forward naming strategies.

[3] https://wandb.ai/byyoung3/ml-news/reports/AI-Expert-Speculates-on-GPT-4-Architecture---Vmlldzo0NzA0Nzg4

[4] One of the authors is hopeful that they can give this book to their mother, who is very proud of them even if she does not know precisely what their job is.

[5] Also known as deep learning, another dangerous analogy.

[6] For more on this, look into Noam Chomsky's Universal Grammar.

[7] https://www.nytimes.com/2023/07/27/business/ai-chatgpt-safety-research.html

[8] https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/