Uploader: 高宏飞
Shared on 2026-03-25

Author: Alireza Parandeh

Ready to build production-grade applications with generative AI? This practical guide takes you through designing and deploying AI services using the FastAPI web framework. Learn how to integrate models that process text, images, audio, and video while seamlessly interacting with databases, filesystems, websites, and APIs. Whether you're a web developer, data scientist, or DevOps engineer, this book equips you with the tools to build scalable, real-time AI applications. Author Alireza Parandeh provides clear explanations and hands-on examples covering authentication, concurrency, caching, and retrieval-augmented generation (RAG) with vector databases. You'll also explore best practices for testing AI outputs, optimizing performance, and securing microservices. With containerized deployment using Docker, you'll be ready to launch AI-powered applications confidently in the cloud.

• Build generative AI services that interact with databases, filesystems, websites, and APIs
• Manage concurrency in AI workloads and handle long-running tasks
• Stream AI-generated outputs in real time via WebSockets and server-sent events
• Secure services with authentication, content filtering, throttling, and rate limiting
• Optimize AI performance with caching, batch processing, and fine-tuning techniques

ISBN: 1098160304
Publisher: O’Reilly Media
Publish Year: 2025
Language: English
Pages: 531
File Format: PDF
File Size: 7.4 MB
Text Preview (First 20 pages)

Building Generative AI Services with FastAPI
A Practical Approach to Developing Context-Rich Generative AI Applications
Alireza Parandeh
Foreword by David Foster
ISBN: 978-1-098-16030-2 | US $69.99 | CAN $87.99 | Web Development

Alireza Parandeh is a chartered engineer with the UK Engineering Council and a Microsoft- and Google-certified developer, data engineer, and data scientist.
Praise for Building Generative AI Services with FastAPI

"A masterclass in turning cutting-edge AI into real-world impact. Ali distills the complexity of generative models and FastAPI into an approachable, empowering guide for builders at every level."
—Alan King, founder, AI Your Org

"A must-have for software developers and data scientists to learn to build production-grade generative AI services with FastAPI. Ali's clear explanations and depth of technical expertise will keep you ahead in this exciting and emerging field."
—Joe Rowe, head of technical assurance & compliance, Applied Data Science Partners

"This book is superb at taking complicated topics and explaining them in a simple, easy-to-understand way for developers and non-developers alike. It's a fascinating deep dive into using GenAI in your development projects, which is only going to become more and more important as time goes on!"
—Lee Dalchow, software engineer

"A practical introduction to generative AI with valuable insights on building real-world services. This book is a good starting point for aspiring AI developers."
—Julian Brendel, senior Python developer, Vitol

"The book is well-structured, and the way it presents the topics gives you a solid foundation in the subject. I've recommended it to my colleagues!"
—Daniel Saad, software engineer, Mercedes-Benz Tech Innovation
978-1-098-16030-2 [LSI]

Building Generative AI Services with FastAPI
by Alireza Parandeh

Copyright © 2025 Ali Parandeh. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Amanda Quinn
Development Editor: Rita Fernando
Production Editor: Clare Laylock
Copyeditor: Kim Wimpsett
Proofreader: Vanessa Moore
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

April 2025: First Edition

Revision History for the First Edition
2025-04-15: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098160302 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Building Generative AI Services with FastAPI, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Foreword
Preface

Part I. Developing AI Services

1. Introduction
    What Is Generative AI?
    Why Generative AI Services Will Power Future Applications
        Facilitating the Creative Process
        Suggesting Contextually Relevant Solutions
        Personalizing the User Experience
        Minimizing Delay in Resolving Customer Queries
        Acting as an Interface to Complex Systems
        Automating Manual Administrative Tasks
        Scaling and Democratizing Content Generation
    How to Build a Generative AI Service
    Why Build Generative AI Services with FastAPI?
    What Prevents the Adoption of Generative AI Services
    Overview of the Capstone Project
    Summary

2. Getting Started with FastAPI
    Introduction to FastAPI
    Setting Up Your Development Environment
        Installing Python, FastAPI, and Required Packages
        Creating a Simple FastAPI Web Server
    FastAPI Features and Advantages
        Inspired by Flask Routing Pattern
        Handling Asynchronous and Synchronous Operations
        Built-In Support for Background Tasks
        Custom Middleware and CORS Support
        Freedom to Customize Any Service Layer
        Data Validation and Serialization
        Rich Ecosystem of Plug-Ins
        Automatic Documentation
        Dependency Injection System
        Lifespan Events
        Security and Authentication Components
        Bidirectional Web Socket, GraphQL, and Custom Response Support
        Modern Python and IDE Integration with Sensible Defaults
    FastAPI Project Structures
        Flat Structure
        Nested Structure
        Modular Structure
        Progressive Reorganization of Your FastAPI Project
        Onion/Layered Application Design Pattern
    Comparing FastAPI to Other Python Web Frameworks
    FastAPI Limitations
        Inefficient Model Memory Management
        Limited Number of Threads
        Restricted to Global Interpreter Lock
        Lack of Support for Micro-Batch Processing Inference Requests
        Cannot Efficiently Split AI Workloads Between CPU and GPU
        Dependency Conflicts
        Lack of Support for Resource-Intensive AI Workloads
    Setting Up a Managed Python Environment and Tooling
    Summary

3. AI Integration and Model Serving
    Serving Generative Models
        Language Models
        Audio Models
        Vision Models
        Video Models
        3D Models
    Strategies for Serving Generative AI Models
        Be Model Agnostic: Swap Models on Every Request
        Be Compute Efficient: Preload Models with the FastAPI Lifespan
        Be Lean: Serve Models Externally
    The Role of Middleware in Service Monitoring
    Summary
    Additional References

4. Implementing Type-Safe AI Services
    Introduction to Type Safety
    Implementing Type Safety
        Type Annotations
        Using Annotated
        Dataclasses
    Pydantic Models
        How to Use Pydantic
        Compound Pydantic Models
        Field Constraints and Validators
        Custom Field and Model Validators
        Computed Fields
        Model Export and Serialization
        Parsing Environment Variables with Pydantic
    Dataclasses or Pydantic Models in FastAPI
    Summary

Part II. Communicating with External Systems

5. Achieving Concurrency in AI Workloads
    Optimizing GenAI Services for Multiple Users
    Optimizing for I/O Tasks with Asynchronous Programming
        Synchronous Versus Asynchronous (Async) Execution
        Async Programming with Model Provider APIs
        Event Loop and Thread Pool in FastAPI
        Blocking the Main Server
        Project: Talk to the Web (Web Scraper)
        Project: Talk to Documents (RAG)
    Optimizing Model Serving for Memory- and Compute-Bound AI Inference Tasks
        Compute-Bound Operations
        Externalizing Model Serving
    Managing Long-Running AI Inference Tasks
    Summary
    Additional References

6. Real-Time Communication with Generative Models
    Web Communication Mechanisms
        Regular/Short Polling
        Long Polling
        Server-Sent Events
        WebSocket
        Comparing Communication Mechanisms
    Implementing SSE Endpoints
        SSE with GET Request
        SSE with POST Request
    Implementing WS Endpoints
        Streaming LLM Outputs with WebSocket
        Handling WebSocket Exceptions
    Designing APIs for Streaming
    Summary

7. Integrating Databases into AI Services
    The Role of a Database
    Database Systems
    Project: Storing User Conversations with an LLM in a Relational Database
        Defining ORM Models
        Creating a Database Engine and Session Management
        Implementing CRUD Endpoints
        Repository and Services Design Pattern
    Managing Database Schema Changes
    Storing Data When Working with Real-Time Streams
    Summary

Part III. Securing, Optimizing, Testing, and Deploying AI Services

8. Authentication and Authorization
    Authentication and Authorization
    Authentication Methods
        Basic Authentication
        JSON Web Tokens (JWT) Authentication
    Implementing OAuth Authentication
        OAuth Authentication with GitHub
        OAuth2 Flow Types
    Authorization
    Authorization Models
        Role-Based Access Control
        Relationship-Based Access Control
        Attribute-Based Access Control
        Hybrid Authorization Models
    Summary

9. Securing AI Services
    Usage Moderation and Abuse Protection
    Guardrails
        Input Guardrails
        Output Guardrails
        Guardrail Thresholds
        Implementing a Moderation Guardrail
    API Rate Limiting and Throttling
        Implementing Rate Limits in FastAPI
        Throttling Real-Time Streams
    Summary

10. Optimizing AI Services
    Optimization Techniques
        Batch Processing
        Caching
        Model Quantization
        Structured Outputs
        Prompt Engineering
        Fine-Tuning
    Summary

11. Testing AI Services
    The Importance of Testing
    Software Testing
        Types of Tests
        The Biggest Challenge in Testing Software
    Planning Tests
        Test Dimensions
        Test Data
        Test Phases
        Test Environments
        Testing Strategies
    Challenges of Testing GenAI Services
        Variability of Outputs (Flakiness)
        Performance and Resource Constraints (Slow and Expensive)
        Regression
        Bias
        Adversarial Attacks
        Unbound Testing Coverage
    Project: Implementing Tests for a RAG System
        Unit Tests
        Integration Testing
        End-to-End Testing
    Summary

12. Deployment of AI Services
    Deployment Options
        Deploying to Virtual Machines
        Deploying to Serverless Functions
        Deploying to Managed App Platforms
        Deploying with Containers
    Containerization with Docker
        Docker Architecture
        Building Docker Images
        Container Registries
        Container Filesystem and Docker Layers
        Docker Storage
        Docker Networking
        Enabling GPU Driver
    Docker Compose
        Enabling GPU Access in Docker Compose
    Optimizing Docker Images
    docker init
    Summary

Afterword
Index
Foreword

I remember the day Ali, our head of engineering at ADSP, walked confidently into the office and declared he wanted to write a book on building generative AI services. Knowing the mammoth-sized undertaking that is writing a technical book, I offered him a strong cup of coffee and regaled him with a few tales of my own late-night writing sessions, fueled by caffeine and the sheer will to meet a deadline. I might have even thrown in a cautionary whisper about the ever-present temptation to rewrite entire chapters at 3 a.m. But Ali was steadfast. He had that glint in his eye—a mix of determination and a clear vision. He knew he wanted to create something special, something that would demystify the complexities of generative AI and empower others to build.

Having now read Building Generative AI Services with FastAPI, I can say he's done far more than that. Ali has crafted a truly indispensable guide for anyone looking to move beyond theoretical discussions about AI and into the realm of practical, real-world application. And somehow, he's made the whole process look deceptively easy.

As co-founder of an AI consultancy, I've seen firsthand the growing need for engineers who can not only understand how AI works but also build production-grade solutions with AI. We are in a period of profound transformation, where AI is rapidly changing how we live and work. It's no longer enough to be a passive consumer of AI-powered products. The future belongs to those who can harness the power of generative models to create, innovate, and solve real problems. This book is the perfect starting point for that journey.

Ali's technical expertise is evident on every page. He effortlessly blends complex concepts with clear, concise explanations and practical examples. The code snippets aren't just toy examples; they are building blocks for real applications.
He guides you through the intricacies of FastAPI, authentication, authorization, and database integration with the confidence of a seasoned engineer who has spent countless hours wrestling with these challenges in the real world.
Building Generative AI Services with FastAPI is a vital resource for any engineer looking to navigate the rapidly evolving landscape of AI. It's a testament to Ali's technical leadership and his remarkable ability to make the complex accessible. This book isn't just about building AI services; it's about empowering a new generation of engineers to shape the future. And that future, thanks to works like this, looks incredibly bright.

—David Foster
Partner at ADSP
Author of Generative Deep Learning (O'Reilly, 2024)
Preface

Generative AI (GenAI) has been taking the world by storm since the release of technologies like ChatGPT. This new type of AI can create content in various modalities (such as text, audio, and video) by learning to mimic patterns from its training data. With the increased advancement in GenAI capabilities, many businesses are investing in off-the-shelf or custom AI tools. These tools require maintainable and scalable backend services that can adapt to high demand.

AI capabilities are exciting because they open the door to endless possibilities that unlock the potential for new tools. Before generative AI, developers had to write scripts and train optimization models to build automation and data pipelines for processing unstructured data such as corpora of text. This process could be tedious, error-prone, and applicable only to limited use cases. However, with the rise of GenAI models such as large language models (LLMs), we can now digest, compare, and summarize unstructured datasets and documents; reword complex ideas; and generate visualizations and illustrations.

While most generative models such as ChatGPT are excellent at what they do on their own, can you imagine the possibilities when we connect them to the internet, our own databases, and other services? If we can just "talk" to our services in natural language, or give them an image, video, or audio clip and get them to do things for us, it opens up many opportunities to create newly accessible and automated applications.

Chatbots are not the only apps that we can create with such generative models. There is so much more we can do. We can create backend service agents that can perform various complex tasks requiring comprehension, logical reasoning, and analysis of texts.

By connecting our generative models to existing services and the internet, we are giving our AI services additional data to enrich their understanding of the problem at hand. For instance, a company can use an open source, in-house, fine-tuned LLM to parse purchase orders, generate invoices, and validate data against their customer
database before placing an order with a payment system. This is where generative models shine. Other use cases include content management systems that help users generate content and website builders that suggest imagery, icons, and user interface (UI) components to fast-track a site's design.

There is a catch, though. LLMs and other generative models require heavy processing power and memory to function, and it is not always clear what deployment patterns and integration layers developers should use to leverage these models. Building generative AI services is challenging because you need to balance scalability, security, performance, and data privacy. You'll also want the ability to moderate, retrain, and optimize these services for real-time inference. These challenges will be different for every organization, and how you build your generative AI services will depend on your existing software systems and services.

Existing resources and documentation provide the necessary information to get started with training custom models and fine-tuning large language models. However, most developers still face challenges in packaging and deploying these novel generative models as part of existing software systems and services. My aim with this book is to show you how to productionize GenAI by understanding the end-to-end process of building and deploying your own AI services with tools such as the FastAPI web framework.

Objective and Approach

The objective of this book is to help you explore the challenges of developing, securing, testing, and deploying generative AI as services integrated with your own external systems and applications.

This book centers on constructing modular, type-safe generative AI services in FastAPI, with seamless database schema handling and model integration to power backends that can generate new data. The significance of these topics stems from the growing demand for flexible services that can adapt to changing requirements, maintain high performance, and scale efficiently using the microservice pattern.

You will also learn the process of enriching your services with contextual data from a variety of sources such as databases, the web, external systems, and files uploaded by users.

A few generative models require heavy processing power and memory to function. You will explore how to handle these models in production and how to scale your services to handle the load. You will also explore how to handle long-running tasks such as model inference.
Finally, we will discuss authentication concepts, security considerations, performance optimization, testing, and deployment of production-ready generative AI services.

Prerequisites

This book assumes no prior knowledge of generative AI and won't require you to fully understand how generative models work. I will cover the intuition of how such models generate data but will not dive into their underlying mathematics. However, if you want to learn more about building your own generative AI models in detail, I recommend Generative Deep Learning by David Foster (O'Reilly, 2024).

As this is a FastAPI book for generative AI applications, I do assume some familiarity with this web framework. If you need a refresher or would like to expand your understanding of FastAPI features, I recommend reading FastAPI by Bill Lubanovic (O'Reilly, 2023). However, this is not a requirement for following along with this book.

Furthermore, the book does assume some experience with Python, with Docker for deployment, with how the web works, and with communicating over the HTTP protocol. To brush up on your Python skills, I highly recommend visiting Real Python for excellent tutorials on more advanced concepts. The official Docker website also provides an excellent practical tutorial on containerization and writing Dockerfiles. I will not be covering the fundamentals of the web in this book, but I highly recommend MDN's documentation as a starting point.

Finally, the book won't require knowledge of deep learning frameworks such as TensorFlow and Keras. Where relevant, you'll be introduced to these frameworks. Instead, we will mostly work with pretrained models hosted on the Hugging Face model repository.

Book Structure

The book is broken into three parts:

Part I, "Developing AI Services"
This part covers all the necessary steps to set up a FastAPI project that will power your GenAI service.
You will learn to integrate various generative models into a type-safe FastAPI application and expose endpoints to interact with them.

• Chapter 1, "Introduction": This chapter discusses the importance of GenAI in the future and introduces the practical projects you'll build throughout the book.
• Chapter 2, "Getting Started with FastAPI": This chapter introduces FastAPI, a modern framework for building AI services. You will understand its features, its limitations, and how it compares to other web frameworks. By the end of this chapter, you will be able to start creating FastAPI applications, progressively organize projects, and migrate from frameworks like Flask or Django.

• Chapter 3, "AI Integration and Model Serving": This chapter covers the full process of integrating and serving various GenAI models (including language, audio, vision, and 3D models) as a FastAPI service using the application lifespan. We'll review various strategies for model serving, such as preloading, externalizing, and monitoring models with middleware.

• Chapter 4, "Implementing Type-Safe AI Services": This chapter introduces the concept of type safety and how Python's type annotations and data validation tools like Pydantic can help validate and serialize data flowing through your AI services.

Part II, "Communicating with External Systems"
In this part, we'll integrate our AI services with external systems such as databases and learn how to serve concurrent users. We will also implement real-time streaming of model outputs.

• Chapter 5, "Achieving Concurrency in AI Workloads": This chapter introduces the concepts of concurrency and parallelism and compares different strategies for solving concurrency problems. We'll review the purpose of asynchronous programming in handling long-running and blocking tasks, as well as the limitations of Python's Global Interpreter Lock (GIL) when handling these asynchronous processes. To practice, we'll implement a working "talk to the web and your documents" chatbot using a technique called retrieval-augmented generation (RAG). Finally, we'll cover FastAPI's background tasks feature for tackling long-running operations.
• Chapter 6, "Real-Time Communication with Generative Models": In this chapter, we will focus on enabling real-time client-server communication with generative models. As part of this, we'll compare mechanisms such as WebSockets and server-sent events for streaming data to and from generative models, with practical examples.

• Chapter 7, "Integrating Databases into AI Services": This chapter provides an overview of database technologies suitable for GenAI services. We'll cover best practices for working with databases using battle-tested tools such as SQLAlchemy ORM and Alembic for facilitating migrations. Finally, we'll introduce Prisma, an upcoming tool for generating a fully typed database client and automatically handling migrations.
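As a taste of the real-time streaming covered in Part II, the sketch below shows the server-sent events wire format applied to an async token stream. This is not code from the book: `fake_llm_stream` is a hypothetical stand-in for a real model client, and the example only demonstrates the `data: ...` framing (each event is a `data:` field terminated by a blank line) that an SSE endpoint would emit.

```python
import asyncio


async def fake_llm_stream(prompt: str):
    # Hypothetical stand-in for a model client that yields tokens one by one.
    for token in ["Generative", "AI", "services", "stream", "tokens."]:
        await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
        yield token


async def sse_events(prompt: str):
    # Wrap each token in the SSE wire format: a "data:" field plus a blank line.
    async for token in fake_llm_stream(prompt):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # conventional end-of-stream sentinel


async def main() -> str:
    # Collect the framed events as a client reading the stream would see them.
    chunks = [chunk async for chunk in sse_events("hello")]
    return "".join(chunks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a FastAPI endpoint, a generator like `sse_events` would typically be returned via a streaming response with the `text/event-stream` media type, which is the pattern Chapter 6 develops in full.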
Part III, "Securing, Optimizing, Testing, and Deploying AI Services"
In this part, we focus on implementing the authentication layer for user management, alongside security and optimization enhancements. We'll then shift our focus to testing and, finally, deploying our AI service through containerization.

• Chapter 8, "Authentication and Authorization": In this chapter, we will cover the implementation of authentication layers for user management to secure, protect, and restrict access to AI services. We'll review and implement various authentication strategies, including basic, token-based, and OAuth. We'll then introduce authorization models, including role-based access control (RBAC), and explain the role of FastAPI's dependency graph in the process. This will include adding restrictive permissions for users based on roles, where AI service interactions can be automatically moderated.

• Chapter 9, "Securing AI Services": This chapter provides an overview of common attack vectors for generative solutions. Here, we'll focus on implementing various security measures across our AI service, such as rate limiting and guardrails, to protect against toxic model outputs, common attacks, abuse, and misuse.

• Chapter 10, "Optimizing AI Services": This chapter covers performance optimization techniques like batch processing, semantic caching, and prompt engineering for enhancing the quality and speed of AI services.

• Chapter 11, "Testing AI Services": This chapter covers the challenges and best practices in testing AI services. We'll review various testing concepts, including testing phases, boundaries, and mocks, and then implement mocks of external services to keep test environments isolated. Finally, we'll introduce a novel approach to testing generative AI models even when they produce varying outputs across test runs.
• Chapter 12, "Deployment of AI Services": This chapter covers various deployment approaches, including the use of virtual machines, cloud functions, managed app services, and containerization technologies like Docker. We'll then focus on containerization concepts, such as storage and networking, for deploying our AI service using Docker.

How to Read This Book

This book can be read cover to cover or used as a reference so you can dip into any chapter. In every chapter, I explain the concepts and compare approaches before we dive into practical code examples. Therefore, I recommend reading each chapter twice: once to understand the approach, and again to work through the code examples yourself using this book's accompanying code repository.
I am a firm believer in explaining complex technical concepts with everyday analogies, diagrams, and stories that anyone can relate to. These are often used after a new complex concept is introduced.

Look out for tip sections like this one to help improve your understanding of the concepts.

Ultimately, the best way to learn the concepts in this book is to get your hands on an open source generative model and then build a service around it using your own code. Above all, I hope you find it a useful and enjoyable read!

Hardware and Software Requirements

Running generative models is generally a compute-intensive task that requires a strong GPU. However, I've tried my best to provide code examples that use small open source generative models that won't require a GPU.

Only a few chapters have code examples that require access to a GPU to process concurrent operations or to run heavier models. In such cases, I recommend renting a virtual machine with CUDA-enabled NVIDIA GPUs from any cloud provider or working from a CUDA-enabled GPU desktop with a minimum of 16 GB of VRAM. Please refer to NVIDIA's CUDA installation instructions for Windows or Linux. Finally, to run models on a CUDA-enabled NVIDIA GPU, you will also need to install the torch package compiled for CUDA.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.