Uploader: 高宏飞 · Shared on 2026-03-24

Author: Roland Huß, Daniele Zonca

Generative AI is revolutionizing industries, and Kubernetes has fast become the backbone for deploying and managing these resource-intensive workloads. This book serves as a practical, hands-on guide for MLOps engineers, software developers, Kubernetes administrators, and AI professionals ready to combine AI innovation with the power of cloud native infrastructure. Authors Roland Huß and Daniele Zonca provide a clear road map for training, fine-tuning, deploying, and scaling GenAI models on Kubernetes, addressing challenges like resource optimization, automation, and security along the way. With actionable insights and real-world examples, readers will learn to tackle the opportunities and complexities of managing GenAI applications in production environments. Whether you're experimenting with large language models or facing the nuances of AI deployment at scale, you'll uncover the expertise you need to operationalize this exciting technology effectively.

• Learn how to deploy LLMs more efficiently with optimized inference runtimes
• Get hands-on with GPU scheduling, including hardware detection and multinode scaling
• Monitor and understand LLM-specific metrics like Time to First Token and token throughput
• Know when to fine-tune a model or when retrieval augmentation is the better choice
• Discover how to evaluate models with standardized benchmarks before committing GPU resources
• Learn to run agentic applications with secure tool integration, identity management, and persistent state

Tags: none
ISBN: 1098171926
Publisher: O'Reilly Media
Publish Year: 2026
Language: English
Pages: 407
File Format: PDF
File Size: 8.0 MB
Text Preview (First 20 pages)

Generative AI on Kubernetes: Operationalizing Large Language Models
Roland Huß & Daniele Zonca
ISBN: 978-1-098-17192-6 · US $59.99 · CAN $74.99 · GENERATIVE AI

Generative AI is revolutionizing industries, and Kubernetes has fast become the backbone for deploying and managing these resource-intensive workloads. This book serves as a practical, hands-on guide for MLOps engineers, software developers, Kubernetes administrators, and AI professionals ready to combine AI innovation with the power of cloud native infrastructure. Authors Roland Huß and Daniele Zonca provide a clear road map for deploying, fine-tuning, and scaling large models on Kubernetes to power AI-driven applications, addressing challenges like GPU resource optimization, observability, and security along the way. With actionable insights and real-world examples, readers will learn to tackle the opportunities and complexities of managing GenAI applications in production environments. Whether you’re experimenting with large language models (LLMs) or facing the nuances of AI deployment at scale, you’ll uncover the expertise you need to operationalize this exciting technology effectively.

• Learn how to deploy LLMs more efficiently with optimized inference runtimes
• Get hands-on with GPU scheduling, including hardware detection and multinode scaling
• Monitor and understand LLM-specific metrics like Time to First Token and token throughput
• Know when to fine-tune a model or when retrieval augmentation is the better choice
• Discover how to evaluate models with standardized benchmarks before committing GPU resources
• Learn to run agentic applications with secure tool integration, identity management, and persistent state

Dr. Roland Huß is a distinguished engineer at Red Hat with over 25 years of experience in software engineering, specializing in infrastructure for AI-enabled applications, serverless architectures, and cloud native platforms.

Daniele Zonca is a chief architect at Red Hat, responsible for the technical vision of Red Hat AI offerings on Kubernetes. He specializes in enterprise AI adoption using open source projects including TrustyAI, KServe, vLLM, llm-d, and Kubeflow.

“This book bridges the worlds of Kubernetes and AI with a broad yet practical perspective on operating GenAI systems.”
—Bilgin Ibryam, coauthor of Kubernetes Patterns, principal product manager at Diagrid

“This book is an invaluable resource for infrastructure engineers, whether they’re looking to deploy generative AI applications on Kubernetes or break into the field with a strong understanding of modern infrastructure.”
—Nikhil Devnani, senior machine learning engineer
Praise for Generative AI on Kubernetes

The book does an excellent job of bridging the gap between Kubernetes expertise and the real-world challenges of running large language models in production. I appreciate its clear structure—progressing from simple to advanced examples, from core concepts to real-world applications, and from foundational building blocks to complete system implementations. Its use of cutting-edge open source projects from the cloud-native AI ecosystem to demonstrate key ideas truly sets it apart from other books in this space.
—Yuan Tang, senior principal software engineer at Red Hat

This book’s content is strategic in the short term to equip readers with the knowledge required to successfully deploy and manage GenAI and LLM workloads efficiently with Kubernetes, and in the longer term by teaching concepts that can be tuned to the actual use case motivating the AI-infused workload.
—Matteo Mortari, principal software engineer, Red Hat AI

This book bridges the worlds of Kubernetes and AI with a broad yet practical perspective on operating GenAI systems.
—Bilgin Ibryam, coauthor of Kubernetes Patterns, principal product manager at Diagrid

This book is an invaluable resource for infrastructure engineers, whether they’re looking to deploy generative AI applications on Kubernetes or break into the field with a strong understanding of modern infrastructure.
—Nikhil Devnani, senior machine learning engineer
Generative AI on Kubernetes: Operationalizing Large Language Models
Roland Huß and Daniele Zonca
978-1-098-17192-6 [LSI]

Generative AI on Kubernetes
by Roland Huß and Daniele Zonca

Copyright © 2026 Roland Huß and Daniele Zonca. All rights reserved.

Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editors: John Devins, Megan Laddusaw
Development Editor: Angela Rufino
Production Editor: Clare Laylock
Copyeditor: Sonia Saruba
Proofreader: Piper Content Partners
Indexer: Sue Klefstad
Cover Designer: Susan Thompson
Cover Illustrator: Monica Kamsvaag
Interior Designer: David Futato
Interior Illustrator: Kate Dullea

March 2026: First Edition

Revision History for the First Edition
2026-02-27: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098171926 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Generative AI on Kubernetes, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Red Hat. See our statement of editorial independence.
Table of Contents

Preface  xi
Introduction  xxi

Part I. Inference

1. Deploying Models  3
    “It Works on My Machine”  4
    Model Server  6
    vLLM  8
    Hugging Face Text Generation Inference  11
    Other Model Servers  13
    Deploying Models to Kubernetes Manually  16
    Model Server Controller  19
    KServe  20
    Ray Serve and KubeRay  27
    Lessons Learned  30

2. Model Data  33
    Model Data Storage Formats  34
    Weight-Only Formats  35
    Self-Contained Formats  37
    ONNX  41
    Safetensors  42
    GGUF and GGML  44
    Current State and Gaps  45
    Model Registry  46
    Hugging Face Model Hub  48
    MLflow Model Registry  49
    Kubeflow Model Registry  54
    OCI Registry  57
    Accessing Model Data in Kubernetes  59
    Shared Storage with PersistentVolumes  61
    OCI Image for Storing Model Data  65
    Modelcars  68
    OCI Image Volume Mounts  74
    Lessons Learned  76

Part II. Production Readiness

3. Kubernetes and GPUs  81
    GPU Discovery  82
    Node Feature Discovery  83
    GPU Feature Discovery  84
    Kubernetes GPU Device Plug-Ins  86
    GPU Workload Scheduling  87
    Label-Based Scheduling  88
    Resource-Based Scheduling  91
    Dynamic Resource Allocation  92
    NVIDIA GPU Operator  95
    Operator Configuration with ClusterPolicy  97
    Sub-GPU Allocation  98
    Multi-GPU Inference  104
    Data Parallelism  104
    Model Parallelism  106
    Single-Node Versus Multinode Inference  110
    GPU Resource Optimizations  114
    Lessons Learned  116

4. Running in Production  119
    Model and Runtime Tuning  120
    Language Model Evaluation  121
    Language Model Compression  123
    Model Performance Benchmark  125
    vLLM Runtime Parameters Tuning  128
    Autoscaling  132
    Optimize vLLM Startup Time  136
    LLM-Aware Routing  139
    From API Gateway to AI Gateway  143
    Gateway API Inference Extension  145
    Disaggregated Serving  148
    Lessons Learned  152

5. Model Observability  153
    Observability Stack and Configuration  154
    Logs  154
    Metrics  156
    Tracing  158
    Model Server Metrics  160
    Time To First Token  161
    Time Per Output Token or Inter-Token Latency  161
    Throughput  162
    Latency  162
    Request Queue Metrics  163
    GPU Usage Monitoring  164
    Quality Metrics  165
    Responsible AI  167
    Explainability  167
    Fairness  168
    Model Safety: Hallucination and Guardrails  169
    Understanding and Detecting Hallucinations  169
    Runtime Guardrails  170
    Lessons Learned  175

Part III. Tuning

6. Model Customization  179
    Introduction to LLM Creation  179
    Prompt and Context Engineering  181
    When to Use Model Customization  183
    Tuning a Model  184
    Fine-Tuning  185
    Parameter-Efficient Fine-Tuning  186
    Low-Rank Adaptation  187
    Running Tuning Jobs on Kubernetes  190
    Kubeflow Trainer  192
    Other Frameworks  200
    Lessons Learned  203

7. Job Scheduling Optimization  205
    Kubernetes Scheduler Optimization  207
    Core Kubernetes Scheduler  207
    Resource Bin Packing Strategy  208
    Dynamic Scheduling with Descheduler  209
    Gang Scheduling  212
    PyTorch Rendezvous and Gang Scheduling  213
    Comparing Gang Scheduling Solutions  213
    Topology-Aware Scheduling  220
    Comparing Topology-Aware Scheduling Solutions  223
    Quota Management and Multitenancy: GPU as a Service  229
    Comparing Quota Management and Multitenancy Solutions  229
    Network Optimization for Distributed Training  235
    Comparing Network Technologies for GPU Communication  238
    Using Secondary Network Interfaces in Kubernetes  243
    Bridging HPC and Kubernetes: Slurm and Slinky  248
    Storage for Training  249
    Training Job Security  250
    Security Guidelines for Ray  251
    Security Guidelines for PyTorch  254
    Observability of Training Jobs  254
    Metrics Collection for Distributed Training  255
    Logging Across Distributed Workers  256
    Tracing Distributed Training Operations  256
    Lessons Learned  257

Part IV. AI-Driven Apps

8. AI-Driven Applications  261
    Architectural Patterns  262
    Kubernetes Workload Types  262
    Chat Applications  263
    Backend AI Services  266
    Retrieval-Augmented Generation  274
    RAG Components  275
    Document Ingestion  278
    User Query Processing  281
    RAG on Kubernetes  283
    Agentic Workflows  286
    Agentic Frameworks and Runtimes  290
    OpenAI’s Responses API  291
    Agents on Kubernetes  292
    Multiagent Systems  295
    Ambient Agents  298
    Lessons Learned  300

9. Running Agentic Applications in Production  301
    The Model Context Protocol  303
    MCP Security  305
    Agent Impersonation (Token Passthrough)  306
    Service Account Delegation  307
    Delegated Identity via OAuth2 Token Exchange  318
    Mutual TLS with SPIFFE/SPIRE (Zero-Trust)  321
    Agent-to-Agent Protocol  329
    A2A Complements MCP  329
    A2A in a Nutshell  330
    Running A2A on Kubernetes  331
    Agent State Management  332
    State Storage Patterns  333
    Choosing Between Key-Value Stores and Databases  334
    Checkpointing for Long-Running Agents  335
    Lessons Learned  336

Afterword  339
Index  343
Preface

The end of 2022 marked a turning point in the world of AI with the release of ChatGPT, a chat-based language model designed to generate human-like text in response to conversational input. We all witnessed an AI revolution that transformed our expectations and possibilities.

Generative AI models have been around for a while. In fact, deep learning concepts have existed for decades, but it’s only with the recent availability of large amounts of data and advances in accelerators and compute power that this AI revolution finally became possible. This, combined with a massive increase in model parameters reaching billions, has brought about a remarkable shift.

Imagine a phase transition in physics: the same substance suddenly exhibits completely new properties. That’s what happened with AI, revealing new capabilities that were previously unimaginable, such as advanced natural language processing (NLP) and the ability to generate coherent and contextual responses. Small steps in AI development led to significant impacts, as we have seen over the past few years when interest in generative AI models and their diverse applications has exploded.

While this early pioneering era is exciting, it is also extremely demanding. As of early 2026, you can find millions of generative AI models on the Hugging Face Hub, the central repository of the AI community, for various applications. Once you choose a model, the main question for application developers and Machine Learning Operations (MLOps) engineers is how to operate these models in production systems. Nonfunctional aspects such as resilience, scalability, security, and above all, operational cost, are paramount. The challenge of bringing a model from experimentation (such as a Jupyter Notebook) into production is not trivial.

Fortunately, a distributed software platform has emerged in recent years to manage various types of workloads in a scalable and resilient manner: Kubernetes. When Kubernetes was introduced in 2014, generative AI was still a distant concept. Kubernetes initially excelled as a platform for stateless (web) applications and microservices, but it has evolved into a reliable foundation for running stateful applications such as databases and messaging systems. A similar evolution is underway for the
specific requirements of large language models (LLMs) with their enormous data structures and special hardware needs. This book examines the various challenges and solutions for operating generative AI in general and LLMs in particular.

Why We Wrote This Book

Our motivation to write this book stems from the growing need to bridge the gap between Kubernetes experts and the emerging demands of running LLMs in production. As LLMs have become increasingly essential in various industries, the challenge is no longer just about developing these models but also about deploying, scaling, and maintaining them effectively in real-world production environments.

We approach LLM workloads as black boxes by acknowledging their operational complexity without requiring the deep insights of a data scientist. This perspective is crucial for Kubernetes practitioners who want to operationalize these models without delving into the details of machine learning (ML). By focusing on Kubernetes as the underlying platform, we provide practical guidance on how to use Kubernetes to meet the unique requirements of LLMs, ensuring they run efficiently, securely, and at scale.

This book is our contribution to helping you with the challenges of operationalizing generative AI on Kubernetes, empowering you to bring LLMs and AI-driven applications into production with confidence.

Kubernetes

Kubernetes, also referred to as K8s, is a container orchestration platform designed to automate the deployment, scaling, and management of containerized applications. Initially focusing on stateless applications, it has evolved to support stateful workloads such as databases and messaging systems. Today, Kubernetes stands out as the dominant operational platform for a wide array of traditional workloads, and it is increasingly pivotal in the AI domain.

Several pioneering initiatives and organizations have chosen Kubernetes to power their AI workloads, benefiting from its robust scalability and resilience. For instance, companies like Google and OpenAI leverage Kubernetes to manage their complex machine learning pipelines and deployment processes.

Kubernetes abstracts and automates many operational aspects, such as scaling, load balancing, and self-healing. This allows developers and MLOps engineers to focus on domain-specific tasks without worrying about the underlying infrastructure. Its support for declarative configuration and infrastructure as code, which can be leveraged with GitOps, ensures consistency and reliability across deployments.
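[Editor’s note: to make the declarative model concrete, here is a minimal sketch of a manifest for a model-serving workload. The image, model name, and resource values are illustrative assumptions, not an example taken from this book.]

    # Minimal, illustrative Deployment for an LLM inference server.
    # Image, model, and resource values are assumptions for this sketch.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm-server
      template:
        metadata:
          labels:
            app: llm-server
        spec:
          containers:
            - name: server
              image: vllm/vllm-openai:latest  # hypothetical model-server image
              args: ["--model", "facebook/opt-125m"]
              ports:
                - containerPort: 8000
              resources:
                limits:
                  nvidia.com/gpu: 1  # one GPU, allocated via the GPU device plug-in

Applying such a manifest with kubectl apply hands the desired state to Kubernetes, which then takes care of scheduling, restarts, and rollouts; this is the GitOps-friendly workflow that the rest of the book builds on for LLM workloads.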
One of Kubernetes’ most significant strengths lies in its ability to compose larger applications that encompass multiple types of workloads, including serving LLMs. While specialized platforms like Ray or Spark excel at running specific ML and AI workloads, they are purpose-built for these use cases and do not provide the same level of native integration for diverse workload types that Kubernetes offers. Kubernetes, on the other hand, can seamlessly manage AI models alongside traditional business applications, databases, and microservices. This holistic approach not only simplifies operations but also enhances the efficiency of developing and deploying complex applications that require different types of workloads to work together smoothly.

Generative AI

The history of generative AI is a fascinating journey that spans several decades, marked by groundbreaking innovations and rapid technological advancements. Its roots can be traced back to the mid-20th century, with early foundations laid by pioneers like Claude E. Shannon, who introduced the concept of sequences of letters or words to predict subsequent characters in a string for text generation in 1948, and Alan Turing, whose 1950 paper proposed the famous Turing Test.

The true revolution in generative AI (GenAI), however, began in the 2010s with the advent of deep learning techniques. The 2012 AlexNet breakthrough on the ImageNet dataset proved that deep neural networks could work at scale, catalyzing the broader deep learning revolution. Building on this foundation, the introduction of Generative Adversarial Networks (GANs) by Ian Goodfellow in 2014 [1] marked a pivotal moment for generative AI specifically, enabling the creation of highly realistic synthetic data across various modalities. This was followed by the development of the Transformer model by Google researchers in 2017 [2], which revolutionized natural language processing.

[1] Ian Goodfellow et al., “Generative Adversarial Nets,” Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014, pp. 2672-2680.
[2] Ashish Vaswani et al., “Attention Is All You Need,” Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017, pp. 5998-6008.

The subsequent years saw the rapid evolution of LLMs, with OpenAI’s GPT series, particularly GPT-3 and its successors, demonstrating unprecedented capabilities in text generation, language understanding, and even code writing. The launch of ChatGPT in 2022 brought LLMs into the mainstream, sparking widespread public interest and debate about the potential and implications of this technology. As we move forward, the field continues to evolve at a breakneck pace, with ongoing advancements pushing the boundaries of what’s possible.

However, only recently have the capabilities of generative AI models reached a level where they can be offered as services for a wider audience. As a consequence, people started to think about the best way to run GenAI in general, and specifically, LLM workloads in production.
The journey of bringing GenAI workloads into production has evolved significantly. Initially, AI models were deployed ad hoc, with bespoke scripts and manual processes. As the field matured, frameworks like TensorFlow Serving and tools like MLflow emerged to streamline the deployment experience. However, the operational challenges of managing these workloads at scale required more sophisticated solutions. Kubernetes, with its powerful orchestration capabilities, began to play a crucial role in managing ML workloads, providing a scalable and resilient platform for deployment. Unlike traditional ML models, LLMs require specialized infrastructure, including high-performance GPUs and distributed computing environments, to handle their size and computational demands.

Deploying generative AI models, particularly LLMs, in a production environment is far from straightforward. The operational challenges are significant. Generative AI models require vast amounts of training data. After deployment, while individual inference requests typically process small amounts of data (like user prompts), some production scenarios such as retrieval-augmented generation (RAG) or batch processing may involve handling larger datasets. Moreover, the effective usage of expensive accelerators like GPUs is critical. These scarce hardware resources are essential for the performance of LLMs, and the AI workload orchestration platform must ensure context-aware scheduling to optimize their utilization, ensuring efficient and cost-effective operation.

In this book, we will address these challenges and show how Kubernetes can be used to successfully deploy and manage generative AI models at scale.

How This Book Is Structured

This book guides you through operationalizing generative AI models on Kubernetes, and is organized into four parts that reflect the practical journey from initial deployment to production-scale AI applications. Whether you’re a Kubernetes practitioner encountering AI workloads for the first time, or an MLOps engineer seeking to leverage Kubernetes more effectively, you’ll find the content builds on your existing knowledge while introducing new concepts progressively.

Unlike traditional machine learning books that start with training, we begin with inference: deploying and serving pre-trained models. This reflects how most organizations adopt generative AI today. You typically start with existing foundation models rather than building from scratch, making model serving the natural entry point for bringing AI capabilities into production.
We have organized the content as follows:

The book opens with the Introduction, which examines the operational challenges of running generative AI at scale and includes an optional technical primer on LLM fundamentals: tokenization, embeddings, and the two-phase inference process. This primer helps you understand operational metrics without requiring deep machine learning expertise, though you can skip it and treat LLMs as pure black boxes.

The four parts that follow are:

• Part I, “Inference”, establishes the foundation for deploying and serving LLMs. You’ll learn how model size, storage requirements, and initialization time create unique challenges compared to traditional workloads. These chapters cover packaging models in containers, managing multigigabyte model weights in persistent storage, and handling workloads that require minutes to become ready. The focus is on getting your first generative AI service running reliably.

• Part II, “Production Readiness”, addresses what happens after successful deployment. GPU resource management becomes critical as you learn to schedule scarce accelerators efficiently and maximize their utilization. You’ll then explore scaling strategies that account for model warm-up times, rolling updates that maintain service availability, and optimization techniques that balance performance with cost. The final chapter covers LLM observability, showing how to track metrics beyond CPU and memory: token throughput, prompt latency, inference costs, and model accuracy.

• Part III, “Tuning”, shifts focus to model customization. Fine-tuning adapts pre-trained models to specific domains or tasks, but introduces intense resource demands. A single tuning job may require multiple GPUs working in concert, consuming significant cluster resources. You’ll explore techniques like LoRA and PEFT that make customization more efficient, along with the operational challenges of managing tuning jobs on Kubernetes: job scheduling, quota allocation, resource management, and GPU configuration optimization.

• Part IV, “AI-Driven Apps”, shows how to build complete applications around LLM services. These chapters present architectural patterns for AI-driven systems, from chat interfaces and event-driven backends to retrieval-augmented generation that enhances model responses with domain-specific knowledge. You’ll explore agentic workflows where models coordinate tool invocation and multistep reasoning, then tackle the production challenges unique to agentic systems: security, state management, observability, cost control, and reliability. The final chapter introduces protocols like MCP and A2A that standardize tool and agent communication.

Each chapter builds on concepts introduced earlier, while remaining approachable for selective reading. If you need to optimize GPU utilization immediately, jump to
Chapter 3, “Kubernetes and GPUs”. If you’re architecting AI-enabled applications, Part IV, “AI-Driven Apps”, provides the patterns you need. Linear readers will find a natural progression from deployment fundamentals through production operations to advanced applications.

Throughout the book, we maintain a practical operational perspective. You don’t need deep knowledge of transformer architectures or neural network mathematics, just as you don’t need to understand database internals to run PostgreSQL on Kubernetes. We treat LLMs as specialized workloads with unique requirements, showing you how to meet those requirements using platform capabilities and ecosystem tools.

Who This Book Is For

This book is designed for MLOps practitioners, operational folks tasked with running AI workloads at scale in production, and architects who need to understand the unique architectural constraints of managing large AI workloads. The goal is to provide these professionals with practical insights and tools to operationalize generative AI effectively on Kubernetes.

However, it’s important to clarify who might not find this book ideal. This book does not directly address data scientists focused on the algorithmic aspects of LLMs. For those interested in the mathematical foundations and detailed workings of LLMs, we recommend Generative Deep Learning by David Foster (O’Reilly, 2023). However, curious data scientists can still benefit from this book by learning how their artifacts can be run in production to serve the real world.

This book assumes you have a basic understanding of Kubernetes. It is not an introduction to Kubernetes, and some familiarity with its concepts and features is required. If you need a more comprehensive foundation, we suggest Kubernetes in Action by Marko Lukša (Manning, 2018) or Kubernetes Patterns by Bilgin Ibryam and Roland Huß (O’Reilly, 2023) for a deeper dive into Kubernetes principles and best practices.

The insights shared in this book dive into the rapidly expanding landscape of productizing generative AI on Kubernetes. We are on the same journey, presenting and demonstrating emerging principles and patterns.

What You Will Learn

In this book, you will explore how to leverage Kubernetes to operationalize generative AI models, addressing the unique challenges and solutions required to run LLMs effectively on this platform. We will demonstrate why Kubernetes is an excellent choice for running complex applications that integrate AI models and conventional business logic, ensuring a seamless, efficient, and scalable deployment process.
You’ll gain insights into the best practices, tools, and techniques needed to optimize your generative AI models in production. We’ll provide a snapshot of the tool landscape as it stands in 2026. While the ecosystem remains dynamic, you can expect insights into enduring players like Ray, Kubeflow, and vLLM, which are likely to survive the initial gold rush of generative AI tools. This perspective will help you navigate and choose the right tools for your needs.

Furthermore, you’ll learn how Kubernetes plays a pivotal role in scaling, resource management, and orchestration for AI workloads. By the end of this book, you will have a comprehensive understanding of how to overcome the operational hurdles of deploying LLMs on Kubernetes and how to manage and deploy AI applications efficiently.

The chapters focus on practical use cases, lessons learned, and best practices, aiming to equip you with the knowledge and tools to confidently transition from development to production. With plenty of examples and detailed explanations, you’ll gain hands-on experience in setting up and maintaining a robust infrastructure for your AI projects.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

This element signifies a tip or suggestion.

This element signifies a general note.
This element indicates a warning or caution.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
141 Stony Circle, Suite 195
Santa Rosa, CA 95401
800-889-8969 (in the United States or Canada)
707-827-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://oreilly.com/about/contact.html

We have a web page for this book, where we list errata and any additional information. You can access this page at https://oreil.ly/genAI-kubernetes.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly.

Watch us on YouTube: https://youtube.com/oreillymedia.