
Author: Chris Fregly

In today's era of ever-growing generative models, AI Systems Performance Engineering equips professionals with actionable strategies to co-optimize hardware, software, and algorithms for high-performance and cost-effective AI systems. Authored by Chris Fregly, a performance-focused engineering and product leader, this comprehensive resource transforms complex systems into streamlined, high-impact AI solutions. Whether you're an engineer, researcher, or developer, this book offers a holistic roadmap for building resilient, scalable, and cost-effective AI systems that excel in both training and inference.

Publisher: O'Reilly Media, Inc.
Publish Year: 2025
Language: English
File Format: PDF
File Size: 12.1 MB
Text Preview (First 20 pages)

AI Systems Performance Engineering
Optimizing Hardware, Software, and Algorithms for Efficient Training and Inference

With Early Release ebooks, you get books in their earliest form, the author's raw and unedited content as they write, so you can take advantage of these technologies long before the official release of these titles.

Chris Fregly
AI Systems Performance Engineering
by Chris Fregly

Copyright © 2025 Flux Capacitor, LLC. All rights reserved.
Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Angela Rufino and Nicole Butterfield
Production Editor: Kristen Brown
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2025: First Edition
Revision History for the Early Release

2025-04-18: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9798341627789 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. AI Systems Performance Engineering, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
979-8-341-62778-9 [LSI]
Brief Table of Contents (Not Yet Final)

Preface (available)
Chapter 1: Introduction and AI System Overview (available)
Chapter 2: AI System Hardware Overview (available)
Chapter 3: OS, Docker, and Kubernetes Tuning for GPU-based Environments (available)
Chapter 4: Distributed Communication and I/O Optimizations (available)
Chapter 5: CUDA Programming, Profiling, and Debugging (unavailable)
Chapter 6: Optimizing CUDA Performance (unavailable)
Chapter 7: PyTorch Profiling and Tuning (unavailable)
Chapter 8: Distributed Training at Ultra-Scale (unavailable)
Chapter 9: Multi-Node Inference Optimizations (unavailable)
Chapter 10: AI System Optimization Case Studies (available)
Chapter 11: Future Trends in Ultra-Scale AI Systems Performance Engineering (available)
Chapter 12: AI Systems Performance Checklist (175+ Items) (available)
Preface

In the vibrant streets of San Francisco, where innovation is as common as traffic on the 101 highway, I find myself immersed in the awesome world of artificial intelligence. The rapid advancements in AI are redefining the very fabric of our daily lives. From personalization and recommendation engines in the 2000s and 2010s to AI assistants and autonomous vehicles in the 2020s and 2030s, AI's influence is pervasive and profound.

My journey into this fast-moving field was driven by a curiosity to understand the intricate balance between hardware and software that powers AI systems. A few years ago, it became evident that the performance of AI applications depends not only on sophisticated algorithms but also on the underlying hardware and software that support them. The synergy and co-design between cutting-edge hardware, meticulous software, and clever algorithms is critical to achieving unprecedented levels of efficiency and scalability.

This realization inspired me to dive deep into the realm of "full-stack" AI performance engineering. I wanted to understand how the many components, including processors, memory architectures, network interconnects, operating systems, and software frameworks, interact to create robust and efficient AI systems. The complexity of these interactions presented both challenges and opportunities, fueling my desire to unravel the intricacies of this unique combination of technologies.

This book is the realization of that initial exploration as well as many years of hands-on ML and AI system design experience. It is created for engineers, researchers, and enthusiasts who are eager to understand the underpinnings of AI systems performance at all levels. Whether you're building AI applications, optimizing neural network training strategies, designing and managing scalable inference servers, or simply fascinated by the mechanics of modern AI systems, this book provides insights that bridge theory and practice across multiple disciplines.

Throughout the chapters, we will embark on a journey that examines the evolution of hardware architectures, dives into the nuances of software optimization, and explores real-world case studies that highlight the patterns and best practices of building both high-performance and cost-efficient AI systems. Each section is designed to build upon the last, creating a cohesive narrative from foundational concepts to advanced applications.
Although I'm the sole author of this book, this was not a solo endeavor. I am indebted to the brilliant minds whose research and innovations have paved the way for the topics covered in this book. Their contributions have been instrumental in shaping the content and depth of this work. To my colleagues, mentors, and reviewers who challenged my perspectives and enriched my understanding: your insights are embedded in every chapter. And to my family and friends whose continuous support kept me writing through the night and into the early morning hours, I extend my heartfelt gratitude.

As we continue to push the frontiers of artificial intelligence and supercomputing, the knowledge in this book aims to inspire, educate, and empower the reader. The journey through the complexities of AI systems performance engineering is not just a technical exploration; it is a reminder of human ingenuity, the need to understand our surroundings, and the desire to continuously improve through technology and innovation.

Few people in this world understand the fundamentals of co-designing hardware, software, and algorithms for maximum performance and efficiency. After reading this book, you will be one of them.

Chris Fregly
San Francisco, California
Chapter 1. Introduction and AI System Overview

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form, the author's raw and unedited content as they write, so you can take advantage of these technologies long before the official release of these titles. This will be the 1st chapter of the final book. If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor at arufino@oreilly.com.

In late 2024, a small startup in China called DeepSeek.AI stunned the AI community by training a frontier large language model without access to the latest, state-of-the-art NVIDIA GPUs at the time. Due to export restrictions, DeepSeek's engineers could not obtain top-tier NVIDIA A100 or H100 accelerators, so they resorted to locally available, less-capable NVIDIA chips.
Despite these limitations, DeepSeek.AI trained their DeepSeek-R1 model and achieved reasoning capabilities near the performance of leading frontier models that were trained on the most capable NVIDIA chips of the time. This case underscores that practitioners and researchers skilled in AI systems performance engineering can get the most out of their available hardware, no matter the constraints.

For example, DeepSeek's engineers treated communication bandwidth as a scarce resource, optimizing every byte over the wire to achieve what many thought impossible on that infrastructure. They scaled out to thousands of these constrained GPUs, connected with limited-bandwidth interconnects, using novel software and algorithmic optimizations to overcome these limitations.

Contrast DeepSeek's approach with the "brute force" path taken by the largest AI frontier labs in the U.S. and Europe. These labs continue to pursue larger compute clusters and larger models. Model sizes have exploded from millions to billions, and now to trillions of parameters. And while each 10× increase in scale has unlocked qualitatively new capabilities, it has come at tremendous cost in money and resources.
For instance, OpenAI's GPT-3 (175B parameters, 2020) cost on the order of $4 million to train, and GPT-4 (2023) required an estimated $78 million. Google's Gemini Ultra (2023) soared to a staggering ~$191 million. Figure 1-1 illustrates this ballooning of training expenses, from under $10M around 2019 to well over $100M by 2023 for state-of-the-art models.
Figure 1-1. The cost to train cutting-edge AI models has skyrocketed (Source: posts.voronoiapp.com)
DeepSeek claims that their DeepSeek-R1 model was trained for only $6 million in compute, an order of magnitude less than models like GPT-4 and Gemini, while matching the performance of rivals that cost orders of magnitude more to train. While there was some doubt about the validity of the $6 million claim, the announcement briefly shocked the U.S. financial markets, including NVIDIA's stock, which dropped 17% on the news amid concerns that less NVIDIA hardware would be needed in the future. While this market reaction was a bit overblown, it underscores the significant financial impact of such AI efficiency breakthroughs on the global financial markets.

Beyond model training, DeepSeek boasts significant inference efficiency gains through novel hardware-aware algorithmic improvements to the Transformer architecture, which powers most modern frontier large language models. DeepSeek has clearly demonstrated that clever AI systems performance engineering optimizations can upend the economics of ultra-scale AI model training and inference.

The takeaway is a profound realization that, at these scales, every bit of performance squeezed out of our systems can translate to millions, or even billions, of dollars saved. Every bottleneck eliminated can have an outsized impact on training throughput and inference latency. This, in turn, reduces cost and increases overall end-user happiness. In short, AI systems performance engineering isn't just about speed; it's about making the previously impossible both possible and affordable.

In Chapter 1, we embark on an in-depth exploration of the AI Systems Performance Engineer, a role that has become pivotal in the era of large-scale artificial intelligence. This chapter serves as a comprehensive guide to understanding the multifaceted responsibilities and the critical impact of this profession on modern AI systems.

We begin by tracing the evolution of AI workloads, highlighting the transition from traditional computing paradigms to the demands of contemporary AI applications. This context sets the stage for appreciating the necessity of specialized performance engineering in AI.

The chapter then dives into the core competencies required of an AI Systems Performance Engineer. We examine the technical proficiencies essential to the role, including a deep understanding of hardware architectures, software optimization techniques, and system-level integration. Additionally, we discuss the importance of soft skills such as problem-solving, communication, and collaboration, which are vital for navigating the interdisciplinary nature of AI projects.

A significant portion of the chapter is dedicated to the practical aspects of the role. We explore how performance engineers analyze system bottlenecks, implement optimization strategies, and ensure the scalability and reliability of AI systems. Real-world scenarios and case studies are presented to illustrate these concepts, providing tangible examples of challenges and solutions encountered in the field.

Furthermore, we discuss the tools and methodologies commonly employed by performance engineers, offering insights into performance testing, monitoring, and benchmarking practices. This includes an overview of industry-standard tools and how they are applied to assess and enhance system performance.

By the end of Chapter 1, readers will have a thorough understanding of the AI Systems Performance Engineer's role, the skills required to excel in this position, and the critical importance of performance engineering in the successful deployment and operation of AI systems. This foundational knowledge sets the stage for the subsequent chapters, where we delve deeper into specific techniques, technologies, and best practices that define excellence in AI performance engineering.
The AI Systems Performance Engineer

The AI Systems Performance Engineer is a specialized role focused on optimizing the performance of AI models and the underlying systems they run on. These engineers ensure that AI training and inference pipelines run fast, cost-efficiently, and with maximum performance. As scale increases, the AI Systems Performance Engineer becomes even more critical.

AI Systems Performance Engineers command top salaries, and for good reason: our work has a clear impact on the bottom line. We blend expertise across hardware, software, and algorithms. We must understand low-level OS considerations, memory hierarchies, networking fundamentals, and multiple languages like Python and C++. On any given day, an AI Systems Performance Engineer might be examining low-level GPU kernel efficiency, optimizing OS thread scheduling, analyzing memory access patterns, increasing network throughput efficiency, or debugging distributed training algorithms.

Key responsibilities of an AI Systems Performance Engineer include benchmarking, profiling, debugging, optimizing, scaling, and managing resources efficiently.

Benchmarking and Profiling

Benchmarking and profiling involves measuring latency, throughput, memory usage, and other performance metrics for AI models under various workloads. To identify bottlenecks, we must iteratively use profiling tools such as NVIDIA Nsight and the PyTorch profiler to track performance over time as we make controlled enhancements. It's important to set up automated performance tests to catch regressions early (see the profiling sketch below).

Debugging and optimizing requires that we trace performance issues to their root cause, whether it's a suboptimal CUDA kernel, unnecessary communication overhead, or an imbalance in our training or inference workload. In one case, we may want to use more-efficient matrix operations that take advantage of the latest Transformer Engine hardware. In another case, we can improve the software framework by configuring a higher degree of parallelism for our "embarrassingly parallel" inference workload.
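To make the benchmarking and profiling workflow above concrete, here is a minimal sketch using the PyTorch profiler (torch.profiler) to measure a few steady-state inference steps and surface the most expensive operators. The three-layer model, batch size, and step counts are placeholder assumptions chosen for illustration, not recommendations from the book; the profiler calls themselves (profile, record_function, key_averages) are standard PyTorch APIs.

    import torch
    from torch.profiler import profile, record_function, ProfilerActivity

    # Placeholder model and batch; substitute your own workload.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    )
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    batch = torch.randn(64, 1024, device=device)

    # Warm up so one-time costs (CUDA context creation, allocator
    # ramp-up) don't skew the measurements.
    with torch.no_grad():
        for _ in range(3):
            model(batch)

    activities = [ProfilerActivity.CPU]
    if device == "cuda":
        activities.append(ProfilerActivity.CUDA)

    # Profile several steady-state steps, recording tensor shapes and
    # memory usage alongside operator timings.
    with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
        with torch.no_grad():
            for _ in range(10):
                with record_function("inference_step"):
                    model(batch)

    # Print the most expensive operators; this is the starting point for
    # tracing a bottleneck to its root cause before making controlled changes.
    sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
    print(prof.key_averages().table(sort_by=sort_key, row_limit=10))

From here, prof.export_chrome_trace("trace.json") writes a timeline that can be inspected in a trace viewer, and running a script like this against a fixed workload in CI is one way to implement the automated regression checks mentioned above.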