MANNING

Latency
Reduce delay in software systems

Pekka Enberg
Latency principles inside this book

Latency principle                                                          Section
Latency is the time delay between a cause and its observed effect.         1.1
Latency constants for CPU, memory, and I/O, for ballpark estimations       1.2
Human perception latency constants for design targets                      1.3.1
Little’s law connects latency, throughput, and concurrency.                2.1.1
Amdahl’s law shows speedup from parallelization.                           2.1.2
Latency is a distribution, not a single value.                             2.2
Common sources of latency                                                  2.3
Every component of latency compounds                                       2.4
Measuring latency correctly                                                2.5
Geographical and last-mile latency are major bottlenecks.                  3.2.1
Consistency models determine a baseline for latency.                       4.3
Replication strategies determine latency, availability, and scalability.   4.4
State machine replication trades off latency for strong consistency.       4.6
Physical partitioning slices data into smaller sets for lower latency.     5.2
Logical partitioning slices data based on workload for lower latency.      5.3
Request routing determines effective utilization of partitions.            5.4
Partition imbalances harm latency and efficiency.                          5.5

(Continued on inside back cover)
Pekka Enberg

Latency
Reduce delay in software systems

MANNING
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

© 2026 Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine. ∞

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

ISBN 9781633438088
Printed in the United States of America

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.
Development editor: Katie Sposato Johnson
Technical editors: Timur Doumler and Behrad Babaee
Review editor: Angelina Lazukić
Production editor: Andy Marinkovich
Copy editor: Andy Carroll
Proofreader: Melody Dolab
Technical proofreader: Serge Simon
Typesetter: Tamara Švelić Sabljić
Cover designer: Marija Tudor
brief contents

Part 1  Basics ... 1
  1  Introduction  3
  2  Modeling and measuring latency  14

Part 2  Data ... 37
  3  Colocation  39
  4  Replication  58
  5  Partitioning  76
  6  Caching  97

Part 3  Compute ... 119
  7  Eliminating work  121
  8  Wait-free synchronization  145
  9  Exploiting concurrency  171

Part 4  Hiding latency ... 193
  10  Asynchronous processing  195
  11  Predictive techniques  213

appendix  Further reading  230
contents

preface  xii
acknowledgments  xiii
about this book  xv
about the author  xviii
about the cover illustration  xix

Part 1  Basics ... 1

1  Introduction  3
   1.1  What is latency?  4
   1.2  How is latency measured?  6
   1.3  Why does latency matter?  8
        User experience 8 ■ Real-time systems 9 ■ Efficiency 9
   1.4  What latency is not  10
   1.5  Latency vs. bandwidth  10
   1.6  Latency vs. energy  12

2  Modeling and measuring latency  14
   2.1  Laws of latency  15
        Little’s law 15 ■ Amdahl’s law 18
   2.2  Latency distribution  19
   2.3  Common sources of latency  21
        Physics 22 ■ CPU and hardware 22 ■ Virtualization 24 ■ Operating system, drivers, and firmware 25 ■ Managed runtime 26 ■ Application 26
   2.4  Compounding latency  26
   2.5  Measuring latency  29
   2.6  Putting it together: Measuring network latency  30
        Plotting with histograms 31 ■ Plotting with eCDF 33

Part 2  Data ... 37

3  Colocation  39
   3.1  Why colocate?  40
   3.2  Internode latency  41
        Geographical and last-mile latency 42 ■ Edge computing and CDNs 44
   3.3  Intranode latency  45
        Network stack 45 ■ TCP/IP protocol 47 ■ Kernel-bypass networking 48
   3.4  Multicore architecture  49
   3.5  Putting it together: REST API with embedded database  51

4  Replication  58
   4.1  Why replicate data?  59
   4.2  Availability and scalability  60
   4.3  Consistency model  61
        Strong consistency 61 ■ Eventual consistency 62 ■ Other consistency models 64
   4.4  Replication strategies  64
        Single-leader replication 64 ■ Multi-leader replication 65 ■ Leaderless replication 66 ■ Read-your-writes property 67 ■ Local-first approach 67
   4.5  Asynchronous vs. synchronous replication  68
   4.6  State machine replication  69
   4.7  Case study: Viewstamped Replication  70
   4.8  Putting it together: Replicating a key–value store  72
5  Partitioning  76
   5.1  Why partition data?  77
   5.2  Physical partitioning strategies  79
        Horizontal partitioning 79 ■ Vertical partitioning 84 ■ Hybrid partitioning 85
   5.3  Logical partitioning strategies  86
        Functional partitioning 87 ■ Geographical partitioning 87 ■ User-based partitioning 88 ■ Time-based partitioning 88 ■ Overpartitioning 88
   5.4  Request routing  89
        Direct routing 89 ■ Proxy routing 89 ■ Forward routing 90
   5.5  Partition imbalance  90
        Hot partitions 91 ■ Skewed workloads 91
   5.6  Putting it together: Horizontal partitioning with SQLite  92

6  Caching  97
   6.1  Why cache data?  98
   6.2  Caching overview  98
   6.3  Caching strategies  100
        Cache-aside caching 100 ■ Read-through caching 101 ■ Write-through caching 102 ■ Write-behind caching 103 ■ Client-side caching 104 ■ Distributed caching 104
   6.4  Cache coherency  104
   6.5  Cache hit ratio  106
   6.6  Cache replacement  109
        Least recently used (LRU) 110 ■ Least frequently used (LFU) 110 ■ First-in, first-out (FIFO) and SIEVE 111
   6.7  Time-to-live (TTL)  112
   6.8  Materialized views  113
   6.9  Memoization  114
   6.10  Putting it together: In-application caching with Moka  114
Part 3  Compute ... 119

7  Eliminating work  121
   7.1  Ways of eliminating work  122
   7.2  Algorithmic complexity  123
   7.3  Serializing and deserializing  125
   7.4  Memory management  127
        Dynamic memory allocation 128 ■ Garbage collection 129 ■ Virtual and physical memory 130 ■ Demand paging 132 ■ Memory topology 134
   7.5  Operating system overhead  134
        Scheduling delay and context switching 134 ■ Background tasks and interrupts 135 ■ Network stack 136
   7.6  Precomputation  137
   7.7  Putting it together: Benchmarking with Criterion  138

8  Wait-free synchronization  145
   8.1  Mutual exclusion  146
        Mutexes 147 ■ Read–write locks 147 ■ Spinlocks 148
   8.2  Problems with mutual exclusion  148
        Inefficiency 148 ■ Priority inversion 150 ■ Convoying 150 ■ Deadlocks 151
   8.3  Atomics  152
        Atomic operations 152 ■ Anatomy of a spinlock 153
   8.4  Memory barriers  155
        Types of memory barriers 156 ■ Compiler barriers 158 ■ Memory reordering example 158
   8.5  Wait-free synchronization  160
        Progress conditions 161 ■ Consensus number 163 ■ Wait-free queues 164 ■ Wait-free stacks 164 ■ Wait-free linked-lists 165
   8.6  Putting it together: Building a single-producer, single-consumer queue  166
9  Exploiting concurrency  171
   9.1  Concurrency and parallelism  172
   9.2  Concurrency models  174
        Threads 175 ■ Fibers 177 ■ Coroutines 177 ■ Event-driven concurrency 179 ■ Futures and promises 181 ■ Actor model 182
   9.3  Parallel processing  183
        Data parallelism 183 ■ Task parallelism 185
   9.4  Transactions  185
        Serializability 186 ■ Snapshot isolation 187 ■ Data anomalies and weaker isolation 188
   9.5  Concurrency control  189
        Two-phase locking 189 ■ Multiversion concurrency control 189
   9.6  Putting it together: Sequential vs. concurrent execution  190

Part 4  Hiding latency ... 193

10  Asynchronous processing  195
   10.1  Fundamentals  196
         Asynchronous vs. synchronous processing 196 ■ The event loop 199 ■ Challenges 202
   10.2  Asynchronous I/O  202
         I/O multiplexing 203 ■ Request batching 203 ■ Request hedging 203 ■ Buffered I/O 204 ■ Memory mapping 205
   10.3  Deferring work  205
         Task scheduling 205 ■ Priority queues 206 ■ Work stealing 206
   10.4  Resource management  206
         Thread pools 206 ■ Memory pools 207 ■ Connection pools 207
   10.5  Managing concurrency with backpressure  208
         Controlling the producer 209 ■ Buffering 209 ■ Dropping and rate limiting 209
   10.6  Error handling  210
         Partial errors 210 ■ Recovery 210 ■ Timeouts and cancellation 210
   10.7  Observability  211
         Tracing 211 ■ Metrics 211

11  Predictive techniques  213
   11.1  Introduction to predictive techniques  214
   11.2  Prefetching  215
         Pattern-based prefetching 216 ■ Semantic prefetching 219
   11.3  Optimistic updates  219
         Optimistic view 220 ■ Synchronizing optimistic updates 220 ■ Consistency guarantees 223 ■ Error handling and rollbacks 223
   11.4  Speculative execution  224
         Incremental computation 225 ■ Parallel speculation 226 ■ Value prediction 227
   11.5  Predictive resource allocation  227
         Overprovisioning 228 ■ Prewarming 228

appendix  Further reading  230
index  236
preface

Over the years, I’ve worked on many latency-related problems, and I’ve often had to figure things out on the fly, first identifying where the latency was coming from and then figuring out how to fix it. There’s plenty of useful information scattered across the internet in blog posts, mailing lists, and forum discussions, but I never had a comprehensive resource to turn to when designing and optimizing for low latency. The existing performance books, while excellent, focus on making programs run faster by reducing CPU usage or improving algorithmic efficiency. They miss the bigger picture of latency optimization techniques like colocation, replication strategies, and wait-free synchronization, which can have a far more dramatic impact on response times.

Latency: Reduce delay in software systems fills that gap. It’s the systematic guide I wish I’d had when I first started tackling latency problems. It brings together the scattered knowledge, ranging from hardware optimization to distributed systems design, into one practical resource.
acknowledgments

First and foremost, thanks to my wife, Minna, and our children, Isak, Noah, and Elsa, for putting up with me disappearing into my office to write this book. Also, thanks to my mom, Erja, and dad, Rainer, for getting a computer at our home three decades ago, which put me on this path in the first place.

This book would honestly not exist without the fantastic Manning team: Suresh Jain for his persistence in convincing me to write a book proposal, Michael Stephens for shaping the idea into a comprehensive guide to low-latency patterns, and Katie Sposato Johnson for keeping me on track through the challenging process of writing while doing a million other things, like starting a company and pursuing a PhD. Thanks to Behrad Babaee and Timur Doumler, my technical editors, for making the book so much better by relentlessly making sure that what I was writing was not only technically accurate but also clearly written.

I’m also grateful to the people who shaped my understanding of latency optimization: Ashwin Rao, who taught me to model and measure latency as a distribution during my master’s thesis work; Christoph Lameter, who showed me how to write truly latency-sensitive code while maintaining Linux kernel memory allocators; and Avi Kivity, who expanded my understanding of building for low latency—from Little’s law to thread-per-core architectures—during my time working with him and the team on the OSv unikernel and Scylla database.

To all the reviewers—Alex Rios, Alex Yu, Anindya Dey, Anurag Kumar Jain, Arijit Dasgupta, Arjun Mullick, Arjun Sk, Artur Baruchi, Burhan ul Haq, Charles Chan, Christian Bach, Fernando Bernardino, Filipe Teixeira, James Watson, Jens Christian Bredahl Madsen, João Marcelo Borovina Josko, Johannes Lochmann, Jonathan R. Martin, Jorge Bo, Kanak Kshetri, Karel Rank, Khrystyna Terletska, Lakshminarayanan AS, Peter Hampton, Ramzi Maâlej, Rene G. Perrin, Richard Vaughan, Satadur Roy, Shubham Patel, Stefan Turalski, Thad Meyer, Timothy Beck, and Valerie Parham-Thompson—your suggestions helped make this a better book. Thank you everyone!
about this book

Latency: Reduce delay in software systems was written to help you tackle the complex challenges of reducing delay in software systems. It provides a comprehensive, practical guide to understanding, measuring, and optimizing latency across all layers of your application stack.

Who should read this book?

Latency: Reduce delay in software systems is for software developers who need to solve latency-related problems in their applications. Whether you’re building high-frequency trading systems, real-time gaming platforms, interactive web applications, or any system where response time matters, this book will give you the tools and knowledge to succeed.

The book assumes you have a working knowledge of building applications and backends, along with some basic understanding of distributed systems and databases. With this foundation, you’ll be able to follow all the concepts and techniques presented. However, the book goes deep enough into the implementation details and theoretical foundations that even experienced developers who have done some latency optimization work will discover new techniques and fill gaps in their knowledge. Each chapter includes practical examples, code implementations, and real-world case studies to reinforce the concepts.

How this book is organized: A roadmap

This book is structured to first present techniques that address the largest sources of latency and provide the most significant improvements, and then progress to more specialized optimizations with diminishing returns. This approach allows you to prioritize your optimization efforts for maximum impact.

Part 1 establishes the fundamental concepts you need to understand latency optimization. You should read this part regardless of your background, as it provides the foundation for everything that follows:

■ Chapter 1 defines what latency is, why it matters for the user experience and system efficiency, and how it differs from bandwidth.
■ Chapter 2 covers essential modeling techniques like Little’s law and Amdahl’s law, explains latency distributions, and teaches you how to measure and visualize latency in your systems properly.

Part 2 focuses on optimizations related to data storage and access patterns:

■ Chapter 3 explores colocation strategies, from geographical considerations to intranode optimizations, including kernel-bypass networking.
■ Chapter 4 covers replication techniques, consistency models, and approaches such as single-leader and multi-leader replication.
■ Chapter 5 discusses partitioning strategies, both physical (horizontal, vertical) and logical (functional, geographical, time-based), plus request-routing techniques.
■ Chapter 6 rounds out this part with comprehensive caching strategies, from cache-aside to distributed caching, along with coherency and replacement policies.

Part 3 addresses optimizations in your application code and processing logic:

■ Chapter 7 teaches you how to eliminate unnecessary work through algorithmic improvements, better memory management, and precomputation techniques.
■ Chapter 8 dives deep into wait-free synchronization, covering atomic operations, memory barriers, and lock-free data structures.
■ Chapter 9 explores concurrency models, parallel processing techniques, and transaction management to maximize your system’s processing capabilities.
Part 4 presents techniques for when you can’t eliminate latency but need to minimize its impact:

■ Chapter 10 covers asynchronous processing fundamentals, including event loops, I/O multiplexing, request batching, and resource management.
■ Chapter 11 explores predictive techniques like prefetching, optimistic updates, and speculative execution that can make your system feel more responsive even when the underlying operations have inherent delays.

Each chapter follows a consistent pattern: it explains the theory behind the technique, provides practical implementation guidance, and concludes with a “Putting it together” section that demonstrates the concept with working code examples. The book also includes key patterns and principles that you can apply across different technologies and architectures. Whether you read the book from cover to cover or focus on specific areas relevant to your current challenges, you’ll gain both the theoretical understanding and practical skills needed to build truly low-latency systems.

About the code

This book contains source code examples as both listings and in line with the text, formatted in a fixed-width font. To make the examples easy to follow in the text, they are edited for brevity in the book. You can get executable snippets of code from the liveBook (online) version of this book at https://livebook.manning.com/book/latency. The complete code for the examples in the book is available for download from the Manning website at www.manning.com and from GitHub at https://github.com/penberg/latency-book with instructions on how to build and run them.

Throughout the book, I use Rust and Python as the two languages to showcase examples of low-latency techniques and how to evaluate them. The examples are written in a way that hopefully makes them easy to follow, even for people not familiar with Rust. However, if you want to learn more Rust, I recommend checking out The Rust Programming Language by Steve Klabnik and Carol Nichols (No Starch Press, 2022, and also online at https://doc.rust-lang.org/book/).

liveBook discussion forum

Purchase of Latency: Reduce delay in software systems includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/latency/discussion.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website for as long as the book is in print.
about the author

Pekka Enberg has been working on systems involving low latency for nearly two decades. Previously, he worked on the Linux kernel as a maintainer of the dynamic memory allocator subsystem. Pekka also worked as an early employee at ScyllaDB, building a low-latency, high-throughput Apache Cassandra-compatible distributed database. Today, Pekka works at Turso, creating the next evolution of SQLite. You can learn more about Pekka at https://penberg.org.