Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale

Authors: Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty

Education

Every enterprise application creates data, whether it consists of log messages, metrics, user activity, or outgoing messages. Moving all this data is just as important as the data itself. With this updated edition, application architects, developers, and production engineers new to the Kafka streaming platform will learn how to handle data in motion. Additional chapters cover Kafka's AdminClient API, transactions, new security features, and tooling changes. Engineers from Confluent and LinkedIn responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream processing applications with this platform. Through detailed examples, you'll learn Kafka's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.

You'll examine:
• Best practices for deploying and configuring Kafka
• Kafka producers and consumers for writing and reading messages
• Patterns and use-case requirements to ensure reliable data delivery
• Best practices for building data pipelines and applications with Kafka
• How to perform monitoring, tuning, and maintenance tasks with Kafka in production
• The most critical metrics among Kafka's operational measurements
• Kafka's delivery capabilities for stream processing systems

📄 File Format: PDF
💾 File Size: 6.0 MB

📄 Text Preview (First 20 pages)


📄 Page 1
Kafka: The Definitive Guide
Real-Time Data and Stream Processing at Scale
Second Edition
Gwen Shapira, Todd Palino, Rajini Sivaram & Krit Petty
📄 Page 2
DATA | DATABASES

“A must-have for developers and operators alike. You need this book if you’re using or running Kafka.”
—Chris Riccomini, Software Engineer, Startup Advisor, and Coauthor of The Missing README

Kafka: The Definitive Guide
ISBN: 978-1-492-04308-9
US $69.99 | CAN $92.99
Twitter: @oreillymedia | facebook.com/oreilly

Every enterprise application creates data, whether it consists of log messages, metrics, user activity, or outgoing messages. Moving all this data is just as important as the data itself. With this updated edition, application architects, developers, and production engineers new to the Kafka streaming platform will learn how to handle data in motion. Additional chapters cover Kafka’s AdminClient API, transactions, new security features, and tooling changes.

Engineers from Confluent and LinkedIn responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.

You’ll examine:
• Best practices for deploying and configuring Kafka
• Kafka producers and consumers for writing and reading messages
• Patterns and use-case requirements to ensure reliable data delivery
• Best practices for building data pipelines and applications with Kafka
• How to perform monitoring, tuning, and maintenance tasks with Kafka in production
• The most critical metrics among Kafka’s operational measurements
• Kafka’s delivery capabilities for stream processing systems

Gwen Shapira is an engineering leader at Confluent and manages the cloud native Kafka team, which is responsible for Kafka performance, elasticity, and multitenancy. Todd Palino, principal staff engineer in site reliability at LinkedIn, is responsible for capacity and efficiency planning. Rajini Sivaram is a principal engineer at Confluent, designing and developing cross-cluster replication and security features for Kafka. Krit Petty is the site reliability engineering manager for Kafka at LinkedIn.
📄 Page 3
Praise for Kafka: The Definitive Guide

Kafka: The Definitive Guide has everything you need to know to get the most from Kafka, whether in the cloud or on-prem. A must-have for developers and operators alike. Gwen, Todd, Rajini, and Krit jam years of wisdom into one concise book. You need this book if you’re using or running Kafka.
—Chris Riccomini, software engineer, startup advisor, and coauthor of The Missing README

A comprehensive guide to the fundamentals of Kafka and how to operationalize it.
—Sumant Tambe, senior software engineer at LinkedIn

This book is an essential read for any Kafka developer or administrator. Read it cover to cover to immerse yourself in its details, or keep it on hand for quick reference. Either way, its clarity of writing and technical accuracy is superb.
—Robin Moffatt, staff developer advocate at Confluent

This is foundational literature for all engineers interested in Kafka. It was critical in helping Robinhood navigate the scaling, upgrading, and tuning of Kafka to support our rapid user growth.
—Jaren M. Glover, early engineer at Robinhood, angel investor

A must-read for everyone who works with Apache Kafka: developer or admin, beginner or expert, user or contributor.
—Matthias J. Sax, software engineer at Confluent and Apache Kafka PMC member
📄 Page 4
Great guidance for any team seriously using Apache Kafka in production, and engineers working on distributed systems in general. This book goes far beyond the usual introductory-level coverage and into how Kafka actually works, how it should be used, and where the pitfalls lie. For every great Kafka feature, the authors clearly list the caveats you’d only hear about from grizzled Kafka veterans. This information is not easily available in one place anywhere else. The clarity and depth of explanations is such that I would even recommend it to engineers who do not use Kafka: learning about the principles, design choices, and operational gotchas will help them make better decisions when creating other systems.
—Dmitriy Ryaboy, VP of software engineering at Zymergen
📄 Page 5
Kafka: The Definitive Guide
Real-Time Data and Stream Processing at Scale
SECOND EDITION
Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty
Beijing • Boston • Farnham • Sebastopol • Tokyo
📄 Page 6
Kafka: The Definitive Guide
by Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty

Copyright © 2022 Chen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Gary O’Brien
Production Editor: Kate Galloway
Copyeditor: Sonia Saruba
Proofreader: Piper Editorial Consulting, LLC
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2017: First Edition
November 2021: Second Edition

Revision History for the Second Edition
2021-11-05: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492043089 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Kafka: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Confluent. See our statement of editorial independence.

ISBN: 978-1-492-04308-9 [LSI]
📄 Page 7
Table of Contents

Foreword to the Second Edition . . . . . xv
Foreword to the First Edition . . . . . xvii
Preface . . . . . xxi

1. Meet Kafka . . . . . 1
    Publish/Subscribe Messaging 1
    How It Starts 2
    Individual Queue Systems 3
    Enter Kafka 4
    Messages and Batches 4
    Schemas 5
    Topics and Partitions 5
    Producers and Consumers 6
    Brokers and Clusters 8
    Multiple Clusters 9
    Why Kafka? 10
    Multiple Producers 10
    Multiple Consumers 10
    Disk-Based Retention 11
    Scalable 11
    High Performance 11
    Platform Features 11
    The Data Ecosystem 12
    Use Cases 12
    Kafka’s Origin 14
📄 Page 8
    LinkedIn’s Problem 14
    The Birth of Kafka 15
    Open Source 16
    Commercial Engagement 16
    The Name 17
    Getting Started with Kafka 17

2. Installing Kafka . . . . . 19
    Environment Setup 19
    Choosing an Operating System 19
    Installing Java 19
    Installing ZooKeeper 20
    Installing a Kafka Broker 23
    Configuring the Broker 24
    General Broker Parameters 25
    Topic Defaults 27
    Selecting Hardware 33
    Disk Throughput 33
    Disk Capacity 34
    Memory 34
    Networking 35
    CPU 35
    Kafka in the Cloud 35
    Microsoft Azure 36
    Amazon Web Services 36
    Configuring Kafka Clusters 36
    How Many Brokers? 37
    Broker Configuration 38
    OS Tuning 38
    Production Concerns 42
    Garbage Collector Options 42
    Datacenter Layout 43
    Colocating Applications on ZooKeeper 44
    Summary 46

3. Kafka Producers: Writing Messages to Kafka . . . . . 47
    Producer Overview 48
    Constructing a Kafka Producer 50
    Sending a Message to Kafka 52
    Sending a Message Synchronously 52
    Sending a Message Asynchronously 53
📄 Page 9
    Configuring Producers 54
    client.id 55
    acks 55
    Message Delivery Time 56
    linger.ms 59
    buffer.memory 59
    compression.type 59
    batch.size 59
    max.in.flight.requests.per.connection 60
    max.request.size 60
    receive.buffer.bytes and send.buffer.bytes 61
    enable.idempotence 61
    Serializers 61
    Custom Serializers 62
    Serializing Using Apache Avro 64
    Using Avro Records with Kafka 65
    Partitions 68
    Headers 71
    Interceptors 71
    Quotas and Throttling 73
    Summary 75

4. Kafka Consumers: Reading Data from Kafka . . . . . 77
    Kafka Consumer Concepts 77
    Consumers and Consumer Groups 77
    Consumer Groups and Partition Rebalance 80
    Static Group Membership 83
    Creating a Kafka Consumer 84
    Subscribing to Topics 85
    The Poll Loop 86
    Thread Safety 87
    Configuring Consumers 88
    fetch.min.bytes 88
    fetch.max.wait.ms 88
    fetch.max.bytes 89
    max.poll.records 89
    max.partition.fetch.bytes 89
    session.timeout.ms and heartbeat.interval.ms 89
    max.poll.interval.ms 90
    default.api.timeout.ms 90
    request.timeout.ms 90
📄 Page 10
    auto.offset.reset 91
    enable.auto.commit 91
    partition.assignment.strategy 91
    client.id 93
    client.rack 93
    group.instance.id 93
    receive.buffer.bytes and send.buffer.bytes 93
    offsets.retention.minutes 93
    Commits and Offsets 94
    Automatic Commit 95
    Commit Current Offset 96
    Asynchronous Commit 97
    Combining Synchronous and Asynchronous Commits 99
    Committing a Specified Offset 100
    Rebalance Listeners 101
    Consuming Records with Specific Offsets 104
    But How Do We Exit? 105
    Deserializers 106
    Custom Deserializers 107
    Using Avro Deserialization with Kafka Consumer 109
    Standalone Consumer: Why and How to Use a Consumer Without a Group 110
    Summary 111

5. Managing Apache Kafka Programmatically . . . . . 113
    AdminClient Overview 114
    Asynchronous and Eventually Consistent API 114
    Options 114
    Flat Hierarchy 115
    Additional Notes 115
    AdminClient Lifecycle: Creating, Configuring, and Closing 115
    client.dns.lookup 116
    request.timeout.ms 117
    Essential Topic Management 118
    Configuration Management 121
    Consumer Group Management 123
    Exploring Consumer Groups 123
    Modifying Consumer Groups 125
    Cluster Metadata 127
    Advanced Admin Operations 127
    Adding Partitions to a Topic 127
    Deleting Records from a Topic 128
📄 Page 11
    Leader Election 128
    Reassigning Replicas 129
    Testing 131
    Summary 133

6. Kafka Internals . . . . . 135
    Cluster Membership 135
    The Controller 136
    KRaft: Kafka’s New Raft-Based Controller 137
    Replication 139
    Request Processing 142
    Produce Requests 144
    Fetch Requests 145
    Other Requests 147
    Physical Storage 149
    Tiered Storage 149
    Partition Allocation 151
    File Management 152
    File Format 153
    Indexes 155
    Compaction 156
    How Compaction Works 156
    Deleted Events 158
    When Are Topics Compacted? 159
    Summary 159

7. Reliable Data Delivery . . . . . 161
    Reliability Guarantees 162
    Replication 163
    Broker Configuration 164
    Replication Factor 165
    Unclean Leader Election 166
    Minimum In-Sync Replicas 167
    Keeping Replicas In Sync 168
    Persisting to Disk 169
    Using Producers in a Reliable System 169
    Send Acknowledgments 170
    Configuring Producer Retries 171
    Additional Error Handling 171
    Using Consumers in a Reliable System 172
    Important Consumer Configuration Properties for Reliable Processing 173
📄 Page 12
    Explicitly Committing Offsets in Consumers 174
    Validating System Reliability 176
    Validating Configuration 176
    Validating Applications 177
    Monitoring Reliability in Production 178
    Summary 180

8. Exactly-Once Semantics . . . . . 181
    Idempotent Producer 182
    How Does the Idempotent Producer Work? 182
    Limitations of the Idempotent Producer 184
    How Do I Use the Kafka Idempotent Producer? 185
    Transactions 186
    Transactions Use Cases 187
    What Problems Do Transactions Solve? 187
    How Do Transactions Guarantee Exactly-Once? 188
    What Problems Aren’t Solved by Transactions? 191
    How Do I Use Transactions? 193
    Transactional IDs and Fencing 196
    How Transactions Work 198
    Performance of Transactions 200
    Summary 201

9. Building Data Pipelines . . . . . 203
    Considerations When Building Data Pipelines 204
    Timeliness 204
    Reliability 205
    High and Varying Throughput 205
    Data Formats 206
    Transformations 207
    Security 208
    Failure Handling 209
    Coupling and Agility 209
    When to Use Kafka Connect Versus Producer and Consumer 210
    Kafka Connect 211
    Running Kafka Connect 211
    Connector Example: File Source and File Sink 214
    Connector Example: MySQL to Elasticsearch 216
    Single Message Transformations 223
    A Deeper Look at Kafka Connect 225
    Alternatives to Kafka Connect 229
📄 Page 13
    Ingest Frameworks for Other Datastores 229
    GUI-Based ETL Tools 229
    Stream Processing Frameworks 230
    Summary 230

10. Cross-Cluster Data Mirroring . . . . . 233
    Use Cases of Cross-Cluster Mirroring 234
    Multicluster Architectures 235
    Some Realities of Cross-Datacenter Communication 235
    Hub-and-Spoke Architecture 236
    Active-Active Architecture 238
    Active-Standby Architecture 240
    Stretch Clusters 246
    Apache Kafka’s MirrorMaker 247
    Configuring MirrorMaker 249
    Multicluster Replication Topology 251
    Securing MirrorMaker 252
    Deploying MirrorMaker in Production 253
    Tuning MirrorMaker 257
    Other Cross-Cluster Mirroring Solutions 259
    Uber uReplicator 259
    LinkedIn Brooklin 260
    Confluent Cross-Datacenter Mirroring Solutions 261
    Summary 263

11. Securing Kafka . . . . . 265
    Locking Down Kafka 265
    Security Protocols 268
    Authentication 269
    SSL 270
    SASL 275
    Reauthentication 286
    Security Updates Without Downtime 288
    Encryption 289
    End-to-End Encryption 289
    Authorization 291
    AclAuthorizer 292
    Customizing Authorization 295
    Security Considerations 297
    Auditing 298
    Securing ZooKeeper 299
📄 Page 14
    SASL 299
    SSL 300
    Authorization 301
    Securing the Platform 301
    Password Protection 301
    Summary 303

12. Administering Kafka . . . . . 305
    Topic Operations 305
    Creating a New Topic 306
    Listing All Topics in a Cluster 308
    Describing Topic Details 308
    Adding Partitions 310
    Reducing Partitions 311
    Deleting a Topic 311
    Consumer Groups 312
    List and Describe Groups 312
    Delete Group 313
    Offset Management 314
    Dynamic Configuration Changes 315
    Overriding Topic Configuration Defaults 315
    Overriding Client and User Configuration Defaults 317
    Overriding Broker Configuration Defaults 318
    Describing Configuration Overrides 319
    Removing Configuration Overrides 319
    Producing and Consuming 320
    Console Producer 320
    Console Consumer 322
    Partition Management 326
    Preferred Replica Election 326
    Changing a Partition’s Replicas 327
    Dumping Log Segments 332
    Replica Verification 334
    Other Tools 334
    Unsafe Operations 335
    Moving the Cluster Controller 335
    Removing Topics to Be Deleted 336
    Deleting Topics Manually 336
    Summary 337
📄 Page 15
13. Monitoring Kafka . . . . . 339
    Metric Basics 339
    Where Are the Metrics? 339
    What Metrics Do I Need? 341
    Application Health Checks 343
    Service-Level Objectives 343
    Service-Level Definitions 343
    What Metrics Make Good SLIs? 344
    Using SLOs in Alerting 345
    Kafka Broker Metrics 346
    Diagnosing Cluster Problems 347
    The Art of Under-Replicated Partitions 348
    Broker Metrics 354
    Topic and Partition Metrics 364
    JVM Monitoring 366
    OS Monitoring 367
    Logging 369
    Client Monitoring 370
    Producer Metrics 370
    Consumer Metrics 373
    Quotas 376
    Lag Monitoring 377
    End-to-End Monitoring 378
    Summary 378

14. Stream Processing . . . . . 381
    What Is Stream Processing? 382
    Stream Processing Concepts 385
    Topology 385
    Time 386
    State 388
    Stream-Table Duality 389
    Time Windows 390
    Processing Guarantees 392
    Stream Processing Design Patterns 392
    Single-Event Processing 392
    Processing with Local State 393
    Multiphase Processing/Repartitioning 395
    Processing with External Lookup: Stream-Table Join 396
    Table-Table Join 398
    Streaming Join 398
📄 Page 16
    Out-of-Sequence Events 399
    Reprocessing 400
    Interactive Queries 401
    Kafka Streams by Example 402
    Word Count 402
    Stock Market Statistics 405
    ClickStream Enrichment 408
    Kafka Streams: Architecture Overview 410
    Building a Topology 410
    Optimizing a Topology 411
    Testing a Topology 411
    Scaling a Topology 412
    Surviving Failures 415
    Stream Processing Use Cases 416
    How to Choose a Stream Processing Framework 417
    Summary 419

A. Installing Kafka on Other Operating Systems . . . . . 421
B. Additional Kafka Tools . . . . . 427
Index . . . . . 433
📄 Page 17
Foreword to the Second Edition

The first edition of Kafka: The Definitive Guide was published five years ago. At the time, we estimated that Apache Kafka was used in 30% of Fortune 500 companies. Today, over 70% of Fortune 500 companies are using Apache Kafka. It is still one of the most popular open source projects in the world and is at the center of a huge ecosystem.

Why all the excitement? I think it is because there has been a huge gap in our infrastructure for data. Traditionally, data management was all about storage—the file stores and databases that keep our data safe and let us look up the right bit at the right time. Huge amounts of intellectual energy and commercial investment have been poured into these systems. But a modern company isn’t just one piece of software with one database. A modern company is an incredibly complex system built out of hundreds or even thousands of custom applications, microservices, databases, SaaS layers, and analytics platforms. And increasingly, the problem we face is how to connect all this up into one company and make it all work together in real time. This problem isn’t about managing data at rest—it is about managing data in motion. And right at the heart of that movement is Apache Kafka, which has become the de facto foundation to any platform for data in motion.

Through this journey, Kafka hasn’t remained static. What started as a bare-bones commit log has evolved as well: adding connectors and stream processing capabilities, and reinventing its own architecture along the way. The community not only evolved existing APIs, configuration options, metrics, and tools to improve Kafka’s usability and reliability, but we’ve also introduced a new programmatic administration API, the next generation of global replication and DR with MirrorMaker 2.0, a new Raft-based consensus protocol that allows for running Kafka in a single executable, and true elasticity with tiered storage support. Perhaps most importantly, we’ve made Kafka a no-brainer in critical enterprise use cases by adding support for advanced security options—authentication, authorization, and encryption.
📄 Page 18
As Kafka evolves, we see the use cases evolve as well. When the first edition was published, most Kafka installations were still in traditional on-prem data centers using traditional deployment scripts. The most popular use cases were ETL and messaging; stream processing use cases were still taking their first steps. Five years later, most Kafka installations are in the cloud, and many are running on Kubernetes. ETL and messaging are still popular, but they are joined by event-driven microservices, real-time stream processing, IoT, machine learning pipelines, and hundreds of industry-specific use cases and patterns that range from claims processing in insurance companies to trading systems in banks to helping power real-time game play and personalization in video games and streaming services.

Even as Kafka expands to new environments and use cases, writing applications that use Kafka well and deploying it confidently in production requires acclimating to Kafka’s unique way of thinking. This book covers everything developers and SREs need to use Kafka to its full potential, from the most basic APIs and configuration to the latest and most cutting-edge capabilities. It covers not just what you can do with Kafka and how to do it, but also what not to do and antipatterns to avoid. This book can be a trusted guide to the world of Kafka for both new users and experienced practitioners.

—Jay Kreps
Cofounder and CEO at Confluent
📄 Page 19
Foreword to the First Edition

It’s an exciting time for Apache Kafka. Kafka is being used by tens of thousands of organizations, including over a third of the Fortune 500 companies. It’s among the fastest-growing open source projects and has spawned an immense ecosystem around it. It’s at the heart of a movement toward managing and processing streams of data.

So where did Kafka come from? Why did we build it? And what exactly is it?

Kafka got its start as an internal infrastructure system we built at LinkedIn. Our observation was really simple: there were lots of databases and other systems built to store data, but what was missing in our architecture was something that would help us to handle the continuous flow of data. Prior to building Kafka, we experimented with all kinds of off-the-shelf options, from messaging systems to log aggregation and ETL tools, but none of them gave us what we wanted.

We eventually decided to build something from scratch. Our idea was that instead of focusing on holding piles of data like our relational databases, key-value stores, search indexes, or caches, we would focus on treating data as a continually evolving and ever-growing stream and build a data system—and indeed a data architecture—oriented around that idea.

This idea turned out to be even more broadly applicable than we expected. Though Kafka got its start powering real-time applications and data flow behind the scenes of a social network, you can now see it at the heart of next-generation architectures in every industry imaginable. Big retailers are reworking their fundamental business processes around continuous data streams, car companies are collecting and processing real-time data streams from internet-connected cars, and banks are rethinking their fundamental processes and systems around Kafka as well.

So what is this Kafka thing all about? How does it compare to the systems you already know and use?
📄 Page 20
We’ve come to think of Kafka as a streaming platform: a system that lets you publish and subscribe to streams of data, store them, and process them, and that is exactly what Apache Kafka is built to be. Getting used to this way of thinking about data might be a little different than what you’re used to, but it turns out to be an incredibly powerful abstraction for building applications and architectures. Kafka is often compared to a couple of existing technology categories: enterprise messaging systems, big data systems like Hadoop, and data integration or ETL tools. Each of these comparisons has some validity but also falls a little short.

Kafka is like a messaging system in that it lets you publish and subscribe to streams of messages. In this way, it is similar to products like ActiveMQ, RabbitMQ, IBM’s MQSeries, and other products. But even with these similarities, Kafka has a number of core differences from traditional messaging systems that make it another kind of animal entirely. Here are the big three differences: first, it works as a modern distributed system that runs as a cluster and can scale to handle all the applications in even the most massive of companies. Rather than running dozens of individual messaging brokers, hand wired to different apps, this lets you have a central platform that can scale elastically to handle all the streams of data in a company. Second, Kafka is a true storage system built to store data for as long as you might like. This has huge advantages in using it as a connecting layer as it provides real delivery guarantees—its data is replicated, persistent, and can be kept around as long as you like. Finally, the world of stream processing raises the level of abstraction quite significantly. Messaging systems mostly just hand out messages. The stream processing capabilities in Kafka let you compute derived streams and datasets dynamically off of your streams with far less code. These differences make Kafka enough of its own thing that it doesn’t really make sense to think of it as “yet another queue.”

Another view on Kafka—and one of our motivating lenses in designing and building it—was to think of it as a kind of real-time version of Hadoop. Hadoop lets you store and periodically process file data at a very large scale. Kafka lets you store and continuously process streams of data, also at a large scale. At a technical level, there are definitely similarities, and many people see the emerging area of stream processing as a superset of the kind of batch processing people have done with Hadoop and its various processing layers. What this comparison misses is that the use cases that continuous, low-latency processing opens up are quite different from those that naturally fall on a batch processing system. Whereas Hadoop and big data targeted analytics applications, often in the data warehousing space, the low-latency nature of Kafka makes it applicable for the kind of core applications that directly power a business. This makes sense: events in a business are happening all the time, and the ability to react to them as they occur makes it much easier to build services that directly power the operation of the business, feed back into customer experiences, and so on.
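To make the “far less code” point concrete, here is a minimal sketch (not an example from the book) of a derived-stream computation written against the Kafka Streams API: it subscribes to a stream of sentences and continuously maintains a word-count table from it. The topic names (“sentences”, “word-counts”), the application ID, and the localhost:9092 broker address are illustrative assumptions.

    // A hedged sketch, not the book's own example: continuously count words
    // from an input stream using Kafka Streams. Topic names and broker
    // address are assumptions for illustration.
    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class WordCountSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Subscribe to an input stream of sentences.
            KStream<String, String> lines = builder.stream("sentences");
            // Continuously derive a word-count table from the stream.
            KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();
            // Publish the changelog of the derived table back to Kafka.
            counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

Chapter 14’s Word Count example develops this same idea in full; the point here is only how compact a continuously updated aggregation can be compared to hand-rolled consumer logic.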