Author: William P. Bejeck Jr.
Manning · Bill Bejeck · Foreword by Jun Rao · Second Edition · Event-driven applications and microservices
The Kafka Streams Yelling application topology: the source topic (SRC-TOPIC) and sink topic (OUT-TOPIC) live on the Kafka brokers. The source processor forwards the records it consumes from SRC-TOPIC into the UpperCase processor. The UpperCase processor creates an upper-cased version of each original record value and forwards the results to the sink processor. The sink processor produces the records back to the specified Kafka topic, OUT-TOPIC.
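The record flow described above can be sketched in plain Java; the book builds the real version with the Kafka Streams DSL in chapter 6, and this stand-in uses in-memory lists (with hypothetical sample records) in place of actual Kafka topics, purely to illustrate the three-processor shape of the topology.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Plain-Java sketch of the Yelling topology's three processors:
// source -> UpperCase -> sink. In the Kafka Streams DSL (chapter 6)
// this is roughly:
//   builder.stream("src-topic").mapValues(v -> v.toUpperCase()).to("out-topic");
public class YellingTopologySketch {
    public static void main(String[] args) {
        // Source processor: records consumed from SRC-TOPIC (stand-in data here)
        List<String> srcTopic = List.of("hello kafka", "streams in action");

        // UpperCase processor: creates an upper-cased version of each record value
        UnaryOperator<String> upperCase = String::toUpperCase;

        // Sink processor: collects records destined for OUT-TOPIC
        List<String> outTopic = new ArrayList<>();
        for (String recordValue : srcTopic) {
            outTopic.add(upperCase.apply(recordValue));
        }
        System.out.println(outTopic); // [HELLO KAFKA, STREAMS IN ACTION]
    }
}
```

Each stage only sees the output of the stage before it, which is the essential property of the topology: processors are composed, not aware of the topics at either end.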
Praise for the first edition

“A great way to learn about Kafka Streams and how it is a key enabler of event-driven applications.”
—Neha Narkhede, co-creator of Apache Kafka

“A comprehensive guide to Kafka Streams—from introduction to production!”
—Bojan Djurkovic, Cvent

“Bridges the gap between message brokering and real-time streaming analytics.”
—Jim Mantheiy, Jr., Next Century

“Valuable both as an introduction to streams as well as an ongoing reference.”
—Robin Coe, TD Bank

“Stream processing can be daunting. Kafka Streams in Action will teach you to use the technology wisely.”
—Jose San Leandro, software engineer/DevOps, OSOCO.es

“The book presents great insights on the problem and explains the concepts clearly.”
—Michele Adduci, software engineer at OpenLimit SignCubes GmbH

“An excellent source of information on Kafka Streams, with real-world applications.”
—László Hegedüs, PhD, software developer, Erlang Solutions
Kafka Streams in Action, Second Edition
Event-Driven Applications and Microservices
Bill Bejeck
Foreword by Jun Rao
Manning, Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964. Email: orders@manning.com

©2024 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The authors and publisher have made every effort to ensure that the information in this book was correct at press time. The authors and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964

Development editor: Frances Lefkowitz
Technical development editor: John Guthrie
Review editors: Aleksandar Dragosavljević and Dunja Nikitović
Production editor: Keri Hales
Copy editor: Alisa Larson
Proofreader: Jason Everett
Technical proofreader: Karsten Strøbæk
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617298684
Printed in the United States of America
brief contents

Part 1
 1 Welcome to the Kafka event streaming platform
 2 Kafka brokers

Part 2
 3 Schema Registry
 4 Kafka clients
 5 Kafka Connect

Part 3
 6 Developing Kafka Streams
 7 Streams and state
 8 The KTable API
 9 Windowing and timestamps
10 The Processor API
11 ksqlDB
12 Spring Kafka
13 Kafka Streams Interactive Queries
14 Testing
contents

foreword
preface
acknowledgments
about this book
about the author
about the cover illustration

Part 1

1 Welcome to the Kafka event streaming platform
  1.1 Event streaming
  1.2 What is an event?
  1.3 An event stream example
  1.4 Introducing the Apache Kafka event streaming platform
      Kafka brokers ■ Schema Registry ■ Producer and consumer clients ■ Kafka Connect ■ Kafka Streams ■ ksqlDB
  1.5 A concrete example of applying the Kafka event streaming platform
2 Kafka brokers
  2.1 Introducing Kafka brokers
  2.2 Produce requests
  2.3 Fetch requests
  2.4 Topics and partitions
      Offsets ■ Determining the correct number of partitions
  2.5 Sending your first messages
      Creating a topic ■ Producing records on the command line ■ Consuming records from the command line ■ Partitions in action
  2.6 Segments
      Data retention ■ Compacted topics ■ Topic partition directory contents
  2.7 Tiered storage
  2.8 Cluster metadata
  2.9 Leaders and followers
      Replication
  2.10 Checking for a healthy broker
      Request handler idle percentage ■ Network handler idle percentage ■ Underreplicated partitions

Part 2

3 Schema Registry
  3.1 Objects
  3.2 What is a schema, and why do you need one?
      What is Schema Registry? ■ Getting Schema Registry ■ Architecture ■ Communication: Using Schema Registry’s REST API ■ Registering a schema ■ Plugins and serialization platform tools ■ Uploading a schema file ■ Generating code from schemas ■ End-to-end example
  3.3 Subject name strategies
      TopicNameStrategy ■ RecordNameStrategy ■ TopicRecordNameStrategy
  3.4 Schema compatibility
      Backward compatibility ■ Forward compatibility ■ Full compatibility ■ No compatibility
  3.5 Schema references
  3.6 Schema references and multiple events per topic
  3.7 Schema Registry (de)serializers
      Avro serializers and deserializers ■ Protobuf ■ JSON Schema
  3.8 Serialization without Schema Registry

4 Kafka clients
  4.1 Introducing Kafka clients
  4.2 Producing records with the KafkaProducer
      Producer configurations ■ Kafka delivery semantics ■ Partition assignment ■ Writing a custom partitioner ■ Specifying a custom partitioner ■ Timestamps
  4.3 Consuming records with the KafkaConsumer
      The poll interval ■ The group id configuration ■ Applying partition assignment strategies ■ Static membership ■ Committing offsets
  4.4 Exactly-once delivery in Kafka
      The idempotent producer ■ Transactional producer ■ Consumers in transactions ■ Producers and consumers within a transaction
  4.5 Using the Admin API for programmatic topic management
  4.6 Handling multiple event types in a single topic
      Producing multiple event types ■ Consuming multiple event types

5 Kafka Connect
  5.1 An introduction to Kafka Connect
  5.2 Integrating external applications into Kafka
  5.3 Getting started with Kafka Connect
  5.4 Applying Single Message Transforms
  5.5 Adding a sink connector
  5.6 Building and deploying your own connector
      Implementing a connector ■ Making your connector dynamic with a monitoring thread ■ Creating a custom transformation
Part 3

6 Developing Kafka Streams
  6.1 A look at Kafka Streams
  6.2 Kafka Streams DSL
  6.3 Hello World for Kafka Streams
      Creating the topology for the Yelling app ■ Kafka Streams configuration ■ Serde creation
  6.4 Masking credit card numbers and tracking purchase rewards in a retail sales setting
      Building the source node and the masking processor ■ Adding the purchase-patterns processor ■ Building the rewards processor ■ Using Serdes to encapsulate serializers and deserializers in Kafka Streams ■ Kafka Streams and Schema Registry
  6.5 Interactive development
  6.6 Choosing which events to process
      Filtering purchases ■ Splitting/branching the stream ■ Naming topology nodes ■ Dynamic routing of messages

7 Streams and state
  7.1 Stateful vs. stateless
  7.2 Adding stateful operations to Kafka Streams
      Group-by details ■ Aggregation vs. reducing ■ Repartitioning the data ■ Proactive repartitioning ■ Repartitioning to increase the number of tasks ■ Using Kafka Streams optimizations
  7.3 Stream-stream joins
      Implementing a stream-stream join ■ Join internals ■ ValueJoiner ■ JoinWindows ■ Co-partitioning ■ StreamJoined ■ Other join options ■ Outer joins ■ Left-outer join
  7.4 State stores in Kafka Streams
      Changelog topics restoring state stores ■ Standby tasks ■ Assigning state stores in Kafka Streams ■ State stores’ location on the filesystem ■ Naming stateful operations ■ Specifying a store type ■ Configuring changelog topics
8 The KTable API
  8.1 KTable: The update stream
      Updates to records or the changelog ■ KStream and KTable API in action
  8.2 KTables are stateful
  8.3 The KTable API
  8.4 KTable aggregations
  8.5 GlobalKTable
  8.6 Table joins
      Stream–table join details ■ Versioned KTables ■ Stream–global table join details ■ Table–table join details

9 Windowing and timestamps
  9.1 Understanding the role of windows and the different types
      Hopping windows ■ Tumbling windows ■ Session windows ■ Sliding windows ■ Window time alignment ■ Retrieving window results for analysis
  9.2 Handling out-of-order data with grace—literally
  9.3 Final windowed results
      Strict buffering ■ Eager buffering
  9.4 Timestamps in Kafka Streams
  9.5 The TimestampExtractor
      WallclockTimestampExtractor: the System.currentTimeMillis() method ■ Custom TimestampExtractor ■ Specifying a TimestampExtractor
  9.6 Stream time

10 The Processor API
  10.1 Working with sources, processors, and sinks to create a topology
      Adding a source node ■ Adding a processor node ■ Adding a sink node
  10.2 Digging deeper into the Processor API with a stock analysis processor
      The stock-performance processor application ■ Punctuation semantics ■ The process() method ■ The punctuator execution
  10.3 Data-driven aggregation
  10.4 Integrating the Processor API and the Kafka Streams API

11 ksqlDB
  11.1 Understanding ksqlDB
  11.2 More about streaming queries
  11.3 Persistent vs. push vs. pull queries
  11.4 Creating Streams and Tables
  11.5 Schema Registry integration
  11.6 ksqlDB advanced features

12 Spring Kafka
  12.1 Introducing Spring
  12.2 Using Spring to build Kafka-enabled applications
      Spring Kafka application components ■ Enhanced application requirements
  12.3 Spring Kafka Streams

13 Kafka Streams Interactive Queries
  13.1 Kafka Streams and information sharing
  13.2 Learning about Interactive Queries
      Building an Interactive Queries app with Spring Boot

14 Testing
  14.1 Understanding the difference between unit and integration testing
      Testing Kafka producers and consumers ■ Creating tests for Kafka Streams operators ■ Writing tests for a Kafka Streams topology ■ Testing more complex Kafka Streams applications ■ Developing effective integration tests

appendix A Schema compatibility workshop
appendix B Confluent resources
appendix C Working with Avro, Protobuf, and JSON Schema
appendix D Understanding Kafka Streams architecture
index
foreword

When a business event occurs, traditional data-at-rest systems record it but leave the use of the data to a much later time. In contrast, Apache Kafka is a data streaming platform designed to react to business events in real time. Over the past decade, Apache Kafka has become the standard for data streaming. Hundreds of thousands of organizations, including most of the largest enterprises in the world, are using Kafka to take action on what’s happening to their business. Those actions allow them to enhance customer experience, gain new business insights, improve efficiency, reduce risks, and so on, all within a few short seconds.

Applications built on top of Kafka are event driven. They typically take one or more data streams as the input and continuously transform these data into a new stream. The transformation often includes streaming operations such as filtering, projection, joining, and aggregation. Expressing those operations in low-level Java code is inefficient and error prone. Kafka Streams provides a handful of high-level abstractions for developers to express those common streaming operations concisely and is a very powerful tool for building event-driven applications in Java.

Bill has been a long-time contributor to Kafka and is an expert in Kafka Streams. What’s unique about Bill is that not only does he understand the technology behind Kafka Streams, but he also knows how to use Kafka Streams to solve real problems. In this book, you will hear the key concepts in Kafka Streams directly from Bill. You will also see many hands-on examples of how to build end-to-end event-driven applications using Kafka Streams, together with other Kafka APIs and Schema Registry. If you are a developer wanting to learn how to build the next-gen event-driven applications on Kafka, you’ll find this book invaluable.

Enjoy the book and the power of data streaming!

—Jun Rao, co-founder, Confluent and Apache Kafka
preface

After completing the first edition of Kafka Streams in Action, I thought I had accomplished everything I set out to do. But as time went on, my understanding of the Kafka ecosystem and my appreciation for Kafka Streams grew. I saw that Kafka Streams was more powerful than I had initially thought. Additionally, I noticed other important pieces in building event-streaming applications; Kafka Streams is still a key player but not the only requirement. I realized that Apache Kafka could be considered the central nervous system for an organization’s data. If Kafka is the central nervous system, then Kafka Streams is a vital organ performing some necessary operations. But Kafka Streams relies on other components to bring events into Kafka or export them to the outside world where its results and calculations can be put to good use. I’m talking about the producer and consumer clients and Kafka Connect. As I put the pieces together, I realized you need these other components to complete the event-streaming picture. Couple all this with some significant improvements to Kafka Streams since 2018, and I knew I wanted to write a second edition.

But I didn’t just want to add cosmetic touches to the previous edition; I wanted to express my improved understanding and add complete coverage of the entire Kafka ecosystem. This meant expanding the scope of some subjects from sections of chapters to whole chapters (like the producer and consumer clients), or adding entirely new chapters (such as the new chapters on Connect and Schema Registry). For the existing Kafka Streams chapters, writing a second edition meant updating and improving the existing material to clarify and communicate my deeper understanding.
Taking on the second edition with this new focus during the Covid-19 pandemic wasn’t easy, nor was it without some serious personal challenges along the way. But in the end, it was worth every minute of revision, and if I went back in time, I’d make the same decision.

I hope that new readers of Kafka Streams in Action will find the book an essential resource and that readers from the first edition will enjoy and apply the improvements as well.
acknowledgments

First, I want to thank my wife, Beth, for supporting my signing up for a second edition. Writing the first edition of a book is very time-consuming, so you’d think the second edition would be more straightforward, just making adjustments for things like API changes. But in this case, I wanted to expand on my previous work and decided to do an entire rewrite. Beth never questioned my decision and fully supported my new direction, and as before, I couldn’t have completed this without her support. Beth, you are fantastic, and I’m very grateful to have you as my wife. I’d also like to thank my three children for having great attitudes and supporting me in doing a second edition.

Next, I thank my editor at Manning, Frances Lefkowitz, whose continued expert guidance and patience made the writing process fun this time. I also thank John Guthrie for his excellent, precise technical feedback and Karsten Strøbæk, the technical proofreader, for superb work reviewing the code. Many hands at Manning contributed to the production of this edition; thanks to all of them.

I’d also like to thank the Kafka Streams developers and community for being so engaging and brilliant in making Kafka Streams the best stream processing library available. I want to acknowledge all the Kafka developers for building such high-quality software, especially Jay Kreps, Neha Narkhede, and Jun Rao, not only for starting Kafka in the first place but also for creating such a great place to work in Confluent. In addition, another special thanks to Jun for writing the foreword of this edition.

Last but certainly not least, I thank the reviewers for their hard work and invaluable feedback in making the quality of this book better for all readers: Alain Couniot, Allen Gooch, Andres Sacco, Balakrishnan Balasubramanian, Christian Thoudahl, Daniela Zapata, David Ong, Ezra Simeloff, Giampiero Granatella, John Roesler, Jose San Leandro Armendáriz, Joseph Pachod, Kent Spillner, Manzur Mukhitdinov, Michael Heil, Milorad Imbra, Miloš Milivojević, Najeeb Arif, Nathan B. Crocker, Robin Coe, Rui Liu, Sambasiva Andaluri, Scott Huang, Simon Hewitt, Simon Tschöke, Stanford S. Guillory, Tan Wee, Thomas Peklak, and Zorodzayi Mukuya.
about this book

I wrote the second edition of Kafka Streams in Action to teach you how to build event streaming applications with Kafka Streams while including other components of the Kafka ecosystem: the producer and consumer clients, Kafka Connect, and Schema Registry. I took this approach because for your event-streaming application to be as effective as possible, you’ll need not just Kafka Streams but other essential tools. My approach to writing this book is a pair-programming perspective; I imagine myself sitting next to you as you write the code and learn the API. You’ll learn about the Kafka broker and how the producer and consumer clients work. Then, you’ll see how to manage schemas, their role with Schema Registry, and how Kafka Connect bridges external components and Kafka. From there, you’ll dive into Kafka Streams, first building a simple application, then adding more complexity as you dig deeper into the Kafka Streams API. You’ll also learn about ksqlDB, testing, and, finally, integrating Kafka with the popular Spring framework.

Who should read this book

Kafka Streams in Action is for any developer wishing to get into stream processing. While not strictly required, knowledge of distributed programming will help in understanding Kafka and Kafka Streams. Knowledge of Kafka is beneficial but not required; I’ll teach you what you need to know. Experienced Kafka developers and those new to Kafka will learn how to develop compelling stream-processing applications with Kafka Streams. Intermediate-to-advanced Java developers familiar with topics like serialization will learn how to use their skills to build a Kafka Streams application. The book’s