Kafka Streams in Action, Second Edition

About this book
Acknowledgements
Preface
Part 1: Introduction
1 Welcome to the Kafka Event Streaming Platform
2 Kafka Brokers
Part 2: Getting Data into Kafka
3 Schema Registry
4 Kafka Clients
5 Kafka Connect
Part 3: Event Stream Processing Development
6 Developing Kafka Streams
7 Streams and State
8 The KTable API
9 Windowing and Timestamps
10 The Processor API
11 ksqlDB
12 Spring Kafka
13 Kafka Streams Interactive Queries
14 Testing
Appendix A. Schema Compatibility Workshop
Appendix B. Working with Avro, Protobuf, and JSON Schema
Appendix C. Understanding Kafka Streams Architecture
Appendix D. Confluent Resources
Index
About this book

I wrote the second edition of Kafka Streams in Action to teach you how to build event-streaming applications with Kafka Streams, along with other components of the Kafka ecosystem: the producer and consumer clients, Connect, and Schema Registry. I took this approach because, for your event-streaming application to be as effective as possible, you’ll need not just Kafka Streams but these other essential tools as well. My approach to writing this book is a pair-programming perspective: I imagine myself sitting next to you as you write the code and learn the API. You’ll learn about the Kafka broker and how the producer and consumer clients work. Then, you’ll see how to manage schemas, their role with Schema Registry, and how Kafka Connect bridges external components and Kafka. From there, you’ll dive into Kafka Streams, first building a simple application, then adding more complexity as you dig deeper into the Kafka Streams API. You’ll also learn about ksqlDB, testing, and, finally, integrating Kafka with the popular Spring framework.

Who should read this book

Kafka Streams in Action is for any developer wishing to get into stream processing. While not strictly required, knowledge of distributed programming will help you understand Kafka and Kafka Streams. Knowledge of Kafka itself is beneficial but not required; I’ll teach you what you need to know. Experienced Kafka developers and those new to Kafka alike will learn how to develop compelling stream-processing applications with Kafka Streams. Intermediate-to-advanced Java developers familiar with topics like serialization will learn how to use their skills to build a Kafka Streams application. The book’s source code is written in Java 17 and extensively uses Java lambda syntax, so experience with lambdas (even from another language) will be helpful.
How this book is organized: a roadmap

This book has three parts spread over 14 chapters. While the book’s title is "Kafka Streams in Action," it covers the entire Kafka event streaming platform. As a result, the first five chapters cover the different components: Kafka brokers, consumer and producer clients, Schema Registry, and Kafka Connect. This approach makes sense, especially considering that Kafka Streams is an abstraction over the consumer and producer clients. So, if you’re already familiar with Kafka, Connect, and Schema Registry, or if you’re excited to get going with Kafka Streams, then by all means, skip directly to part 3.

Part 1 introduces event streaming and describes the different parts of the Kafka ecosystem to show you the big-picture view of how it all works and fits together. These chapters also provide the basics of the Kafka broker for those who need them or want a review:

1. Chapter 1 provides some context on what an event and event streaming are and why they’re vital for working with real-time data. It also presents the mental model of the different components we’ll cover: the broker, clients, Kafka Connect, Schema Registry, and, of course, Kafka Streams. I don’t go over any code but describe how they all work.
2. Chapter 2 is a primer for developers who are new to Kafka, and it covers the role of the broker, topics, partitions, and some monitoring. Those with more experience with Kafka can skip this chapter.

Part 2 moves on and covers getting data into and out of Kafka and managing schemas:

3. Chapter 3 covers using Schema Registry to help you manage the evolution of your data’s schemas. Spoiler alert: you’re always using a schema; if not explicitly, then it’s implicitly there.
4. Chapter 4 discusses the Kafka producer and consumer clients. The clients are how you get data into and out of Kafka, and they provide the building blocks for Kafka Connect and Kafka Streams.
5. Chapter 5 is about Kafka Connect. Kafka Connect provides the ability to get data into Kafka via source connectors and export it to external systems with sink connectors.
Part 3 gets to the book’s heart and covers developing Kafka Streams applications. In this section, you’ll also learn about ksqlDB and testing your event-streaming application, and it concludes with integrating Kafka with the Spring Framework:

6. Chapter 6 is your introduction to Kafka Streams, where you’ll build a Hello World application and, from there, a more realistic application for a fictional retailer. Along the way, you’ll learn about the Kafka Streams DSL.
7. Chapter 7 continues your Kafka Streams learning path, where we discuss application state and why it’s required for streaming applications. Among the things you’ll learn about in this chapter are aggregating data and joins.
8. Chapter 8 teaches you about the KTable API. Whereas a KStream is a stream of events, a KTable is a stream of related events, or an update stream.
9. Chapter 9 covers windowed operations and timestamps. Windowing an aggregation allows you to bucket results by time, and the timestamps on the records drive the action.
10. Chapter 10 dives into the Kafka Streams Processor API. Up to this point, you’ve been working with the high-level DSL, but here, you’ll learn how to use the Processor API when you need more control.
11. Chapter 11 takes you further into the development stack, where you’ll learn about ksqlDB. ksqlDB allows you to write event-streaming applications without any code, using SQL instead.
12. Chapter 12 discusses using the Spring Framework with Kafka clients and Kafka Streams. Spring allows you to write more modular and testable code by providing a dependency injection framework for wiring up your applications.
13. Chapter 13 introduces you to Kafka Streams Interactive Queries, or IQ. IQ is the ability to directly query the state store of a stateful operation in Kafka Streams. You’ll use what you learned in chapter 12 to build a Spring-enabled IQ web application.
14. Chapter 14 covers the all-important topic of testing. You’ll learn how to test client applications with a Kafka Streams topology, the difference between unit testing and integration testing, and when to apply them.

Appendix A contains a workshop on Schema Registry to get hands-on experience with the different schema compatibility modes.
Appendix B is a survey of working with the different schema types: Avro, Protobuf, and JSON Schema.
Appendix C covers the architecture and internals of Kafka Streams.
Appendix D presents information on using Confluent Cloud to help develop your event-streaming applications.

About the code
This book contains many examples of source code, both in numbered listings and inline with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text.

In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings include line-continuation markers (➥). Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

Finally, it’s important to note that many of the code examples aren’t meant to stand on their own: they’re excerpts containing only the most relevant parts of what is currently under discussion. You’ll find all the examples from the book in the accompanying source code in their complete form.

Source code for the book’s examples is available from GitHub at https://github.com/bbejeck/KafkaStreamsInAction2ndEdition and the publisher’s website at www.manning.com/books/kafka-streams-in-action-second-edition. The source code for the book is an all-encompassing project using the build tool Gradle (https://gradle.org). You can import the project into either IntelliJ or Eclipse using the appropriate commands. Full instructions for using and navigating the source code can be found in the accompanying README.md file.

Other online resources

1. Apache Kafka documentation: https://kafka.apache.org
2. Confluent documentation: https://docs.confluent.io/current
3. Kafka Streams documentation: https://docs.confluent.io/current/streams/index.html#kafka-streams
4. ksqlDB documentation: https://ksqldb.io/
5. Spring Framework: https://spring.io/
Acknowledgements

I want to thank my wife, Beth, for supporting my signing up for a second edition. Writing the first edition of a book is very time-consuming, so you’d think the second edition would be more straightforward: just making adjustments for API changes. But in this case, I wanted more from my previous work and decided to do an entire rewrite. Beth never questioned my decision and fully supported my new direction, and as before, I couldn’t have completed this without her support. Beth, you are fantastic, and I’m very grateful to have you as my wife. I’d also like to thank my three children for having great attitudes and supporting me in doing a second edition.

Next, I thank my editor at Manning, Frances Lefkowitz, whose continued expert guidance and patience made the writing process fun this time. I also thank John Guthrie for his excellent, precise technical feedback and Karsten Strøbæk, the technical proofer, for his superb work reviewing the code. I’d also like to thank the Kafka Streams developers and community for being so engaging and brilliant in making Kafka Streams the best stream processing library available. I want to acknowledge all the Kafka developers for building such high-quality software, especially Jay Kreps, Neha Narkhede, and Jun Rao, not only for starting Kafka in the first place but also for creating such a great place to work at Confluent. Last but certainly not least, I thank the reviewers for their hard work and invaluable feedback in making the quality of this book better for all readers.
Preface

After completing the first edition of Kafka Streams in Action, I thought that I had accomplished everything I had set out to do. But as time went on, my understanding of the Kafka ecosystem and my appreciation for Kafka Streams grew. I saw that Kafka Streams was more powerful than I had initially thought. Additionally, I noticed other important pieces in building event-streaming applications; Kafka Streams is still a key player but not the only requirement.

I realized that Apache Kafka could be considered the central nervous system for an organization’s data. If Kafka is the central nervous system, then Kafka Streams is a vital organ performing some necessary operations. But Kafka Streams relies on other components to bring events into Kafka or export them to the outside world, where its results and calculations can be put to good use. I’m talking about the producer and consumer clients and Kafka Connect. As I put the pieces together, I realized you need these other components to complete the event-streaming picture.

Couple all this with some significant improvements to Kafka Streams since 2018, and I knew I wanted to write a second edition. But I didn’t just want to brush up on the previous edition; I wanted to express my improved understanding and add complete coverage of the entire Kafka ecosystem. This meant expanding the scope of some subjects from sections of chapters to whole chapters (like the producer and consumer clients), or adding entirely new chapters (such as the new chapters on Connect and Schema Registry). For the existing Kafka Streams chapters, writing a second edition meant updating and improving the existing material to clarify and communicate my deeper understanding.

Taking on the second edition with this new focus during the pandemic was not easy and not without some serious personal challenges along the way. But in the end, it was worth every minute of it, and if I were to go back in time, I would make the same decision. I hope that new readers of Kafka Streams in Action will find the book an essential resource and that readers from the first
edition will enjoy and apply the improvements as well.
PART 1: INTRODUCTION

In part 1, you’ll learn about events and event streaming in general. Event streaming is a software development approach that treats events as an application’s primary input and output. But to develop an effective event-streaming application, you’ll first need to learn what an event is (spoiler alert: it’s everything!). Then you’ll read about which use cases are good candidates for event-streaming applications and which are not.

Next, you’ll discover what a Kafka broker is, how it sits at the heart of the Kafka ecosystem, and the various jobs it performs. You’ll also learn what Schema Registry, the producer and consumer clients, Connect, and Kafka Streams are and the different roles they play. Together, these components make up the Apache Kafka event streaming platform; although this book focuses on Kafka Streams, it’s part of a larger whole that allows you to develop event-streaming applications. If this first part leaves you with more questions than answers, don’t fret; I’ll explain them all in subsequent chapters.
1 Welcome to the Kafka Event Streaming Platform

This chapter covers

- Defining event streaming and events
- Introducing the Kafka event streaming platform
- Applying the platform to a concrete example

While the constant influx of data creates more entertainment and opportunities for the consumer, increasingly, the users of this information are software systems using other software systems. Think, for example, of the fundamental interaction of watching a movie from your favorite movie streaming application. You log into the application, search for and select a film, then watch it; afterward, you may provide a rating or some indication of how much you enjoyed the movie. Even this simple interaction generates several events captured by the movie streaming service. But this information needs analysis if it’s to be of use to the business. That’s where all the other software comes into play. First, software systems consume and store all the information obtained from your interaction and the interactions of other subscribers. Then, additional software systems use that information to make recommendations to you and to give the streaming service insight into what programming to provide in the future.

Now, consider that this process occurs hundreds of thousands or even millions of times per day, and you can see the massive amount of information that businesses need to harness and that their software needs to make sense of to meet customer demands and expectations and stay competitive. Another way to think of this process is that everything modern-day consumers do, from streaming a movie online to purchasing a pair of shoes at a brick-and-mortar store, generates an event. For an organization to survive and excel in our digital economy, it must have an efficient way of capturing
and acting on these events. In other words, businesses must find ways to keep up with the demand of this endless flow of events if they want to satisfy customers and maintain a robust bottom line. Developers call this constant flow an event stream. And, increasingly, they are meeting the demands of this endless digital activity with an event-streaming platform, which utilizes a series of event-streaming applications.

An event-streaming platform is analogous to our central nervous system, which processes millions of events (nerve signals) and, in response, sends out messages to the appropriate parts of the body. Our conscious thoughts and actions generate some of these responses. When we are hungry and open the refrigerator, the central nervous system gets the message and sends out another one, telling the arm to reach for a nice red apple on the first shelf. Other actions, such as your heart rate increasing in anticipation of exciting news, are handled unconsciously.

An event-streaming platform captures events generated from mobile devices, customer interaction with websites, online activity, shipment tracking, and other business transactions. But the platform, like the nervous system, does more than capture events. It also needs a mechanism to reliably transfer and store the information from those events in the order in which they occurred. Then, other applications can process or analyze the events to extract different bits of that information.

Processing the event stream in real time is essential for making time-sensitive decisions. For example: Does this purchase from customer X seem suspicious? Are the signals from this temperature sensor indicating something has gone wrong in a manufacturing process? Has the routing information been sent to the appropriate department of a business?

But the value of an event-streaming platform goes beyond gaining immediate information. Providing durable storage allows us to go back and look at event-stream data in its raw form, perform some manipulation of the data for more insight, or replay a sequence of events to try to understand what led to a particular outcome. For example, an e-commerce site offers a fantastic deal on several products on the weekend after a big holiday. The response to the sale is so strong that it crashes a few servers and brings the business down for a few minutes. By replaying all customer events, engineers can better
understand what caused the breakdown and how to fix the system so it can handle a large, sudden influx of activity.

So, where do you need event-streaming applications? Since everything in life can be considered an event, any problem domain will benefit from processing event streams. But there are some areas where it’s more important to do so. Here are some typical examples:

- Credit card fraud: A credit card owner may be unaware of unauthorized use. By reviewing purchases as they happen against established patterns (location, general spending habits), you may be able to detect a stolen credit card and alert the owner.
- Intrusion detection: The ability to monitor aberrant behavior in real time is critical for the protection of sensitive data and the well-being of an organization.
- The Internet of Things: With IoT, sensors are located in all kinds of places and send back data frequently. The ability to quickly capture and process this data meaningfully is essential; anything less diminishes the effect of deploying these sensors.
- The financial industry: The ability to track market prices and direction in real time is essential for brokers and consumers to make effective decisions about when to sell or buy.
- Sharing data in real time: Large organizations, like corporations or conglomerates, that have many applications need to share data in a standard, accurate, and real-time way.

Bottom line: if the event stream provides essential and actionable information, businesses and organizations need event-driven applications to capitalize on it. But streaming applications aren’t a fit for every situation. Event-streaming applications become necessary when you have data in different places or a large volume of events requiring distributed data stores. So, if you can manage with a single database instance, streaming is unnecessary. For example, a small e-commerce business or a local government website with primarily static data aren’t good candidates for building an event-streaming solution.
In this book, you’ll learn about event-stream development: when and why it’s essential, and how to use the Kafka event-streaming platform to build robust and responsive applications. You’ll learn how to use the platform’s various components to capture events and make them available for other applications. We’ll cover everything from simple actions, such as writing (producing) or reading (consuming) events, to advanced stateful applications requiring complex transformations, so you can solve the appropriate business challenges with an event-streaming approach.

This book is suitable for any developer looking to get into building event-streaming applications. Although the title, "Kafka Streams in Action," focuses on Kafka Streams, this book teaches the entire Kafka event-streaming platform, end to end. That platform includes crucial components, such as producers, consumers, and schemas, that you must work with before building your streaming apps, and which you’ll learn about in part 1. As a result, we don’t get into Kafka Streams itself until later in the book, in chapter 6. But the enhanced coverage is worth it: Kafka Streams is an abstraction built on top of components of the Kafka event streaming platform, so understanding them gives you a better grasp of how you can use Kafka Streams.

1.1 What is an event?

So we’ve defined an event stream, but what is an event? We’ll define an event simply as "something that happens"[1]. While the term event probably brings to mind something notable, like the birth of a child, a wedding, or a sporting event, we’re going to focus on smaller, more constant events, like a customer making a purchase (online or in person), clicking a link on a web page, or a sensor transmitting data. Either people or machines can generate events. It’s the sequence of events and their constant flow that make up an event stream. Events conceptually contain three main components:

1. Key: an identifier for the event
2. Value: the event itself
3. Timestamp: when the event occurred
Let’s discuss each of these parts of an event in more detail. The key is an identifier for the event and, as we’ll learn in later chapters, it plays a role in routing and grouping events. For an online purchase, the customer ID is an excellent example of a key. The value is the event payload itself. The event value could be a trigger, such as a sensor activating when someone opens a door, or the result of some action, like the item purchased in an online sale. Finally, the timestamp is the date and time recording when the event occurred. As we go through the various chapters in this book, we’ll encounter all three components of this "event trinity" regularly.

I’ve used a lot of different terms in this introduction, so let’s wrap this section up with a table of definitions:

Event: Something that occurs, along with recorded attributes about it
Event stream: A series of events captured in real time from sources such as mobile or IoT devices
Event streaming platform: Software to handle event streams, capable of producing, consuming, processing, and storing them
Apache Kafka: The premier event streaming platform; it provides all the components of an event streaming platform in one battle-tested solution
Kafka Streams: The native event stream processing library for Kafka
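To make the event trinity concrete, here’s a minimal sketch in Java (the language of the book’s source code) of how you might model an event. The PurchaseEvent type and its fields are hypothetical illustrations, not part of any Kafka API:

```java
import java.time.Instant;

// A hypothetical event modeled with the three conceptual components:
// key (identifier), value (payload), and timestamp (when it occurred).
public record PurchaseEvent(
        String customerId,  // key: identifies the event and drives routing/grouping
        String item,        // value: the event payload itself
        Instant timestamp   // timestamp: when the event occurred
) {}

// Usage: new PurchaseEvent("customer-123", "flux capacitor", Instant.now());
```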
1.2 An event stream example

Let’s say you’ve purchased a Flux Capacitor and are excited to receive your new purchase. Let’s walk through the events leading up to the time you get your brand new Flux Capacitor, using the following illustration as your guide.

Figure 1.1 A sequence of events comprising an event stream, starting with the online purchase of the flux capacitor
1. You complete the purchase on the retailer’s website, and the site provides a tracking number.
2. The retailer’s warehouse receives the purchase event information and puts the Flux Capacitor on a shipping truck, recording the date and time your purchase left the warehouse.
3. The truck arrives at the airport, and the driver loads the Flux Capacitor on a plane and scans a barcode, recording the date and time.
4. The plane lands, and the package is loaded on a truck again, headed for the regional distribution center. The delivery service records the date and time they loaded your Flux Capacitor.
5. The truck from the airport arrives at the regional distribution center. A delivery service employee unloads the Flux Capacitor, scanning the date and time of its arrival at the distribution center.
6. Another employee takes your Flux Capacitor, scans the package, saves the date and time, and loads it on a truck bound for delivery to you.
7. The driver arrives at your house, scans the package one last time, and hands it to you. You can start building your time-traveling car!

From our example here, you can see how everyday actions create events, and hence an event stream. The individual events are the initial purchase, each time the package changes custody, and the final delivery. This scenario represents the events generated by just one purchase. But if you think of the event streams generated by purchases from Amazon and the various shippers of its products, the number of events could easily reach billions or trillions.

1.3 Introducing the Apache Kafka® event streaming platform

The Kafka event streaming platform provides the core capabilities to implement your event-streaming application from end to end. We can break down these capabilities into three main areas: publishing/consuming, durable storage, and processing. This move, store, and process trilogy enables Kafka to operate as the central nervous system for your data.

Before we go on, it will be helpful to illustrate what it means for Kafka to be the central nervous system for your data. We’ll do this by showing before and after illustrations. Let’s first look at an event-streaming solution where each input source requires separate infrastructure:

Figure 1.2 The initial event-streaming architecture leads to complexity, as the different departments and data stream sources need to be aware of the other sources of events
In the above illustration, individual departments create separate infrastructures to meet their requirements. However, other departments may be interested in consuming the same data, which leads to a more complicated architecture to connect the various input streams. Let’s look at how the Kafka event streaming platform can change things.

Figure 1.3 Using the Kafka event streaming platform, the architecture is simplified
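To give an early feel for the publish side of the move, store, and process trilogy (the clients are covered in depth in chapter 4), here is a minimal sketch of writing a single event with the Java producer client. The topic name, broker address, and key/value strings are hypothetical placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PurchaseEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The producer publishes the event; Kafka durably stores it in order,
        // and any number of downstream applications can consume and process it.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> purchaseEvent =
                new ProducerRecord<>("purchases",        // hypothetical topic name
                                     "customer-123",     // key: the customer ID
                                     "flux capacitor");  // value: the item purchased
            producer.send(purchaseEvent);
        }
    }
}
```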