Kafka Connect: Build and Run Data Pipelines (Mickael Maison and Kate Stanley)

Kafka Connect
Build and Run Data Pipelines
Mickael Maison & Kate Stanley
Foreword by Jay Kreps

“Kafka Connect is the pillar for integrating Apache Kafka with the rest of the data ecosystem. This book tells you everything you need to know to connect external data sources and sinks with Kafka.”
—Jun Rao, cofounder, Confluent

“This comprehensive book covers everything from getting started to productionizing Kafka Connect at scale. It gives you the tools to build streaming data pipelines with Apache Kafka.”
—Ryanne Dolan, software engineer, LinkedIn

Kafka Connect is an awesome tool for building reliable and scalable data pipelines. As part of the Apache Kafka streaming platform, Kafka Connect allows you to easily get data into and out of your Kafka clusters and even mirror data between clusters. With this practical guide, you’ll learn how to build powerful pipelines without writing a single line of code.

Authors Mickael Maison and Kate Stanley show data engineers, site reliability engineers, and application developers how to build pipelines between Kafka clusters and a variety of data sources and sinks. Kafka Connect allows you to quickly adopt Kafka by tapping into existing data and enabling many advanced use cases. No matter where you are in your event streaming journey, Kafka Connect is the ideal tool for building modern data pipelines.

This book shows you how to:
• Design resilient and efficient data pipelines by combining core Kafka Connect components
• Capture database changes, build data lakes, and mirror Kafka clusters using existing connectors
• Deploy, configure, and operate Kafka Connect clusters in production environments
• Use logs and metrics to continually monitor Kafka Connect clusters
• Run Kafka Connect clusters on Kubernetes
• Write your own connectors and plug-ins

Mickael Maison is a principal software engineer at Red Hat, and chair of the project management committee for Apache Kafka.

Kate Stanley is a principal software engineer at Red Hat, a technical speaker, and a Java Champion.

US $79.99 CAN $99.99
ISBN: 978-1-098-12653-7

Praise for Kafka Connect

Kafka Connect is the pillar for integrating Apache Kafka with the rest of the data ecosystem. This book tells you everything you need to know to connect external data sources and sinks with Kafka.
—Jun Rao, cofounder, Confluent

This comprehensive book covers everything from getting started to productionizing Kafka Connect at scale. It gives you the tools to build streaming data pipelines with Apache Kafka.
—Ryanne Dolan, software engineer, LinkedIn

An invaluable resource for both novice and seasoned professionals working with Kafka Connect. It offers comprehensive explanations and a wealth of practical tips.
—Robin Moffatt, rmoff.net

An invaluable resource for anyone looking to use Kafka alongside existing systems. I only wish I’d had access to this book when I first began using Kafka Connect!
—Danica Fine, senior developer advocate, Confluent

Kafka Connect
Build and Run Data Pipelines
Mickael Maison and Kate Stanley

Boston Farnham Sebastopol Tokyo Beijing

Kafka Connect
by Mickael Maison and Kate Stanley

Copyright © 2023 Mickael Maison and Kate Stanley. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Jeff Bleiel
Production Editor: Kristen Brown
Copyeditor: Liz Wheeler
Proofreader: Tove Innis
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2023: First Edition

Revision History for the First Edition
2023-09-18: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098126537 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Kafka Connect, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Red Hat. See our statement of editorial independence.

978-1-098-12653-7
[LSI]

Table of Contents

Foreword
Preface

Part I. Introduction to Kafka Connect

1. Meet Kafka Connect
    Kafka Connect Features
    Pluggable Architecture
    Scalability and Reliability
    Declarative Pipeline Definition
    Part of Apache Kafka
    Use Cases
    Capturing Database Changes
    Mirroring Kafka Clusters
    Building Data Lakes
    Aggregating Logs
    Modernizing Legacy Systems
    Alternatives to Kafka Connect
    Summary

2. Apache Kafka Basics
    A Distributed Event Streaming Platform
    Open Source
    Distributed
    Event Streaming
    Platform
    Kafka Concepts
    Publish-Subscribe
    Brokers and Records
    Topics and Partitions
    Replication
    Retention and Compaction
    KRaft and ZooKeeper
    Interacting with Kafka
    Producers
    Consumers
    Kafka Streams
    Getting Started with Kafka
    Starting Kafka
    Sending and Receiving Records
    Running a Kafka Streams Application
    Summary

Part II. Developing Data Pipelines with Kafka Connect

3. Components in a Kafka Connect Data Pipeline
    Kafka Connect Runtime
    Running Kafka Connect
    Kafka Connect REST API
    Installing Plug-Ins
    Deployment Modes
    Source and Sink Connectors
    Connectors and Tasks
    Configuring Connectors
    Running Connectors
    Converters
    Data Format and Schemas
    Configuring Converters
    Using Converters
    Transformations and Predicates
    Transformation Use Cases
    Predicates
    Configuring Transformations and Predicates
    Using Transformations and Predicates
    Summary

4. Designing Effective Data Pipelines
    Choosing a Connector
    Pipeline Direction
    Licensing and Support
    Connector Features
    Defining Data Models
    Data Transformation
    Mapping Data Between Systems
    Formatting Data
    Data Formats
    Schemas
    Exploring Kafka Connect Internals
    Internal Topics
    Group Membership
    Rebalance Protocols
    Handling Failures in Kafka Connect
    Worker Failure
    Connector/Task Failure
    Kafka/External Systems Failure
    Dead Letter Queues
    Understanding Processing Semantics
    Sink Connectors
    Source Connectors
    Summary

5. Connectors in Action
    Confluent S3 Sink Connector
    Configuring the Connector
    Exactly-Once Semantics
    Running the Connector
    Confluent JDBC Source Connector
    Configuring the Connector
    Running the Connector
    Debezium MySQL Source Connector
    Configuring the Connector
    Event Formats
    Running the Connector
    Summary

6. Mirroring Clusters with MirrorMaker
    Introduction to Mirroring
    Exploring Mirroring Use Cases
    Mirroring in Practice
    Introduction to MirrorMaker
    Common Concepts
    Deployment Modes
    MirrorMaker Connectors
    MirrorSourceConnector
    MirrorCheckpointConnector
    MirrorHeartbeatConnector
    Running MirrorMaker
    Disaster Recovery Example
    Geo-Replication Example
    Summary

Part III. Running Kafka Connect in Production

7. Deploying and Operating Kafka Connect Clusters
    Preparing the Kafka Connect Environment
    Building a Kafka Connect Environment
    Installing Plug-Ins
    Networking and Permissions
    Worker Plug-Ins
    Configuration Providers
    REST Extensions
    Connector Client Configuration Override Policies
    Sizing and Planning Capacity
    Understanding Kafka Connect Resource Utilization
    How Many Workers and Tasks?
    Operating Kafka Connect Clusters
    Adding Workers
    Removing Workers
    Upgrading and Applying Maintenance to Workers
    Restarting Failed Tasks and Connectors
    Resetting Offsets of Connectors
    Administering Kafka Connect Using the REST API
    Creating and Deleting a Connector
    Connector and Task Configuration
    Controlling the Lifecycle of Connectors
    Listing Connector Offsets
    Debugging Issues
    Summary

8. Configuring Kafka Connect
    Configuring the Runtime
    Configurations for Production
    Fine-Tuning Configurations
    Configuring Connectors
    Topic Configurations
    Client Overrides
    Configurations for Exactly-Once
    Configurations for Error Handling
    Configuring Kafka Connect Clusters for Security
    Securing the Connection to Kafka
    Configuring Permissions
    Securing the REST API
    Summary

9. Monitoring Kafka Connect
    Monitoring Logs
    Logging Configuration
    Understanding Startup Logs
    Analyzing Logs
    Monitoring Metrics
    Metrics Reporters
    Analyzing Metrics
    Exploring Metrics
    Key Metrics
    Kafka Connect Runtime Metrics
    Other System Metrics
    Summary

10. Administering Kafka Connect on Kubernetes
    Introduction to Kubernetes
    Virtualization Technologies
    Kubernetes Fundamentals
    Running Kafka Connect on Kubernetes
    Container Image
    Deploying Workers
    Networking and Monitoring
    Configuration
    Using a Kubernetes Operator to Deploy Kafka Connect
    Introduction to Kubernetes Operators
    Kubernetes Operators for Kafka Connect
    Strimzi
    Getting a Kubernetes Environment
    Starting the Operator
    Kafka Connect CRDs
    Deploying a Kafka Connect Cluster and Connectors
    MirrorMaker CRD
    Summary

Part IV. Building Custom Connectors and Plug-Ins

11. Building Source and Sink Connectors
    Common Concepts and APIs
    Building a Custom Connector
    The Connector API
    Configurations
    The Task API
    Kafka Connect Records
    The ConnectorContext API
    Implementing Source Connectors
    The SourceTask API
    Source Records
    The SourceConnectorContext and SourceTaskContext APIs
    Exactly-Once Support
    Implementing Sink Connectors
    The SinkTask API
    Sink Records
    The SinkConnectorContext and SinkTaskContext APIs
    Summary

12. Extending Kafka Connect with Connector and Worker Plug-Ins
    Implementing Connector Plug-Ins
    The Transformation API
    The Predicate API
    The Converter and HeaderConverter APIs
    Implementing Worker Plug-Ins
    The ConfigProvider API
    The ConnectorClientConfigOverridePolicy API
    The ConnectRestExtension APIs
    Summary

Index

Foreword

Consensus protocols, stream processing, distributed systems—in the midst of all the exciting ideas in the streaming world, it can be easy to overlook the role of the humble connector. But connectors solve the most fundamental problem in the streaming world—in a world of data at rest, how do you access streams at all? How do you plug your data-streaming platform into the rest of the business? Kafka Connect’s aim is to make that easier.

Before the Kafka Connect framework existed, we saw many people build integrations with Apache Kafka and repeat the same mistakes. Reading data from one system and writing it to another seems simple enough, but the process can have a lot of hidden complexity. What happens if a machine fails? What happens when requests time out? How do you scale up your integration? Each unique Kafka integration had to solve these problems from scratch. Kafka Connect was designed to separate out the logic of reading and writing to a particular system from a general framework for building, operating, and scaling these integrations.

Kafka Connect is different from other integration or connector layers in a lot of important ways:
• It’s designed for streaming first.
• It works with Kafka’s semantics to enable exactly-once from systems that will allow it, and the strongest semantics possible for systems that don’t.
• It lets you not just capture bytes, but also propagate some of the semantic structure of data.
• It solves a lot of the complex problems in partitioning, scaling, and fault tolerance.
• It lets you operate and monitor a pool of diverse connectors in a homogeneous way off a shared pool of hardware.

I think that these reasons are why Kafka Connect has proven so popular, why it has been embraced by Kafka users of all types, and why its ecosystem of connectors has grown to number in the hundreds.

Success with Kafka Connect lets you open up data from all across your company to your data-streaming platform, but before taking it to production, it’s important to recognize that Kafka Connect is a sophisticated distributed system in its own right and to learn a little bit about how it works. This book is an excellent way to learn how to use connectors, configure the Kafka Connect framework, monitor and operate it in production, and even develop your own source and sink connectors. I can’t think of a better resource for those hoping to dive into this important topic and a faster way to get Kafka connected to the rest of your systems and applications.

— Jay Kreps
CEO, Confluent
August 2023

Preface

Kafka Connect is an awesome tool for building reliable and scalable data pipelines. It is part of the popular Apache Kafka streaming platform, and while it may not get as much attention as the brokers, clients, or Kafka Streams, Kafka Connect is a tool to be aware of. It allows you to easily get data into and out of your Kafka clusters and even mirror data between clusters. Its pluggable design makes it possible to build powerful pipelines without writing a single line of code.

We are both passionate about sharing knowledge, whether that is through presenting at conferences, writing blog posts, or just helping out fellow Kafka enthusiasts. As a result, we have spent a lot of time chatting about both Kafka and Kafka Connect to users and developers all around the world. As Kafka is a tremendously popular technology, there are a lot of great resources available such as books, blog posts, and tutorials. Many of these do cover Kafka Connect, but we see a lack of resources that go deeper into its various use cases, configurations, and operational processes. Although Kafka Connect is not hard to start using with basic knowledge, its flexibility and range of features mean that having a deeper understanding of how it works can really make a difference. We have both given plenty of conference talks about Kafka Connect that go beyond the basics, but there’s only so much you can fit into a 40-minute session.

In writing this book, we have brought together all the knowledge we have shared over the past few years about Kafka Connect, plus everything that you can’t fit into a conference session or a blog post! This includes our own individual experiences running it and the insights we’ve gained helping and advising customers. We have also taken the time to delve into every configuration setting, metric, and API to provide a thorough explanation of how Kafka Connect works. This has often involved writing custom plug-ins to try out code paths, poring over the code, and chatting with other Kafka contributors.

This book will give you all the knowledge needed to build reliable data pipelines for your use cases and run them in production. Kafka: The Definitive Guide [1] is the go-to text for Kafka (we both keep a copy on our desk) and we hope this book will be the same for Kafka Connect.

[1] Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty, Kafka: The Definitive Guide, 2nd Ed. (O’Reilly, 2021); Neha Narkhede, Gwen Shapira, and Todd Palino, 1st Ed. (2017).

Who Should Read This Book

This book is written for all roles that interact with Kafka Connect environments. We have chosen to use the terms data engineers, site reliability engineers, and developers to distinguish between roles. Data engineers design and build pipelines to process and analyze data. This includes selecting the correct tools, designing the data flow, and testing the pipeline. Site reliability engineers are responsible for deploying and administering Kafka Connect environments. They may manage a single Kafka Connect cluster or many, and each cluster might be running multiple data pipelines. Finally, developers customize Kafka Connect by building custom plug-ins. This is an advanced use case, but much of the knowledge that is applicable to this role is also useful for data engineers to assess available tools.

In many organizations, it is likely the same engineers who perform all three roles, but in larger organizations it could be completely different teams. Although we split the book into multiple parts to cover these different roles, you will likely find it useful to understand them all.

You don’t need any prior knowledge of Kafka or Kafka Connect to read this book. If you are already familiar with Kafka, feel free to skip Chapter 2, as this covers the Kafka basics you need to understand to use Kafka Connect. Equally, if you are already familiar with Kafka Connect, this book is still written with you in mind. Throughout the book we share best practices and advanced tips to help you develop your expertise further.

Kafka Versions

Kafka is a very active project, and each new version (released roughly every four months) brings new features and changes. At some point we had to stop revising and pick a version so we could get the book into the hands of readers. We settled on Kafka 3.5.0, released in June 2023, as the version of Kafka to refer to.

Any significant change to Kafka first needs to be voted on by the community. To facilitate this, Kafka uses Kafka Improvement Proposals (KIPs). A KIP is a document in the Kafka wiki that describes the motivation for the change as well as its technical details. Throughout the book, we mention KIPs that are relevant to the features and concepts we cover. If you’re interested in a particular feature, we highly recommend checking out the related KIP to see the motivation and history behind the change. Be aware, though, that sometimes final implementations diverge from the original proposal.

Navigating This Book

This book is composed of twelve chapters grouped into four parts.

Part I consists of two chapters and provides an introduction to Kafka Connect and Kafka in general. It is mostly aimed at engineers who are new to or just getting started with Kafka Connect.

Part II consists of four chapters and explains how to build data pipelines with Kafka Connect. It is particularly relevant for data engineers. Chapters 3 and 4 discuss the core Kafka Connect components and explain how to design resilient and efficient data pipelines by combining them. The other chapters in this part look in detail at some of the most popular connectors. Chapter 5 covers three connectors from the community: Confluent S3 sink, Confluent JDBC source, and Debezium MySQL source. Chapter 6 details how MirrorMaker, the mirroring tool which is part of Kafka, works. This includes the features and configurations of the Source, Checkpoint, and Heartbeat connectors.

Part III consists of four chapters and focuses on the operational aspects of running Kafka Connect. It is aimed at site reliability engineers. Chapter 7 shows how to deploy and operate Kafka Connect clusters in production environments. Chapter 8 goes over all the configuration settings Kafka Connect exposes, and provides some background and context to help you decide how and when to tune them. Chapter 9 describes how to use logs and metrics to continually monitor Kafka Connect clusters. Finally, Chapter 10 discusses the core considerations needed to run Kafka Connect clusters on Kubernetes. This includes a high-level introduction to Kubernetes and an explanation of the options available for deploying Kafka Connect on this type of infrastructure.

Part IV consists of two chapters and explains how to implement custom connectors and plug-ins for Kafka Connect. It goes into detail about the APIs and is aimed at developers looking to customize Kafka Connect for their use cases.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

<REPLACE_ME>
    Text within angle brackets should be replaced with user-supplied values or by values determined by context. For example, if running a file for a connector called my-source, the text might show /connectors/<CONNECTOR_NAME>/config, and you should update it to become /connectors/my-source/config.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-889-8969 (in the United States or Canada)
    707-829-7019 (international or local)
    707-829-0104 (fax)
    support@oreilly.com
    https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/KafkaConnect.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.
Follow us on Twitter: https://twitter.com/oreillymedia.
Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgements

First, we would like to thank all the contributors and members of the Apache Kafka community. This vibrant and welcoming community is one of the reasons why Kafka is so popular and still growing and improving all the time.

Special thanks to Jay Kreps, who took the time to provide the Foreword for this book.

We also thank the many reviewers who have provided feedback throughout the process of writing this book: Robin Moffatt, Randall Hauch, Chris Egerton, Ryanne Dolan, Dale Lane, Gerard Ryan, Jakub Scholz, Paolo Patierno, Federico Valeri, Andrew Schofield, and Chris Cranford. Your input really made a difference and greatly improved the quality of this book. Additionally, we thank the readers who submitted feedback to us after reading the early access version on the O’Reilly website.

Thanks to Eric Johnson and Jess Haberman for making this book a reality and to Aaron Black and Gregory Hyman from the O’Reilly team. We also thank our O’Reilly development editor Jeff Bleiel for all his help in getting the book written and for adjusting timelines to fit around our personal time constraints.

We want to thank all the members of the Kafka team at Red Hat for giving us the space to work on this book for many months.

Mickael wants to thank his family and friends for supporting him throughout this project. Writing a book takes a lot of time, and their help was very important to keep him focused and allow him to finish this book.

Kate would like to thank her husband, Russell, for his patience and support during the writing process. Without his help she wouldn’t have been able to juggle completing a book and becoming a first-time mum. She also wants to thank her parents for their ongoing encouragement in all her pursuits. Finally, Kate thanks her mentors Holly and Erin for showing her what women in tech are capable of.

PART I
Introduction to Kafka Connect

The first part of this book gives a high-level introduction to Kafka Connect. It is aimed at data engineers, architects, and site reliability engineers who are unsure what Kafka Connect is or whether it is the right tool for them.

This part explains Kafka Connect’s key features and why it is so popular. It covers the most common use cases and lists some alternatives. Finally, it provides a basic introduction to Apache Kafka to help you gain the minimum level of Kafka knowledge needed to fully use Kafka Connect. This includes key terms and concepts and how to set up a simple deployment.
