Designing and deploying enterprise messaging queues

Building Data Streaming Applications with Apache Kafka

Manish Kumar
Chanchal Singh

BIRMINGHAM - MUMBAI
Building Data Streaming Applications with Apache Kafka

Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2017
Production reference: 1170817
ISBN 978-1-78728-398-5
Credits

Authors: Manish Kumar, Chanchal Singh
Reviewer: Anshul Joshi
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Tushar Gupta
Content Development Editor: Tejas Limkar
Technical Editor: Dinesh Chaudhary
Copy Editor: Manisha Sinha
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Deepika Naik
About the Authors

Manish Kumar is a Technical Architect at DataMetica Solution Pvt. Ltd. He has approximately 11 years' experience in data management, working as a Data Architect and Product Architect. He has extensive experience in building effective ETL pipelines, implementing security over Hadoop, and providing the best possible solutions to data science problems. Before joining the world of big data, he worked as a Tech Lead for Sears Holding, India. He is a regular speaker on big data concepts such as Hadoop and Hadoop security at various events. Manish has a Bachelor's degree in Information Technology.

I would like to thank my parents, Dr. N.K. Singh and Mrs. Rambha Singh, for their support and blessings; my wife, Mrs. Swati Singh, for successfully keeping me healthy and happy; and my adorable son, Master Lakshya Singh, for teaching me how to enjoy the small things in life. I would like to extend my gratitude to Mr. Prashant Jaiswal, whose mentorship and friendship will remain gems of my life, and Chanchal Singh, my esteemed friend, for standing by me in times of trouble and happiness. This note would be incomplete if I did not mention Mr. Anand Deshpande, Mr. Parashuram Bastawade, Mr. Niraj Kumar, Mr. Rajiv Gupta, and Dr. Phil Shelley for giving me exciting career opportunities and showing trust in me, no matter how adverse the situation was.
Chanchal Singh is a Software Engineer at DataMetica Solution Pvt. Ltd. He has over three years' experience in product development and architecture design, working as a Product Developer, Data Engineer, and Team Lead. He has extensive experience with technologies such as Hadoop, Spark, Storm, Kafka, Hive, Pig, Flume, Java, Spring, and many more. He believes in sharing knowledge and motivating others toward innovation. He is the co-organizer of the Big Data Meetup - Pune Chapter, and he has been recognized for bringing innovative ideas into organizations. He has a Bachelor's degree in Information Technology from the University of Mumbai and a Master's degree in Computer Application from Amity University. He was also part of the Entrepreneur Cell at IIT Mumbai.

I would like to thank my parents, Mr. Parasnath Singh and Mrs. Usha Singh, for showering their blessings on me and for their loving support. I am eternally grateful to my love, Ms. Jyoti, for being with me in every situation and encouraging me. I would also like to express my gratitude to all the mentors I've had over the years. Special thanks to Mr. Abhijeet Shingate, who helped me as a mentor and guided me in the right direction during the initial phase of my career. I am highly indebted to Mr. Manish Kumar, without whom writing this book would have been challenging, for always enlightening me and sharing his knowledge with me. I would like to extend my sincere thanks to a few great personalities: Mr. Rajiv Gupta, Mr. Niraj Kumar, Mr. Parashuram Bastawade, and Dr. Phil Shelley, for giving me ample opportunities to explore solutions for real customer problems and for believing in me.
About the Reviewer

Anshul Joshi is a Data Scientist with experience in recommendation systems, predictive modeling, neural networks, and high-performance computing. His research interests are deep learning, artificial intelligence, computational physics, and biology. Most of the time, he can be caught exploring GitHub or trying out anything new that he can get his hands on. He blogs on .
www.PacktPub.com

For support files and downloads related to your book, please visit . Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at , and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

At , you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at . If you'd like to join our team of regular reviewers, you can e-mail us at . We reward our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents

Preface

Chapter 1: Introduction to Messaging Systems
  Understanding the principles of messaging systems
  Understanding messaging systems
  Peeking into a point-to-point messaging system
  Publish-subscribe messaging system
  Advance Queuing Messaging Protocol
  Using messaging systems in big data streaming applications
  Summary

Chapter 2: Introducing Kafka the Distributed Messaging Platform
  Kafka origins
  Kafka's architecture
  Message topics
  Message partitions
  Replication and replicated logs
  Message producers
  Message consumers
  Role of Zookeeper
  Summary

Chapter 3: Deep Dive into Kafka Producers
  Kafka producer internals
  Kafka Producer APIs
  Producer object and ProducerRecord object
  Custom partition
  Additional producer configuration
  Java Kafka producer example
  Common messaging publishing patterns
  Best practices
  Summary

Chapter 4: Deep Dive into Kafka Consumers
  Kafka consumer internals
  Understanding the responsibilities of Kafka consumers
  Kafka consumer APIs
  Consumer configuration
  Subscription and polling
  Committing and polling
  Additional configuration
  Java Kafka consumer
  Scala Kafka consumer
  Rebalance listeners
  Common message consuming patterns
  Best practices
  Summary

Chapter 5: Building Spark Streaming Applications with Kafka
  Introduction to Spark
  Spark architecture
  Pillars of Spark
  The Spark ecosystem
  Spark Streaming
  Receiver-based integration
  Disadvantages of receiver-based approach
  Java example for receiver-based integration
  Scala example for receiver-based integration
  Direct approach
  Java example for direct approach
  Scala example for direct approach
  Use case log processing - fraud IP detection
  Maven
  Producer
  Property reader
  Producer code
  Fraud IP lookup
  Expose hive table
  Streaming code
  Summary

Chapter 6: Building Storm Applications with Kafka
  Introduction to Apache Storm
  Storm cluster architecture
  The concept of a Storm application
  Introduction to Apache Heron
  Heron architecture
  Heron topology architecture
  Integrating Apache Kafka with Apache Storm - Java
  Example
  Integrating Apache Kafka with Apache Storm - Scala
  Use case – log processing in Storm, Kafka, Hive
  Producer
  Producer code
  Fraud IP lookup
  Running the project
  Summary

Chapter 7: Using Kafka with Confluent Platform
  Introduction to Confluent Platform
  Deep driving into Confluent architecture
  Understanding Kafka Connect and Kafka Stream
  Kafka Streams
  Playing with Avro using Schema Registry
  Moving Kafka data to HDFS
  Camus
  Running Camus
  Gobblin
  Gobblin architecture
  Kafka Connect
  Flume
  Summary

Chapter 8: Building ETL Pipelines Using Kafka
  Considerations for using Kafka in ETL pipelines
  Introducing Kafka Connect
  Deep dive into Kafka Connect
  Introductory examples of using Kafka Connect
  Kafka Connect common use cases
  Summary

Chapter 9: Building Streaming Applications Using Kafka Streams
  Introduction to Kafka Streams
  Using Kafka in Stream processing
  Kafka Stream - lightweight Stream processing library
  Kafka Stream architecture
  Integrated framework advantages
  Understanding tables and Streams together
  Maven dependency
  Kafka Stream word count
  KTable
  Use case example of Kafka Streams
  Maven dependency of Kafka Streams
  Property reader
  IP record producer
  IP lookup service
  Fraud detection application
  Summary

Chapter 10: Kafka Cluster Deployment
  Kafka cluster internals
  Role of Zookeeper
  Replication
  Metadata request processing
  Producer request processing
  Consumer request processing
  Capacity planning
  Capacity planning goals
  Replication factor
  Memory
  Hard drives
  Network
  CPU
  Single cluster deployment
  Multicluster deployment
  Decommissioning brokers
  Data migration
  Summary

Chapter 11: Using Kafka in Big Data Applications
  Managing high volumes in Kafka
  Appropriate hardware choices
  Producer read and consumer write choices
  Kafka message delivery semantics
  At least once delivery
  At most once delivery
  Exactly once delivery
  Big data and Kafka common usage patterns
  Kafka and data governance
  Alerting and monitoring
  Useful Kafka metrics
  Producer metrics
  Broker metrics
  Consumer metrics
  Summary

Chapter 12: Securing Kafka
  An overview of securing Kafka
  Wire encryption using SSL
  Steps to enable SSL in Kafka
  Configuring SSL for Kafka broker
  Configuring SSL for Kafka clients
  Kerberos SASL for authentication
  Steps to enable SASL/GSSAPI in Kafka
  Configuring SASL for Kafka broker
  Configuring SASL for Kafka client - producer and consumer
  Understanding ACL and authorization
  Common ACL operations
  List ACLs
  Understanding Zookeeper authentication
  Apache Ranger for authorization
  Adding Kafka Service to Ranger
  Adding policies
  Best practices
  Summary

Chapter 13: Streaming Application Design Considerations
  Latency and throughput
  Data and state persistence
  Data sources
  External data lookups
  Data formats
  Data serialization
  Level of parallelism
  Out-of-order events
  Message processing semantics
  Summary

Index
Preface

Apache Kafka is a popular distributed streaming platform that acts as a messaging queue or an enterprise messaging system. It lets you publish and subscribe to a stream of records and process them in a fault-tolerant way as they occur.

This book is a comprehensive guide to designing and architecting enterprise-grade streaming applications using Apache Kafka and other big data tools. It includes best practices for building such applications and tackles some common challenges, such as how to use Kafka efficiently to handle high data volumes with ease. This book first takes you through understanding the types of messaging systems and then provides a thorough introduction to Apache Kafka and its internal details. The second part of the book takes you through designing streaming applications using various frameworks and tools, such as Apache Spark, Apache Storm, and more. Once you grasp the basics, we will take you through more advanced concepts in Apache Kafka, such as capacity planning and security. By the end of this book, you will have all the information you need to be comfortable using Apache Kafka and designing efficient streaming data applications with it.

What this book covers

Chapter 1, Introduction to Messaging Systems, introduces the concepts of messaging systems. It covers an overview of messaging systems and their enterprise needs. It further emphasizes the different ways of using messaging systems, such as point-to-point or publish/subscribe. It introduces AMQP as well.

Chapter 2, Introducing Kafka - The Distributed Messaging Platform, introduces distributed messaging platforms such as Kafka. It covers the Kafka architecture and touches upon its internal components. It further explores the role and importance of each Kafka component and how they contribute to the low latency, reliability, and scalability of Kafka messaging systems.

Chapter 3, Deep Dive into Kafka Producers, is about how to publish messages to Kafka systems. This further covers the Kafka Producer APIs and their usage.
It showcases examples of using Kafka Producer APIs with Java and Scala programming languages. It takes a deep dive into Producer message flows and some common patterns for producing messages to Kafka Topics. It walks through some performance optimization techniques for Kafka Producers.
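The two messaging patterns the opening chapter distinguishes, point-to-point and publish/subscribe, can be previewed with a toy in-memory sketch. This is plain Python, not Kafka code; the Broker class and its method names are purely illustrative:

```python
from collections import defaultdict, deque

class Broker:
    """Toy in-memory broker contrasting the two messaging patterns."""
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> subscriber callbacks
        self.queues = defaultdict(deque)       # queue name -> pending messages

    # publish/subscribe: every subscriber of a topic sees every message
    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for cb in self.subscribers[topic]:
            cb(message)

    # point-to-point: each message is consumed by exactly one receiver
    def send(self, queue, message):
        self.queues[queue].append(message)

    def receive(self, queue):
        return self.queues[queue].popleft() if self.queues[queue] else None

broker = Broker()
seen_a, seen_b = [], []
broker.subscribe("orders", seen_a.append)
broker.subscribe("orders", seen_b.append)
broker.publish("orders", "order-1")   # both subscribers receive it

broker.send("jobs", "job-1")
first = broker.receive("jobs")        # consumed once...
second = broker.receive("jobs")       # ...then the queue is empty (None)
```

Kafka's consumer groups blend the two patterns: consumers in different groups behave like publish/subscribe subscribers, while consumers within one group divide a topic's partitions among themselves like a point-to-point queue.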
Chapter 4, Deep Dive into Kafka Consumers, is about how to consume messages from Kafka systems. This also covers the Kafka Consumer APIs and their usage. It showcases examples of using the Kafka Consumer APIs with the Java and Scala programming languages. It takes a deep dive into Consumer message flows and some common patterns for consuming messages from Kafka topics. It walks through some performance optimization techniques for Kafka Consumers.

Chapter 5, Building Spark Streaming Applications with Kafka, is about how to integrate Kafka with the popular distributed processing engine Apache Spark. This also provides a brief overview of Apache Spark, the different approaches for integrating Kafka with Spark, and their advantages and disadvantages. It showcases examples in Java as well as in Scala, with use cases.

Chapter 6, Building Storm Applications with Kafka, is about how to integrate Kafka with the popular real-time processing engine Apache Storm. This also covers a brief overview of Apache Storm and Apache Heron. It showcases examples of different approaches to event processing using Apache Storm and Kafka, including guaranteed event processing.

Chapter 7, Using Kafka with Confluent Platform, is about the emerging streaming platform Confluent, which enables you to use Kafka effectively with many other added functionalities. It showcases many examples for the topics covered in the chapter.

Chapter 8, Building ETL Pipelines Using Kafka, introduces Kafka Connect, a common component used for building ETL pipelines involving Kafka. It emphasizes how to use Kafka Connect in ETL pipelines and discusses some in-depth technical concepts surrounding it.

Chapter 9, Building Streaming Applications Using Kafka Streams, is about how to build streaming applications using Kafka Streams, which is an integral part of the Kafka 0.10 release. This also covers building fast, reliable streaming applications using Kafka Streams, with examples.
Chapter 10, Kafka Cluster Deployment, focuses on Kafka cluster deployment on enterprise-grade production systems. It covers Kafka clusters in depth: how to do capacity planning, how to manage single/multi-cluster deployments, and so on. It also covers how to manage Kafka in multi-tenant environments. It further walks you through the various steps involved in Kafka data migrations.

Chapter 11, Using Kafka in Big Data Applications, walks through some of the aspects of using Kafka in big data applications. This covers how to manage high volumes in Kafka, how to ensure guaranteed message delivery, the best ways to handle failures without any data loss, and some governance principles that can be applied while using Kafka in big data pipelines.
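The delivery-semantics trade-off mentioned above can be previewed with a toy simulation: acknowledging a message before processing risks losing it, while acknowledging only after processing risks redelivering it. This is plain Python, not Kafka code, and the function names are illustrative only:

```python
def at_most_once(messages, process, fail_on=None):
    """Ack (drop) the message before processing: a crash loses it."""
    out = []
    for m in messages:
        # the message is considered delivered here, before processing
        if m == fail_on:
            continue            # simulated crash after ack: message lost
        process(out, m)
    return out

def at_least_once(messages, process, fail_once_on=None):
    """Ack only after processing: a retry can duplicate a message."""
    out = []
    retried = set()
    for m in messages:
        process(out, m)
        if m == fail_once_on and m not in retried:
            retried.add(m)      # simulated crash before ack...
            process(out, m)     # ...so the message is redelivered: duplicate
    return out

append = lambda out, m: out.append(m)
msgs = ["m1", "m2", "m3"]
lossy = at_most_once(msgs, append, fail_on="m2")        # m2 is lost
dupey = at_least_once(msgs, append, fail_once_on="m2")  # m2 is duplicated
```

Exactly-once semantics, the third option the chapter discusses, amounts to at-least-once delivery combined with deduplication or transactional processing on the consumer side.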
Chapter 12, Securing Kafka, is about securing your Kafka cluster. It covers authentication and authorization mechanisms, along with examples.

Chapter 13, Streaming Applications Design Considerations, is about the different design considerations for building a streaming application. It walks you through aspects such as parallelism, memory tuning, and so on. It provides comprehensive coverage of the different paradigms for designing a streaming application.

What you need for this book

You will need the following software to work with the examples in this book: Apache Kafka, big data, Apache Hadoop, publish and subscribe, enterprise messaging system, distributed Streaming, Producer API, Consumer API, Streams API, Connect API

Who this book is for

If you want to learn how to use Apache Kafka and the various tools in the Kafka ecosystem in the easiest possible manner, this book is for you. Some programming experience with Java is required to get the most out of this book.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The next lines of code read the link and assign it to the function."

A block of code is set as follows:

Any command-line input or output is written as follows:

sudo su - hdfs -c "hdfs dfs -chmod 777 /tmp/hive"
sudo chmod 777 /tmp/hive
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail , and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at .

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at . If you purchased this book elsewhere, you can visit and register to have the files e-mailed directly to you.
You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at . We also have other code bundles from our rich catalog of books and videos available at . Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from .
Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting , selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title. To view the previously submitted errata, go to and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at , and we will do our best to address the problem.