Stream Processing with Apache Spark
Best Practices for Scaling and Optimizing Apache Spark
Gerard Maas & François Garillot
Stream Processing with Apache Spark
Mastering Structured Streaming and Spark Streaming
Gerard Maas and François Garillot

Beijing • Boston • Farnham • Sebastopol • Tokyo
Stream Processing with Apache Spark
by Gerard Maas and François Garillot

Copyright © 2019 François Garillot and Gerard Maas Images. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Rachel Roumeliotis
Developmental Editor: Jeff Bleiel
Production Editor: Nan Barber
Copyeditor: Octal Publishing Services, LLC
Proofreader: Kim Cofer
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2019: First Edition

Revision History for the First Edition
2019-06-12: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491944240 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Stream Processing with Apache Spark, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-491-94424-0 [LSI]
Table of Contents

Foreword  xv
Preface  xvii

Part I. Fundamentals of Stream Processing with Apache Spark

1. Introducing Stream Processing  3
    What Is Stream Processing?  4
    Batch Versus Stream Processing  5
    The Notion of Time in Stream Processing  5
    The Factor of Uncertainty  6
    Some Examples of Stream Processing  7
    Scaling Up Data Processing  8
    MapReduce  8
    The Lesson Learned: Scalability and Fault Tolerance  9
    Distributed Stream Processing  10
    Stateful Stream Processing in a Distributed System  10
    Introducing Apache Spark  11
    The First Wave: Functional APIs  11
    The Second Wave: SQL  11
    A Unified Engine  12
    Spark Components  12
    Spark Streaming  14
    Structured Streaming  14
    Where Next?  15
2. Stream-Processing Model  17
    Sources and Sinks  17
    Immutable Streams Defined from One Another  18
    Transformations and Aggregations  19
    Window Aggregations  19
    Tumbling Windows  20
    Sliding Windows  20
    Stateless and Stateful Processing  21
    Stateful Streams  21
    An Example: Local Stateful Computation in Scala  23
    A Stateless Definition of the Fibonacci Sequence as a Stream Transformation  24
    Stateless or Stateful Streaming  25
    The Effect of Time  25
    Computing on Timestamped Events  26
    Timestamps as the Provider of the Notion of Time  26
    Event Time Versus Processing Time  27
    Computing with a Watermark  30
    Summary  32

3. Streaming Architectures  33
    Components of a Data Platform  34
    Architectural Models  36
    The Use of a Batch-Processing Component in a Streaming Application  36
    Referential Streaming Architectures  37
    The Lambda Architecture  37
    The Kappa Architecture  38
    Streaming Versus Batch Algorithms  39
    Streaming Algorithms Are Sometimes Completely Different in Nature  40
    Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms  41
    Summary  42

4. Apache Spark as a Stream-Processing Engine  45
    The Tale of Two APIs  45
    Spark’s Memory Usage  47
    Failure Recovery  47
    Lazy Evaluation  47
    Cache Hints  47
    Understanding Latency  48
    Throughput-Oriented Processing  49
    Spark’s Polyglot API  50
    Fast Implementation of Data Analysis  50
    To Learn More About Spark  51
    Summary  51

5. Spark’s Distributed Processing Model  53
    Running Apache Spark with a Cluster Manager  53
    Examples of Cluster Managers  54
    Spark’s Own Cluster Manager  55
    Understanding Resilience and Fault Tolerance in a Distributed System  56
    Fault Recovery  57
    Cluster Manager Support for Fault Tolerance  57
    Data Delivery Semantics  58
    Microbatching and One-Element-at-a-Time  61
    Microbatching: An Application of Bulk-Synchronous Processing  61
    One-Record-at-a-Time Processing  62
    Microbatching Versus One-at-a-Time: The Trade-Offs  63
    Bringing Microbatch and One-Record-at-a-Time Closer Together  64
    Dynamic Batch Interval  64
    Structured Streaming Processing Model  65
    The Disappearance of the Batch Interval  65

6. Spark’s Resilience Model  67
    Resilient Distributed Datasets in Spark  67
    Spark Components  70
    Spark’s Fault-Tolerance Guarantees  71
    Task Failure Recovery  72
    Stage Failure Recovery  73
    Driver Failure Recovery  73
    Summary  75

A. References for Part I  77

Part II. Structured Streaming

7. Introducing Structured Streaming  83
    First Steps with Structured Streaming  84
    Batch Analytics  85
    Streaming Analytics  88
    Connecting to a Stream  88
    Preparing the Data in the Stream  89
    Operations on Streaming Dataset  90
    Creating a Query  90
    Start the Stream Processing  91
    Exploring the Data  92
    Summary  93

8. The Structured Streaming Programming Model  95
    Initializing Spark  96
    Sources: Acquiring Streaming Data  96
    Available Sources  98
    Transforming Streaming Data  98
    Streaming API Restrictions on the DataFrame API  99
    Sinks: Output the Resulting Data  101
    format  102
    outputMode  103
    queryName  103
    option  104
    options  104
    trigger  104
    start()  105
    Summary  105

9. Structured Streaming in Action  107
    Consuming a Streaming Source  108
    Application Logic  110
    Writing to a Streaming Sink  110
    Summary  112

10. Structured Streaming Sources  113
    Understanding Sources  113
    Reliable Sources Must Be Replayable  114
    Sources Must Provide a Schema  115
    Available Sources  117
    The File Source  118
    Specifying a File Format  118
    Common Options  119
    Common Text Parsing Options (CSV, JSON)  120
    JSON File Source Format  121
    CSV File Source Format  123
    Parquet File Source Format  124
    Text File Source Format  124
    The Kafka Source  125
    Setting Up a Kafka Source  126
    Selecting a Topic Subscription Method  127
    Configuring Kafka Source Options  128
    Kafka Consumer Options  129
    The Socket Source  130
    Configuration  131
    Operations  132
    The Rate Source  132
    Options  133

11. Structured Streaming Sinks  135
    Understanding Sinks  135
    Available Sinks  136
    Reliable Sinks  136
    Sinks for Experimentation  137
    The Sink API  137
    Exploring Sinks in Detail  137
    The File Sink  138
    Using Triggers with the File Sink  139
    Common Configuration Options Across All Supported File Formats  141
    Common Time and Date Formatting (CSV, JSON)  142
    The CSV Format of the File Sink  142
    The JSON File Sink Format  143
    The Parquet File Sink Format  143
    The Text File Sink Format  144
    The Kafka Sink  144
    Understanding the Kafka Publish Model  144
    Using the Kafka Sink  145
    The Memory Sink  148
    Output Modes  149
    The Console Sink  149
    Options  149
    Output Modes  149
    The Foreach Sink  150
    The ForeachWriter Interface  150
    TCP Writer Sink: A Practical ForeachWriter Example  151
    The Moral of this Example  154
    Troubleshooting ForeachWriter Serialization Issues  155

12. Event Time–Based Stream Processing  157
    Understanding Event Time in Structured Streaming  158
    Using Event Time  159
    Processing Time  160
    Watermarks  160
    Time-Based Window Aggregations  161
    Defining Time-Based Windows  162
    Understanding How Intervals Are Computed  163
    Using Composite Aggregation Keys  163
    Tumbling and Sliding Windows  164
    Record Deduplication  165
    Summary  166

13. Advanced Stateful Operations  167
    Example: Car Fleet Management  168
    Understanding Group with State Operations  168
    Internal State Flow  169
    Using MapGroupsWithState  170
    Using FlatMapGroupsWithState  173
    Output Modes  176
    Managing State Over Time  176
    Summary  179

14. Monitoring Structured Streaming Applications  181
    The Spark Metrics Subsystem  182
    Structured Streaming Metrics  182
    The StreamingQuery Instance  183
    Getting Metrics with StreamingQueryProgress  184
    The StreamingQueryListener Interface  186
    Implementing a StreamingQueryListener  186

15. Experimental Areas: Continuous Processing and Machine Learning  189
    Continuous Processing  189
    Understanding Continuous Processing  189
    Using Continuous Processing  192
    Limitations  192
    Machine Learning  193
    Learning Versus Exploiting  193
    Applying a Machine Learning Model to a Stream  194
    Example: Estimating Room Occupancy by Using Ambient Sensors  195
    Online Training  198
B. References for Part II  199

Part III. Spark Streaming

16. Introducing Spark Streaming  205
    The DStream Abstraction  206
    DStreams as a Programming Model  206
    DStreams as an Execution Model  207
    The Structure of a Spark Streaming Application  208
    Creating the Spark Streaming Context  209
    Defining a DStream  209
    Defining Output Operations  210
    Starting the Spark Streaming Context  210
    Stopping the Streaming Process  211
    Summary  211

17. The Spark Streaming Programming Model  213
    RDDs as the Underlying Abstraction for DStreams  213
    Understanding DStream Transformations  216
    Element-Centric DStream Transformations  218
    RDD-Centric DStream Transformations  220
    Counting  221
    Structure-Changing Transformations  222
    Summary  222

18. The Spark Streaming Execution Model  225
    The Bulk-Synchronous Architecture  225
    The Receiver Model  227
    The Receiver API  227
    How Receivers Work  228
    The Receiver’s Data Flow  228
    The Internal Data Resilience  230
    Receiver Parallelism  230
    Balancing Resources: Receivers Versus Processing Cores  231
    Achieving Zero Data Loss with the Write-Ahead Log  232
    The Receiverless or Direct Model  233
    Summary  234

19. Spark Streaming Sources  237
    Types of Sources  238
    Basic Sources  238
    Receiver-Based Sources  238
    Direct Sources  239
    Commonly Used Sources  239
    The File Source  240
    How It Works  241
    The Queue Source  243
    How It Works  244
    Using a Queue Source for Unit Testing  244
    A Simpler Alternative to the Queue Source: The ConstantInputDStream  245
    The Socket Source  248
    How It Works  248
    The Kafka Source  249
    Using the Kafka Source  252
    How It Works  253
    Where to Find More Sources  253

20. Spark Streaming Sinks  255
    Output Operations  255
    Built-In Output Operations  257
    print  257
    saveAsxyz  258
    foreachRDD  259
    Using foreachRDD as a Programmable Sink  260
    Third-Party Output Operations  263

21. Time-Based Stream Processing  265
    Window Aggregations  265
    Tumbling Windows  266
    Window Length Versus Batch Interval  266
    Sliding Windows  267
    Sliding Windows Versus Batch Interval  267
    Sliding Windows Versus Tumbling Windows  268
    Using Windows Versus Longer Batch Intervals  268
    Window Reductions  269
    reduceByWindow  270
    reduceByKeyAndWindow  270
    countByWindow  270
    countByValueAndWindow  271
    Invertible Window Aggregations  271
    Slicing Streams  273
    Summary  273
22. Arbitrary Stateful Streaming Computation  275
    Statefulness at the Scale of a Stream  275
    updateStateByKey  276
    Limitation of updateStateByKey  278
    Performance  278
    Memory Usage  279
    Introducing Stateful Computation with mapWithState  279
    Using mapWithState  281
    Event-Time Stream Computation Using mapWithState  283

23. Working with Spark SQL  287
    Spark SQL  288
    Accessing Spark SQL Functions from Spark Streaming  289
    Example: Writing Streaming Data to Parquet  289
    Dealing with Data at Rest  293
    Using Join to Enrich the Input Stream  293
    Join Optimizations  296
    Updating Reference Datasets in a Streaming Application  298
    Enhancing Our Example with a Reference Dataset  299
    Summary  301

24. Checkpointing  303
    Understanding the Use of Checkpoints  304
    Checkpointing DStreams  309
    Recovery from a Checkpoint  310
    Limitations  311
    The Cost of Checkpointing  312
    Checkpoint Tuning  312

25. Monitoring Spark Streaming  315
    The Streaming UI  316
    Understanding Job Performance Using the Streaming UI  318
    Input Rate Chart  318
    Scheduling Delay Chart  319
    Processing Time Chart  320
    Total Delay Chart  320
    Batch Details  321
    The Monitoring REST API  322
    Using the Monitoring REST API  323
    Information Exposed by the Monitoring REST API  323
    The Metrics Subsystem  324
    The Internal Event Bus  326
    Interacting with the Event Bus  327
    Summary  330

26. Performance Tuning  333
    The Performance Balance of Spark Streaming  333
    The Relationship Between Batch Interval and Processing Delay  334
    The Last Moments of a Failing Job  334
    Going Deeper: Scheduling Delay and Processing Delay  335
    Checkpoint Influence in Processing Time  336
    External Factors that Influence the Job’s Performance  337
    How to Improve Performance?  338
    Tweaking the Batch Interval  338
    Limiting the Data Ingress with Fixed-Rate Throttling  339
    Backpressure  339
    Dynamic Throttling  340
    Tuning the Backpressure PID  341
    Custom Rate Estimator  342
    A Note on Alternative Dynamic Handling Strategies  342
    Caching  342
    Speculative Execution  344

C. References for Part III  347

Part IV. Advanced Spark Streaming Techniques

27. Streaming Approximation and Sampling Algorithms  351
    Exactness, Real Time, and Big Data  352
    Exactness  352
    Real-Time Processing  352
    Big Data  353
    The Exactness, Real-Time, and Big Data Triangle  353
    Big Data and Real Time  354
    Approximation Algorithms  356
    Hashing and Sketching: An Introduction  356
    Counting Distinct Elements: HyperLogLog  357
    Role-Playing Exercise: If We Were a System Administrator  358
    Practical HyperLogLog in Spark  361
    Counting Element Frequency: Count Min Sketches  365
    Introducing Bloom Filters  366
    Bloom Filters with Spark  367
    Computing Frequencies with a Count-Min Sketch  367
    Ranks and Quantiles: T-Digest  370
    T-Digest in Spark  372
    Reducing the Number of Elements: Sampling  373
    Random Sampling  373
    Stratified Sampling  374

28. Real-Time Machine Learning  377
    Streaming Classification with Naive Bayes  378
    streamDM Introduction  380
    Naive Bayes in Practice  381
    Training a Movie Review Classifier  382
    Introducing Decision Trees  383
    Hoeffding Trees  385
    Hoeffding Trees in Spark, in Practice  387
    Streaming Clustering with Online K-Means  388
    K-Means Clustering  388
    Online Data and K-Means  389
    The Problem of Decaying Clusters  390
    Streaming K-Means with Spark Streaming  393

D. References for Part IV  395

Part V. Beyond Apache Spark

29. Other Distributed Real-Time Stream Processing Systems  399
    Apache Storm  399
    Processing Model  400
    The Storm Topology  400
    The Storm Cluster  401
    Compared to Spark  401
    Apache Flink  402
    A Streaming-First Framework  402
    Compared to Spark  403
    Kafka Streams  404
    Kafka Streams Programming Model  404
    Compared to Spark  404
    In the Cloud  405
    Amazon Kinesis on AWS  405
    Microsoft Azure Stream Analytics  406
    Apache Beam/Google Cloud Dataflow  407
30. Looking Ahead  411
    Stay Plugged In  412
    Seek Help on Stack Overflow  412
    Start Discussions on the Mailing Lists  412
    Attend Conferences  413
    Attend Meetups  413
    Read Books  413
    Contributing to the Apache Spark Project  413

E. References for Part V  415

Index  417
Foreword

Welcome to Stream Processing with Apache Spark! It’s very exciting to see how much both the Apache Spark project and stream processing with Apache Spark have come along since the project was first started by Matei Zaharia at the University of California, Berkeley in 2009. Apache Spark started off as the first unified engine for big data processing and has grown into the de facto standard for all things big data.

Stream Processing with Apache Spark is an excellent introduction to the concepts, tools, and capabilities of Apache Spark as a stream processing engine. This book will first introduce you to the core Spark concepts necessary to understand modern distributed processing. Then it will explore different stream processing architectures and the fundamental architectural trade-offs between them. Finally, it will illustrate how Structured Streaming in Apache Spark makes it easy to implement distributed streaming applications. In addition, it will also cover the older Spark Streaming (aka DStream) APIs for building streaming applications with legacy connectors.

In all, this book covers everything you’ll need to know to master building and operating streaming applications using Apache Spark! We look forward to hearing about what you’ll build!

— Tathagata Das
Cocreator of Spark Streaming and Structured Streaming

— Michael Armbrust
Cocreator of Spark SQL and Structured Streaming
— Bill Chambers
Coauthor of Spark: The Definitive Guide

May 2019
Preface

Who Should Read This Book?

We created this book for software professionals who have an affinity for data, who want to improve their knowledge and skills in the area of stream processing, and who are already familiar with Apache Spark or want to use it for their streaming applications.

We have included a comprehensive introduction to the concepts behind stream processing. These concepts form the foundation for understanding the two streaming APIs offered by Apache Spark: Structured Streaming and Spark Streaming.

We offer an in-depth exploration of these APIs and provide insights into their features and application, along with practical advice derived from our experience. Beyond the coverage of the APIs and their practical applications, we also discuss several advanced techniques that belong in the toolbox of every stream-processing practitioner.

Readers of all levels will benefit from the introductory parts of the book, whereas more experienced professionals will draw new insights from the advanced techniques covered and will receive guidance on how to learn more.

We have made no assumptions about your required knowledge of Spark, but readers who are not familiar with Spark’s data-processing capabilities should be aware that in this book, we focus on its streaming capabilities and APIs. For a more general view of the Spark capabilities and ecosystem, we recommend Spark: The Definitive Guide by Bill Chambers and Matei Zaharia (O’Reilly).

The programming language used across the book is Scala. Although Spark provides bindings in Scala, Java, Python, and R, we think that Scala is the language of choice for streaming applications. Even though many of the code samples could be translated into other languages, some areas, such as complex stateful computations, are best approached using the Scala programming language.
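To give a first taste of what this looks like in practice, here is a minimal sketch of a local stateful computation written in plain Scala. It is our own illustration rather than one of the book’s worked examples, and the temperature readings in it are invented; because it uses only the standard library, you can paste it into any Scala REPL to follow along.

    // A minimal sketch (not a worked example from the book): a small, made-up
    // series of temperature readings and the running average over it, computed
    // with scanLeft while threading (count, mean) through as explicit state.
    val readings = List(21.5, 22.0, 19.8, 20.4)

    val runningAverages = readings
      .scanLeft((0, 0.0)) { case ((count, mean), value) =>
        val newCount = count + 1
        (newCount, mean + (value - mean) / newCount)
      }
      .tail        // drop the initial (0, 0.0) seed state
      .map(_._2)   // keep only the running mean

    // Prints List(21.5, 21.75, 21.1, 20.925), up to floating-point rounding.
    println(runningAverages)

Note how the types of readings and runningAverages are inferred rather than declared, and how the state lives entirely in the accumulator passed from one step to the next; Chapter 2 develops this idea of local stateful computation before generalizing it to streams.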
Installing Spark

Spark is an open source project hosted officially by the Apache Software Foundation, although it mostly uses GitHub for its development. You can also download it as a precompiled binary package at the following address: https://spark.apache.org/downloads.html. From there, you can begin running Spark on one or more machines, as we explain later. Packages exist for all of the major Linux distributions, which should help installation.

For the purposes of this book, we use examples and code compatible with Spark 2.4.0, and except for minor output and formatting details, those examples should stay compatible with future Spark versions.

Note, however, that Spark is a program that runs on the Java Virtual Machine (JVM), which you should install and make accessible on every machine on which any Spark component will run. To install a Java Development Kit (JDK), we recommend OpenJDK, which is packaged for many systems and architectures. You can also install the Oracle JDK. Spark, like any Scala program, runs on any system on which a JDK version 6 or later is present. The recommended Java runtime for Spark depends on the version:

• For Spark versions below 2.0, Java 7 is the recommended version.
• For Spark versions 2.0 and above, Java 8 is the recommended version.

Learning Scala

The examples in this book are in Scala. This is the implementation language of core Spark, but it is far from the only language in which Spark can be used; as of this writing, Spark also offers APIs in Python, Java, and R.

Scala is one of the most feature-complete programming languages today, in that it offers both functional and object-oriented aspects. Yet, its concision and type inference make the basic elements of its syntax easy to understand.

    Scala as a beginner language has many advantages from a pedagogical viewpoint, its regular syntax and semantics being one of the most important.
    —Björn Regnell, Lund University

Hence, we hope the examples will stay clear enough for any reader to pick up their meanings. However, for readers who might want a primer on the language and