Uploader

高宏飞

Shared on 2025-12-07

Author: Rukmani Gopalan

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights. This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.
• Learn the benefits of a cloud-based big data strategy for your organization
• Get guidance and best practices for designing performant and scalable data lakes
• Examine architecture and design choices, and data governance principles and strategies
• Build a data strategy that scales as your organizational and business needs increase
• Implement a scalable data lake in the cloud
• Use cloud-based advanced analytics to gain more value from your data

ISBN: 1098116585
Publisher: O'Reilly Media
Publish Year: 2023
Language: English
Pages: 247
File Format: PDF
File Size: 7.1 MB
Text Preview (First 20 pages)

Rukmani Gopalan
The Cloud Data Lake
A Guide to Building Robust Cloud Data Architecture
DATA

“The Cloud Data Lake provides the essential understanding needed to support data workloads in the cloud.”
—Prasanna Sundararajan, Principal Software Architect, Microsoft Azure

“This book is a must read for every person in the big data field.”
—Andrei Ionescu, Senior Software Engineer, Adobe

The Cloud Data Lake
US $65.99 CAN $82.99
ISBN: 978-1-098-11658-3
Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

More than ever, organizations understand the importance of cloud data lake architectures for deriving value from their data. But building a robust, scalable, and performant data lake remains a complex proposition, given the large buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights. This practical book delivers a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.
• Learn the benefits of a cloud-based big data strategy for your organization
• Get guidance and best practices for designing performant and scalable data lakes
• Examine architecture and design choices and data governance principles and strategies
• Build a data strategy that scales as your organizational and business needs increase
• Implement a scalable data lake in the cloud
• Use cloud-based advanced analytics to gain more value from your data

Rukmani Gopalan is a product management leader who has worked on data infrastructure and platforms at Microsoft and other startups. Her goal is to educate data architects and data developers on the various aspects of building cloud data lake platforms.
Rukmani lives in Redmond, Washington, and enjoys exploring the Pacific Northwest, one conversation and one cup of coffee at a time.
Praise for The Cloud Data Lake

Rukmani gives the business and technical community a thoughtful and unbiased tour of modern data and analytics technologies. She uncovers first principles, empowering decision makers to understand if building a data lake makes sense for them.
—Gordon Wong, Founder, Wong Decisions

Highly recommended reading for cloud solution architects for understanding the emerging cloud data lake architectures.
—Chidamber Kulkarni, Cloud Solutions Architect at Intel

We are in the cloud era with almost unlimited cheap storage and lots of processing power, a time when companies want to migrate to the cloud. To have a successful story, those who make decisions need to understand what a data lake is; why, when, and where it is needed; what aspects can be tuned together; and their pros and cons. This book is the answer to this need. It is helpful that the book details the available table formats, cloud offerings, and frameworks that can be used to process data, the storage layer, and then how to put these together for a performant solution suited for your needs. The decision framework that Rukmani provides in the book will help you make an informed decision on which kind of data lake to choose. This book is a must read for every person in the big data field.
—Andrei Ionescu, Senior Software Engineer, Adobe
With data analytics workloads migrating to the cloud, getting an understanding of end-to-end architecture provides the necessary context to make the right trade-offs to build and support required data infrastructure tailored to various use-cases. The Cloud Data Lake provided me with the essential understanding needed to support data workloads in the cloud. —Prasanna Sundararajan, Principal Software Architect, Microsoft Azure
Rukmani Gopalan
The Cloud Data Lake
A Guide to Building Robust Cloud Data Architecture
Beijing, Boston, Farnham, Sebastopol, Tokyo
The Cloud Data Lake
by Rukmani Gopalan
Copyright © 2023 Rukmani Gopalan. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Andy Kwan
Development Editor: Jill Leonard
Production Editor: Ashley Stussy
Copyeditor: Shannon Turlington
Proofreader: Piper Editorial Consulting, LLC
Indexer: Sue Klefstad
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

December 2022: First Edition
Revision History for the First Edition
2022-12-09: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781098116583 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Cloud Data Lake, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author(s), and do not represent the publisher’s views. While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-11658-3
[LSI]
Table of Contents

Preface

1. Big Data—Beyond the Buzz
   What Is Big Data?
   Elastic Data Infrastructure—The Challenge
   Cloud Computing Fundamentals
   Cloud Computing Terminology
   Value Proposition of the Cloud
   Cloud Data Lake Architecture
   Limitations of On-Premises Data Warehouse Solutions
   What Is a Cloud Data Lake Architecture?
   Benefits of a Cloud Data Lake Architecture
   Defining Your Cloud Data Lake Journey
   Summary

2. Big Data Architectures on the Cloud
   Why Klodars Corporation Moves to the Cloud
   Fundamentals of Cloud Data Lake Architectures
   A Word on Variety of Data
   Cloud Data Lake Storage
   Big Data Analytics Engines
   Cloud Data Warehouses
   Modern Data Warehouse Architecture
   Reference Architecture
   Sample Use Case for a Modern Data Warehouse Architecture
   Benefits and Challenges of Modern Data Warehouse Architecture
   Data Lakehouse Architecture
   Reference Architecture for the Data Lakehouse
   Sample Use Case for Data Lakehouse Architecture
   Benefits and Challenges of the Data Lakehouse Architecture
   Data Warehouses and Unstructured Data
   Data Mesh
   Reference Architecture
   Sample Use Case for a Data Mesh Architecture
   Challenges and Benefits of a Data Mesh Architecture
   What Is the Right Architecture for Me?
   Know Your Customers
   Know Your Business Drivers
   Consider Your Growth and Future Scenarios
   Design Considerations
   Hybrid Approaches
   Summary

3. Design Considerations for Your Data Lake
   Setting Up the Cloud Data Lake Infrastructure
   Identify Your Goals
   Plan Your Architecture and Deliverables
   Implement the Cloud Data Lake
   Release and Operationalize
   Organizing Data in Your Data Lake
   A Day in the Life of Data
   Data Lake Zones
   Organization Mechanisms
   Introduction to Data Governance
   Actors Involved in Data Governance
   Data Classification
   Metadata Management, Data Catalog, and Data Sharing
   Data Access Management
   Data Quality and Observability
   Data Governance at Klodars Corporation
   Data Governance Wrap-Up
   Manage Data Lake Costs
   Demystifying Data Lake Costs on the Cloud
   Data Lake Cost Strategy
   Summary

4. Scalable Data Lakes
   A Sneak Peek into Scalability
   What Is Scalability?
   Scale in Our Day-to-Day Life
   Scalability in Data Lake Architectures
   Internals of Data Lake Processing Systems
   Data Copy Internals
   ELT/ETL Processing Internals
   A Note on Other Interactive Queries
   Considerations for Scalable Data Lake Solutions
   Pick the Right Cloud Offerings
   Plan for Peak Capacity
   Data Formats and Job Profile
   Summary

5. Optimizing Cloud Data Lake Architectures for Performance
   Basics of Measuring Performance
   Goals and Metrics for Performance
   Measuring Performance
   Optimizing for Faster Performance
   Cloud Data Lake Performance
   SLAs, SLOs, and SLIs
   Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
   Drivers of Performance
   Performance Drivers for a Copy Job
   Performance Drivers for a Spark Job
   Optimization Principles and Techniques for Performance Tuning
   Data Formats
   Data Organization and Partitioning
   Choosing the Right Configurations on Apache Spark
   Minimize Overheads with Data Transfer
   Premium Offerings and Performance
   The Case of Bigger Virtual Machines
   The Case of Flash Storage
   Summary

6. Deep Dive on Data Formats
   Why Do We Need These Open Data Formats?
   Why Do We Need to Store Tabular Data?
   Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
   Delta Lake
   Why Was Delta Lake Founded?
   How Does Delta Lake Work?
   When Do You Use Delta Lake?
   Apache Iceberg
   Why Was Apache Iceberg Founded?
   How Does Apache Iceberg Work?
   When Do You Use Apache Iceberg?
   Apache Hudi
   Why Was Apache Hudi Founded?
   How Does Apache Hudi Work?
   When Do You Use Apache Hudi?
   Summary

7. Decision Framework for Your Architecture
   Cloud Data Lake Assessment
   Cloud Data Lake Assessment Questionnaire
   Analysis for Your Cloud Data Lake Assessment
   Starting from Scratch
   Migrating an Existing Data Lake or Data Warehouse to the Cloud
   Improving an Existing Cloud Data Lake
   Phase 1 of Decision Framework: Assess
   Understand Customer Requirements
   Understand Opportunities for Improvement
   Know Your Business Drivers
   Complete the Assess Phase by Prioritizing the Requirements
   Phase 2 of Decision Framework: Define
   Finalize the Design Choices for the Cloud Data Lake
   Plan Your Cloud Data Lake Project Deliverables
   Phase 3 of Decision Framework: Implement
   Phase 4 of Decision Framework: Operationalize
   Summary

8. Six Lessons for a Data Informed Future
   Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
   Lesson 2: With Great Power Comes Great Responsibility—Data Is No Exception
   Lesson 3: Customers Lead Technology, Not the Other Way Around
   Lesson 4: Change Is Inevitable, so Be Prepared
   Lesson 5: Build Empathy and Prioritize Ruthlessly
   Lesson 6: Big Impact Does Not Happen Overnight
   Summary

Appendix. Cloud Data Lake Decision Framework Template

Index
Preface

It's six in the morning; your phone gently wakes you up and automatically turns on your notifications. Your smart refrigerator reminds you that you need to order milk and shows you an option to place an order to buy more since it knows you are running low. You do that and hop on your exercise machine, where you see personalized picks based on your workout routines. You get ready and eat breakfast without bothering to look at the clock because you know your phone will tell you when it is time to start driving based on what it has learned about your commute and the traffic patterns. As you leave, your smart home assistant ensures the lights are turned off and the doors are locked. What would have seemed like science fiction a few decades ago is a regular day in our lives now.

All this is possible because of the leaps technology has made in three key areas: devices that have made computing ubiquitous, connectivity that has shrunk the world by bringing the knowledge of the internet to these devices, and technology (data, artificial intelligence, machine learning) that has helped devices learn patterns and make decisions. Data is now at the heart of how the world operates, and organizations increasingly rely on data to both inform and transform their businesses.

My mind goes back to 2013, when my own personal journey with data started as I worked on identity and personalization services for Microsoft Office. It was a year of a great many learnings for me. I understood what it meant to develop cloud-based applications, including the nuances of building a direct-to-consumer experience versus an enterprise-ready application. Most of all, though, I was thrilled at the possibility of having a direct connection to customer experiences from these cloud services.
When we shipped boxed products (i.e., products that shipped on a CD or DVD) and had customers install them on their devices, the only ways for us to understand their experiences were to get anonymized telemetry data, organize user research studies or focus groups, or read through support cases when the customer had issues. Many of our insights on product usage were based on data from the customers who opted to talk to us, which was a minute fraction. With the cloud services I built, I had a real-time understanding of my customers. This helped us tune
our services and deliver more personalized experiences to our users. We were able to experiment with variations of features with our customers to better understand what helps more with their productivity. Since then, I have been working on various platforms and cloud services, and I realize how the value of the data, when amplified by the elasticity of the cloud, can help inform and transform businesses.

Why I Wrote This Book

I have engaged with hundreds of customers over the years across various industries—health care, consumer goods, retail, and manufacturing, to name a few—and I have helped them with their big data analytics needs on the cloud. I have also driven the migration of my organization’s on-premises analytics workload to the cloud for better cost management as well as to take advantage of emerging technologies in machine learning. Understandably, each of these customers comes to me with different motivations and problems. However, one common thread binds them all: the strong desire to get value out of their data. The same customers whom I was talking to about the fundamentals of big data analytics five years ago have now progressed to operating very mature implementations and running more of their business-critical workloads on the data lake.

As part of these conversations, there have been a few key questions that boil down to setting up, organizing, securing, and optimizing data lake implementations. In the ideal scenario, these considerations are baked into the data lake architecture design; in some unfortunate instances, we talk about these issues only when a customer hits a problem that forces a rearchitecture or redesign. The promise of the infinite possibilities of leveraging a cloud data lake comes with the flip side of understanding and handling the complexities involved in building and operationalizing a cloud data lake application.
I believe that while the industry works on simplifying this process over time, a foundational understanding of the concepts of a cloud data lake solution goes a long way toward building robust data lake architectures that stand the test of time. I have thoroughly enjoyed helping my customers, partners, and teams build this foundational understanding and watching them become completely empowered to drive transformational insights for their teams or organizations. In this book, I hope to condense all these conversations and the associated lessons learned to provide an approach for data practitioners that will help you design a scalable cloud data lake architecture that informs and transforms your business.
Who Should Read This Book?

This book is primarily targeted at data architects, data developers, and data ops professionals who want to get a broad understanding of the various aspects of setting up and operating their cloud data lake. By the end of this book, you will have an understanding of the following:
• The benefits of a cloud-based big data strategy for your organization
• Architecture and design choices, including the modern data warehouse, data lakehouse, and data mesh
• Guidance and best practices for designing performant and scalable data lakes
• Data governance principles, strategies, and design choices

Whether you are taking your first steps or looking at modernizing your data lake on the cloud, my hope is that you will be prepared to have an informed, educated design conversation with your cloud provider and your engineering teams, and you will be able to plan and budget for your engineering investments in terms of time, effort, and money.

Big data analytics is one of the areas where development, technologies, and paradigm shifts happen in the blink of an eye. To me, this illustrates the abundant opportunities that are now possible. I will keep the considerations independent of any specific technology, so when a new technology emerges, we will be able to apply these fundamentals in the context of all the available technology choices.

Introducing Klodars Corporation

In this book, we will apply the concepts of the cloud data lake to a fictitious organization, Klodars Corporation, to best illustrate them using a business problem that will resonate with most of us.

Klodars Corporation sells umbrellas and rain gear in Seattle, Washington (cliche much?). In addition to website sales, Klodars employs salespeople to reach out to retailers to sell its umbrellas in bulk in the Seattle area.
It has a small software development team that writes applications to manage inventory and sales, leveraging SQL Server as the operational database running on servers that are maintained in its offices. It also leverages Salesforce to manage its customer profiles and interactions.

Because of the quality of its rain gear and excellent sales channels, Klodars Corporation is rapidly expanding across the state of Washington as well as in the neighboring states of Oregon and Idaho. Its direct-to-consumer business is taking off through its website, and its marketing department is running excellent campaigns on social media. In addition, Klodars wants to expand its business to sell winter gear based
on customer demand. So it plans to acquire another business that sells winter gear. While this is amazing news for the business, it is at that inflection point where its database technology doesn’t quite scale to its increasing needs, and it is evaluating a move to the cloud.

Navigating the Book

While I recommend that you read this book end to end for a complete understanding, each chapter is self-contained, and you can focus on specific topics depending on what is top of mind for you. You can also come back to this book at any point to reference specific sections without having to read from the beginning.
• At the end of Chapter 1, you will get an overall understanding of what a cloud data lake is and its benefits. You will also understand that moving to the cloud involves thinking through design considerations and making an informed choice, as opposed to going with a lift-and-shift approach.
• In Chapter 2, I will go over the various cloud data lake architectures, and you will understand the value proposition of each architecture. At the end of this chapter, you will be able to build on the foundational understanding of Chapter 1 and know about the scenarios that these cloud architectures solve as well as get concrete examples of how an organization can leverage these architectures.
• Data is the new gold, oil, bacon…insert your favorite metaphor here. The key to a cloud data lake architecture is a robust design of your data layer, which lays the foundation of every scenario you build on it. Chapter 3 will get into the details of the foundational layer of your data lake and the various aspects of designing, organizing, and managing your data in the data lake. I strongly advise that you give this chapter a lot of attention to help you design your data lake not only to meet your immediate needs but also to scale as your business grows.
• In Chapter 4, I will talk about the various considerations for designing your data lake for scale.
I will also provide a set of best practices for you to consider as you are building your data estate and data pipelines. Chapters 5 and 6 will deep dive into two aspects: tuning your cloud data lake to meet the desired performance and data formats that serve as critical building blocks for performance.
• In Chapter 7, based on the learnings from the chapters before, I will introduce a decision framework that you can use to make the right choices for your data lake architecture. I will also provide a checklist you can use for an easy reference.
• Chapter 8 is a catchall section for questions that may not have been answered earlier in the book. As I mentioned before, the data lake community is growing and rapidly innovating as we learn more every day. You have an opportunity to influence these innovations and bring your own ideas to the table. In the
meantime, let us focus on progress, not perfection; there is ample value that comes out of just this progression.

In summary, after reading this book, you will understand the fundamentals of everything it takes to build a cloud data lake and will be able to apply this understanding in many ways, including the following:
• Use the design choices in the book to build out a data strategy that scales as the organizational and business needs grow
• Pitch to key decision makers how a lean data platform team can drive key business transformations using a robust data strategy
• Empower your organization to focus on the key business problems with a scalable data infrastructure
• Realize more value from data using advanced analytics offerings on the cloud

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.
This element indicates a warning or caution.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/the-cloud-data-lake-1e.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.
Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://www.youtube.com/oreillymedia
Acknowledgments

As the proverbial idiom goes, it takes a village to write a book, and I’m eternally grateful to the multitude of folks who helped in making this book a reality.

First and foremost, I would like to deeply thank my teams, customers, and partners at Microsoft during my tenure with Microsoft Office, Azure HDInsight, and Azure Data Lake Storage/Cosmos for building my understanding of the data space and trusting my instincts and approach as I impacted the transformational insights of various organizations with these offerings. The list of people here is so long that I could fill a book with all their names.

Thanks to Tomer Shiran from Dremio and the team at Monte Carlo—Barr Moses, Lior Gavish, and Molly Vorwerck—for your enlightening interviews on data lakehouse and data observability that turned into great sidebars for the book.

The team at O’Reilly has been amazing at helping me shape my thoughts and approach into this book. Jill Leonard and Andy Kwan—thanks for being there every step of the way, whether it was discussing the structuring of certain topics and the appropriate level of details or, many times, helping me with my multiple bouts of impostor syndrome.

A huge shoutout to the tech reviewers who took the time to read through the book and shared their very valuable insights and feedback: Shreya Pal, Andrei Ionescu, Alicia Moniz, Prasanna Sundararajan, Chidamber Kulkarni, Gordon Wong, Gareth Eager, Vinoth Chandar, and Vini Jaiswal. You really helped me understand a reader’s journey, which offered valuable lessons for life.

Finally, no words will ever express the gratitude I feel toward my family—my awesome husband, Sriram Govindarajan, and my amazing kids, Anish Bharadwaj and Dhanya Bharadwaj, for being the constant source of inspiration and support for me, not just for this book but for life.
Thanks to Janaki Gopalan and Gopalan Krishnamachari, my mom and dad, who are not with me anymore physically but stay with me forever in the core values they have instilled in me around hard work, accountability, and unconditional giving.
CHAPTER 1
Big Data—Beyond the Buzz

Without big data, you are blind and deaf and in the middle of a freeway.
—Geoffrey Moore

If we were playing workplace bingo, there is a big chance you would win by crossing off all these terms that you have heard in your organization in the past three months: digital transformation, data strategy, transformational insights, data lake, warehouse, data science, machine learning, and intelligence. It is now common knowledge that data is a key ingredient for organizations to succeed, and organizations that rely on data and AI clearly outperform their competitors. According to an IDC study sponsored by Seagate, the amount of data that is captured, collected, or replicated is expected to grow to 175 zettabytes (ZB) by the year 2025. This data that is captured, collected, or replicated is referred to as the Global DataSphere. This data comes from three classes of sources:

The core
Traditional or cloud-based datacenters

The edge
Hardened infrastructure, such as cell towers

The endpoints
PCs, tablets, smartphones, and Internet of Things (IoT) devices

This study also predicts that 49% of this Global DataSphere will be residing in public cloud environments by the year 2025.

If you have ever wondered, “Why does this data need to be stored? What is it good for?” the answer is very simple. Think of all of this data as pieces of words strewn around the globe in different languages, each sharing a sliver of information, like pieces of a puzzle. Stitching them together in a meaningful fashion tells a story that
not only informs but also could transform businesses, people, and even the way the world runs. Most successful organizations already leverage data to understand the growth drivers for their businesses and perceived customer experiences and to take the right action; looking at “the funnel,” or customer acquisition, adoption, engagement, and retention, is now largely the lingua franca of funding product investments. These types of data processing and analysis are referred to as business intelligence, or BI, and classified as “offline insights.” Essentially, the data and the insights are crucial in presenting the trend that shows growth so business leaders can take action; however, this workstream is separate from the core business logic used to run the business itself. As the maturity of the data platform grows, an inevitable signal we get from all customers is that they start getting more requests to run more scenarios on their data lakes, truly adhering to the idiom “Data is the new oil.”

Organizations leverage data to understand the growth drivers for their businesses and perceived customer experience. They can then use data to set targets and drive improvements in customer experience with better support and newer features. They can also create better marketing strategies to grow their businesses and drive efficiencies to lower their costs of building their products and organizations. Starbucks, the coffee shop that is present around the globe, uses data in every place possible to continuously measure and improve its business. As explained in this YouTube video, Starbucks uses the data from its mobile applications and correlates the data with its ordering system to better understand customer usage patterns and send targeted marketing campaigns. It uses sensors on its coffee machines that emit health data every few seconds, and this data is analyzed to drive improvements into its predictive maintenance.
It also uses these connected coffee machines to download recipes to them without human intervention.

As the world is just learning to cope with the COVID-19 pandemic, organizations are leveraging data heavily not only to transform their businesses but also to measure the health and productivity of their organizations to help their employees feel connected and minimize burnout. Overall, data is also used for world-saving initiatives such as Project Zamba, which uses AI for wildlife research and conservation in the remote jungles of Africa, and for initiatives that leverage IoT and data science to create a circular economy that promotes environmental sustainability.

What Is Big Data?

All of the examples given previously share a few things in common:
• These scenarios illustrate that data can be explored and consumed in a variety of ways, and when the data is generated, there is not really a clear idea of the consumption patterns. This is different from traditional online transaction processing (OLTP) and online analytical processing (OLAP) systems, where the data is specifically designed and curated to solve specific business problems.