Uploader

高宏飞

Shared on 2026-01-07

Author: James Serra

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they're also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of each architecture to help data professionals understand its pros and cons. In the process, James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You'll learn what data lakehouses can help you achieve, and how to distinguish data mesh hype from reality. Best of all, you'll be able to determine the most appropriate data architecture for your needs.

By reading this book, you'll:
• Gain a working understanding of several data architectures
• Know the pros and cons of each approach
• Distinguish data architecture theory from the reality
• Learn to pick the best architecture for your use case
• Understand the differences between data warehouses and data lakes
• Learn common data architecture concepts to help you build better solutions
• Alleviate confusion by clearly defining each data architecture
• Know what architectures to use for each cloud provider

ISBN: 1098150767
Publisher: O'Reilly Media
Publish Year: 2024
Language: English
Pages: 278
File Format: PDF
File Size: 6.7 MB
Text Preview (First 20 pages)

James Serra
Deciphering Data Architectures
Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh
DATA

“His ability to transform complex technical concepts into clear, easy-to-grasp explanations is truly remarkable.”
—Annie Xu, Senior Data Customer Engineer, Google

“Put it on your desk—you’ll reference it often.”
—Sawyer Nyquist, Owner, Writer, and Consultant, The Data Shop

Deciphering Data Architectures

Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they’re also surrounded by a lot of hyperbole and confusion. This practical book provides a guided tour of these architectures to help data professionals understand the pros and cons of each.

James Serra, big data and data warehousing solution architect at Microsoft, examines common data architecture concepts, including how data warehouses have had to evolve to work with data lake features. You’ll learn what data lakehouses can help you achieve, as well as how to distinguish data mesh hype from reality. Best of all, you’ll be able to determine the most appropriate data architecture for your needs.

With this book, you’ll:
• Gain a working understanding of several data architectures
• Learn the strengths and weaknesses of each approach
• Distinguish data architecture theory from reality
• Pick the best architecture for your use case
• Understand the differences between data warehouses and data lakes
• Learn common data architecture concepts to help you build better solutions
• Explore the historical evolution and characteristics of data architectures
• Learn essentials of running an architecture design session, team organization, and project success factors

James Serra has worked at Microsoft as a big data and data warehousing solution architect over the past nine years. He’s become a thought leader in the use and application of big data and advanced analytics, including data architectures such as the modern data warehouse, data lakehouse, data fabric, and data mesh.

Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

US $79.99 CAN $99.99
ISBN: 978-1-098-15076-1
Praise for Deciphering Data Architectures

In Deciphering Data Architectures, James Serra does a wonderful job explaining the evolution of leading data architectures and the trade-offs between them. This book should be required reading for current and aspiring data architects.
—Bill Anton, Data Geek, Opifex Solutions

James has condensed over 30 years of data architecture knowledge and wisdom into this comprehensive and very readable book. For those who must do the hard work of delivering analytics rather than singing its praises, this is a must-read.
—Dr. Barry Devlin, Founder and Principal, 9sight Consulting

This reference should be on every data architect’s bookshelf. With clear and insightful descriptions of the current and planned technologies, readers will gain a good sense of how to steer their companies to meet the challenges of the emerging data landscape. This is an invaluable reference for new starters and veteran data architects alike.
—Mike Fung, Master Principal Cloud Solution Architect, Oracle

Marketing buzz and industry thought-leader chatter have sown much confusion about data architecture patterns. With his depth of experience and skill as a communicator, James Serra cuts through the noise and provides clarity on both long-established data architecture patterns and cutting-edge industry methods that will aid data practitioners and data leaders alike. Put it on your desk—you’ll reference it often.
—Sawyer Nyquist, Owner, Writer, and Consultant, The Data Shop
The world of data architectures is complex and full of noise. This book provides a fresh, practical perspective born of decades of experience. Whether you’re a beginner or an expert, everyone with an interest in data must read this book!
—Piethein Strengholt, author of Data Management at Scale

An educational gem! Deciphering Data Architectures strikes a perfect balance between simplicity and depth, ensuring that technology professionals at all levels can grasp key data concepts and understand the essential trade-off decisions that really matter when planning a data journey.
—Ben Reyes, Cofounder and Managing Partner, ZetaMinusOne LLC

I recommend Deciphering Data Architectures as a resource that provides the knowledge to understand and navigate the available options when developing a data architecture.
—Mike Shelton, Cloud Solution Architect, Microsoft

Data management is critical to the success of every business. Deciphering Data Architectures breaks down the buzzwords into simple and understandable concepts and practical solutions to help you get to the right architecture for your dataset.
—Matt Usher, Director, Pure Storage

As a consultant and community leader, I often direct people to James Serra’s blog for up-to-date and in-depth coverage of modern data architectures. This book is a great collection, condensing Serra’s wealth of vendor-neutral knowledge. My favorite is Part III, where James discusses the pros and cons of each architecture design. I believe this book will immensely benefit any organization that plans to modernize its data estate.
—Teo Lachev, Consultant, Prologika

James’s blog has been my go-to resource for demystifying architectural concepts, understanding technical terminology, and navigating the life of a solution architect or data engineer. His ability to transform complex technical concepts into clear, easy-to-grasp explanations is truly remarkable. This book is an invaluable collection of his work, serving as a comprehensive reference guide for designing and comprehending architectures.
—Annie Xu, Senior Data Customer Engineer, Google
James’s superpower has always been taking complex subjects and explaining them in a simple way. In this book, he hits all the key points to help you choose the right data architecture and avoid common (and costly!) mistakes.
—Rod Colledge, Senior Technical Specialist (Data and AI), Microsoft

This book represents a great milestone in the evolution of how we handle data in the technology industry, and how we have handled it over several decades, or what is easily the equivalent of a career for most. The content offers great insights for the next generation of data professionals in terms of what they need to think about when designing future solutions. ‘Deciphering’ is certainly an excellent choice of wording for this, as deciphering is exactly what is needed when turning requirements into data products.
—Paul Andrew, CTO, Cloud Formations Consulting

A fantastic guide for data architects, this book is packed with experience and insights. Its comprehensive coverage of evolving trends and diverse approaches makes it an essential reference for anyone looking to broaden their understanding of the field.
—Simon Whiteley, CTO, Advancing Analytics Limited

There is no one whose knowledge of data architectures and data processes I trust more than James Serra. This book not only provides a comprehensive and clear description of key architectural principles, approaches, and pitfalls, it also addresses the all-important people, cultural, and organizational issues that too often imperil data projects before they get going. This book is destined to become an industry primer studied by college students and business professionals alike who encounter data for the first time (and maybe the second and third time as well)!
—Wayne Eckerson, President of Eckerson Group

Deciphering Data Architectures is an indispensable vendor-neutral guide for today’s data professionals. It insightfully compares historical and modern architectures, emphasizing key trade-offs and decision-making nuances in choosing an appropriate architecture for the evolving data-driven landscape.
—Stacia Varga, author and Data Analytics Consultant, Data Inspirations
Deep, practitioner wisdom within, the latest scenarios in the market today have vendor specific skew, latest terminology, and sales options. James takes his many years of expertise to give agnostic, cross cloud, vendor, vertical approaches from small to large.
—Jordan Martz, Senior Sales Engineer, Fivetran

Data Lake, Data Lakehouse, Data Fabric, Data Mesh … It isn’t easy sorting the nuggets from the noise. James Serra’s knowledge and experience is a great resource for everyone with data architecture responsibilities.
—Dave Wells, Industry Analyst, eLearningcurve

Too often books are “how-to” with no background or logic – this book solves that. With a comprehensive view of why data is arranged in a certain way, you’ll learn more about the right way to implement the “how.”
—Buck Woody, Principal Data Scientist, Microsoft

Deciphering Data Architectures is not only thorough and detailed, but it also provides a critical perspective on what works, and perhaps more importantly, what may not work well. Whether discussing older data approaches or newer ones such as Data Mesh, the book offers words of wisdom and lessons learned that will help any data practitioner accelerate their data journey.
—Eric Broda, Entrepreneur, Data Consultant, author of Implementing Data Mesh (O’Reilly)

No other book I know explains so comprehensively about data lake, warehouse, mesh, fabric and lakehouse! It is a must have book for all data architects and engineers.
—Vincent Rainardi, Data Architect and author
James Serra
Deciphering Data Architectures
Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh
Boston Farnham Sebastopol Tokyo Beijing
Deciphering Data Architectures
by James Serra

Copyright © 2024 James Serra. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Sarah Grey
Production Editor: Katherine Tozer
Copyeditor: Paula L. Fleming
Proofreader: Tove Innis
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

February 2024: First Edition

Revision History for the First Release
2024-02-06: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098150761 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Deciphering Data Architectures, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-15076-1
[LSI]
To the loving memory of my grandparents—Dolly, Bill, Martha, and Bert
Table of Contents

Foreword
Preface

Part I. Foundation

1. Big Data
    What Is Big Data, and How Can It Help You?
    Data Maturity
    Stage 1: Reactive
    Stage 2: Informative
    Stage 3: Predictive
    Stage 4: Transformative
    Self-Service Business Intelligence
    Summary

2. Types of Data Architectures
    Evolution of Data Architectures
    Relational Data Warehouse
    Data Lake
    Modern Data Warehouse
    Data Fabric
    Data Lakehouse
    Data Mesh
    Summary

3. The Architecture Design Session
    What Is an ADS?
    Why Hold an ADS?
    Before the ADS
    Preparing
    Inviting Participants
    Conducting the ADS
    Introductions
    Discovery
    Whiteboarding
    After the ADS
    Tips for Conducting an ADS
    Summary

Part II. Common Data Architecture Concepts

4. The Relational Data Warehouse
    What Is a Relational Data Warehouse?
    What a Data Warehouse Is Not
    The Top-Down Approach
    Why Use a Relational Data Warehouse?
    Drawbacks to Using a Relational Data Warehouse
    Populating a Data Warehouse
    How Often to Extract the Data
    Extraction Methods
    How to Determine What Data Has Changed Since the Last Extraction
    The Death of the Relational Data Warehouse Has Been Greatly Exaggerated
    Summary

5. Data Lake
    What Is a Data Lake?
    Why Use a Data Lake?
    Bottom-Up Approach
    Best Practices for Data Lake Design
    Multiple Data Lakes
    Advantages
    Disadvantages
    Summary

6. Data Storage Solutions and Processes
    Data Storage Solutions
    Data Marts
    Operational Data Stores
    Data Hubs
    Data Processes
    Master Data Management
    Data Virtualization and Data Federation
    Data Catalogs
    Data Marketplaces
    Summary

7. Approaches to Design
    Online Transaction Processing Versus Online Analytical Processing
    Operational and Analytical Data
    Symmetric Multiprocessing and Massively Parallel Processing
    Lambda Architecture
    Kappa Architecture
    Polyglot Persistence and Polyglot Data Stores
    Summary

8. Approaches to Data Modeling
    Relational Modeling
    Keys
    Entity–Relationship Diagrams
    Normalization Rules and Forms
    Tracking Changes
    Dimensional Modeling
    Facts, Dimensions, and Keys
    Tracking Changes
    Denormalization
    Common Data Model
    Data Vault
    The Kimball and Inmon Data Warehousing Methodologies
    Inmon’s Top-Down Methodology
    Kimball’s Bottom-Up Methodology
    Choosing a Methodology
    Hybrid Models
    Methodology Myths
    Summary

9. Approaches to Data Ingestion
    ETL Versus ELT
    Reverse ETL
    Batch Processing Versus Real-Time Processing
    Batch Processing Pros and Cons
    Real-Time Processing Pros and Cons
    Data Governance
    Summary

Part III. Data Architectures

10. The Modern Data Warehouse
    The MDW Architecture
    Pros and Cons of the MDW Architecture
    Combining the RDW and Data Lake
    Data Lake
    Relational Data Warehouse
    Stepping Stones to the MDW
    EDW Augmentation
    Temporary Data Lake Plus EDW
    All-in-One
    Case Study: Wilson & Gunkerk’s Strategic Shift to an MDW
    Challenge
    Solution
    Outcome
    Summary

11. Data Fabric
    The Data Fabric Architecture
    Data Access Policies
    Metadata Catalog
    Master Data Management
    Data Virtualization
    Real-Time Processing
    APIs
    Services
    Products
    Why Transition from an MDW to a Data Fabric Architecture?
    Potential Drawbacks
    Summary

12. Data Lakehouse
    Delta Lake Features
    Performance Improvements
    The Data Lakehouse Architecture
    What If You Skip the Relational Data Warehouse?
    Relational Serving Layer
    Summary

13. Data Mesh Foundation
    A Decentralized Data Architecture
    Data Mesh Hype
    Dehghani’s Four Principles of Data Mesh
    Principle #1: Domain Ownership
    Principle #2: Data as a Product
    Principle #3: Self-Serve Data Infrastructure as a Platform
    Principle #4: Federated Computational Governance
    The “Pure” Data Mesh
    Data Domains
    Data Mesh Logical Architecture
    Different Topologies
    Data Mesh Versus Data Fabric
    Use Cases
    Summary

14. Should You Adopt Data Mesh? Myths, Concerns, and the Future
    Myths
    Myth: Using Data Mesh Is a Silver Bullet That Solves All Data Challenges Quickly
    Myth: A Data Mesh Will Replace Your Data Lake and Data Warehouse
    Myth: Data Warehouse Projects Are All Failing, and a Data Mesh Will Solve That Problem
    Myth: Building a Data Mesh Means Decentralizing Absolutely Everything
    Myth: You Can Use Data Virtualization to Create a Data Mesh
    Concerns
    Philosophical and Conceptual Matters
    Combining Data in a Decentralized Environment
    Other Issues of Decentralization
    Complexity
    Duplication
    Feasibility
    People
    Domain-Level Barriers
    Organizational Assessment: Should You Adopt a Data Mesh?
    Recommendations for Implementing a Successful Data Mesh
    The Future of Data Mesh
    Zooming Out: Understanding Data Architectures and Their Applications
    Summary

Part IV. People, Processes, and Technology

15. People and Processes
    Team Organization: Roles and Responsibilities
    Roles for MDW, Data Fabric, or Data Lakehouse
    Roles for Data Mesh
    Why Projects Fail: Pitfalls and Prevention
    Pitfall: Allowing Executives to Think That BI Is “Easy”
    Pitfall: Using the Wrong Technologies
    Pitfall: Gathering Too Many Business Requirements
    Pitfall: Gathering Too Few Business Requirements
    Pitfall: Presenting Reports Without Validating Their Contents First
    Pitfall: Hiring an Inexperienced Consulting Company
    Pitfall: Hiring a Consulting Company That Outsources Development to Offshore Workers
    Pitfall: Passing Project Ownership Off to Consultants
    Pitfall: Neglecting the Need to Transfer Knowledge Back into the Organization
    Pitfall: Slashing the Budget Midway Through the Project
    Pitfall: Starting with an End Date and Working Backward
    Pitfall: Structuring the Data Warehouse to Reflect the Source Data Rather Than the Business’s Needs
    Pitfall: Presenting End Users with a Solution with Slow Response Times or Other Performance Issues
    Pitfall: Overdesigning (or Underdesigning) Your Data Architecture
    Pitfall: Poor Communication Between IT and the Business Domains
    Tips for Success
    Don’t Skimp on Your Investment
    Involve Users, Show Them Results, and Get Them Excited
    Add Value to New Reports and Dashboards
    Ask End Users to Build a Prototype
    Find a Project Champion/Sponsor
    Make a Project Plan That Aims for 80% Efficiency
    Summary

16. Technologies
    Choosing a Platform
    Open Source Solutions
    On-Premises Solutions
    Cloud Provider Solutions
    Cloud Service Models
    Major Cloud Providers
    Multi-Cloud Solutions
    Software Frameworks
    Hadoop
    Databricks
    Snowflake
    Summary

Index
Foreword

Never in the history of the modern, technology-enabled enterprise has the data landscape evolved so quickly. As the pace of change continues to accelerate, the data ecosystem becomes far more complex—especially the value chains that connect suppliers and customers to organizations. Data seems to flow everywhere. It has become one of any organization’s most strategic assets, fundamentally underpinning digital transformation, automation, artificial intelligence, innovation, and more.

This increasing tempo of change intensifies the importance of optimizing your organization’s data architecture to ensure ongoing adaptability, interoperability and maintainability. In this book, James Serra lays out a clear set of choices for the data architect, whether you are seeking to create a resilient design or simply reduce technical debt.

It seems every knowledge worker has a story of a meeting with data engineers and architects that felt like a Dilbert cartoon, where nobody seemed to speak the same language and the decisions were too complicated and murky. By defining concepts, addressing concerns, dispelling myths, and proposing workarounds for pitfalls, James gives the reader a working knowledge of data architectures and the confidence to make informed decisions. As a leader, I have been pleased with how well the book’s contents help to strengthen my data teams’ alignment by providing them with a common vocabulary and references.

When I met James around 15 years ago, he was considering expanding his expertise beyond database administration to business intelligence and analytics. His insatiable desire to learn new things and share what he was learning to serve the greater good left a strong impression on me. It still drives him today. The countless blog posts, presentations, and speaking engagements through which he shares the depth of his experience now culminate in this expansive resource. This book will benefit all of us who handle data as we navigate an uncertain future.

Understanding the core principles of data architecture allows you to ride the waves of change as new data platforms, technology providers, and innovations emerge. Thankfully, James has created a foundational resource for all of us. This book will help you to see the bigger picture and design a bright future where your data creates the competitive advantage that you seek.

— Sean McCall
Chief Data Officer, Oceaneering International
Houston, December 2023