
The Self-Service Data Roadmap: Democratize Data and Reduce Time to Insight

Author: Sandeep Uttamchandani

Science

Data-driven insights are a key competitive advantage for any industry today, but deriving insights from raw data can still take days or weeks. Most organizations can’t scale data science teams fast enough to keep up with the growing amounts of data to transform. What’s the answer? Self-service data.

With this practical book, data engineers, data scientists, and team managers will learn how to build a self-service data science platform that helps anyone in your organization extract insights from data. Sandeep Uttamchandani provides a scorecard to track and address bottlenecks that slow down time to insight across data discovery, transformation, processing, and production. This book bridges the gap between data scientists bottlenecked by engineering realities and data engineers unclear about ways to make self-service work.

Build a self-service portal to support data discovery, quality, lineage, and governance
Select the best approach for each self-service capability using open source cloud technologies
Tailor self-service for the people, processes, and technology maturity of your data platform
Implement capabilities to democratize data and reduce time to insight
Scale your self-service portal to support a large number of users within your organization



📄 Page 1
Dr. Sandeep Uttamchandani The Self-Service Data Roadmap Democratize Data and Reduce Time to Insight
📄 Page 3
Dr. Sandeep Uttamchandani The Self-Service Data Roadmap Democratize Data and Reduce Time to Insight Boston Farnham Sebastopol Tokyo Beijing
📄 Page 4
978-1-492-07525-7 [LSI]

The Self-Service Data Roadmap
by Sandeep Uttamchandani

Copyright © 2020 Sandeep Uttamchandani. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Developmental Editor: Corbin Collins
Production Editor: Beth Kelly
Copyeditor: Holly Bauer Forsyth
Proofreader: Piper Editorial, LLC
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: O’Reilly Media, Inc.

September 2020: First Edition

Revision History for the First Edition
2020-09-10: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492075257 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Self-Service Data Roadmap, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
📄 Page 5
For my parents; my teacher and mentor, Gul; my wife, Anshul; and my kids, Sohum and Mihika.
📄 Page 7
Table of Contents

Preface

1. Introduction
  Journey Map from Raw Data to Insights
    Discover
    Prep
    Build
    Operationalize
  Defining Your Time-to-Insight Scorecard
  Build Your Self-Service Data Roadmap

Part I. Self-Service Data Discovery

2. Metadata Catalog Service
  Journey Map
    Understanding Datasets
    Analyzing Datasets
    Knowledge Scaling
  Minimizing Time to Interpret
    Extracting Technical Metadata
    Extracting Operational Metadata
    Gathering Team Knowledge
  Defining Requirements
    Technical Metadata Extractor Requirements
    Operational Metadata Requirements
    Team Knowledge Aggregator Requirements
  Implementation Patterns
📄 Page 8
    Source-Specific Connectors Pattern
    Lineage Correlation Pattern
    Team Knowledge Pattern
  Summary

3. Search Service
  Journey Map
    Determining Feasibility of the Business Problem
    Selecting Relevant Datasets for Data Prep
    Reusing Existing Artifacts for Prototyping
  Minimizing Time to Find
    Indexing Datasets and Artifacts
    Ranking Results
    Access Control
  Defining Requirements
    Indexer Requirements
    Ranking Requirements
    Access Control Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Push-Pull Indexer Pattern
    Hybrid Search Ranking Pattern
    Catalog Access Control Pattern
  Summary

4. Feature Store Service
  Journey Map
    Finding Available Features
    Training Set Generation
    Feature Pipeline for Online Inference
  Minimize Time to Featurize
    Feature Computation
    Feature Serving
  Defining Requirements
    Feature Computation
    Feature Serving
    Nonfunctional Requirements
  Implementation Patterns
    Hybrid Feature Computation Pattern
    Feature Registry Pattern
  Summary
📄 Page 9
5. Data Movement Service
  Journey Map
    Aggregating Data Across Sources
    Moving Raw Data to Specialized Query Engines
    Moving Processed Data to Serving Stores
    Exploratory Analysis Across Sources
  Minimizing Time to Data Availability
    Data Ingestion Configuration and Change Management
    Compliance
    Data Quality Verification
  Defining Requirements
    Ingestion Requirements
    Transformation Requirements
    Compliance Requirements
    Verification Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Batch Ingestion Pattern
    Change Data Capture Ingestion Pattern
    Event Aggregation Pattern
  Summary

6. Clickstream Tracking Service
  Journey Map
  Minimizing Time to Click Metrics
    Managing Instrumentation
    Event Enrichment
    Building Insights
  Defining Requirements
    Instrumentation Requirements Checklist
    Enrichment Requirements Checklist
  Implementation Patterns
    Instrumentation Pattern
    Rule-Based Enrichment Patterns
    Consumption Patterns
  Summary

Part II. Self-Service Data Prep

7. Data Lake Management Service
  Journey Map
📄 Page 10
    Primitive Life Cycle Management
    Managing Data Updates
    Managing Batching and Streaming Data Flows
  Minimizing Time to Data Lake Management
  Requirements
  Implementation Patterns
    Data Life Cycle Primitives Pattern
    Transactional Pattern
    Advanced Data Management Pattern
  Summary

8. Data Wrangling Service
  Journey Map
  Minimizing Time to Wrangle
    Defining Requirements
    Curating Data
    Operational Monitoring
  Defining Requirements
  Implementation Patterns
    Exploratory Data Analysis Patterns
    Analytical Transformation Patterns
  Summary

9. Data Rights Governance Service
  Journey Map
    Executing Data Rights Requests
    Discovery of Datasets
    Model Retraining
  Minimizing Time to Comply
    Tracking the Customer Data Life Cycle
    Executing Customer Data Rights Requests
    Limiting Data Access
  Defining Requirements
    Current Pain Point Questionnaire
    Interop Checklist
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Sensitive Data Discovery and Classification Pattern
    Data Lake Deletion Pattern
    Use Case–Dependent Access Control
  Summary
📄 Page 11
Part III. Self-Service Build

10. Data Virtualization Service
  Journey Map
    Exploring Data Sources
    Picking a Processing Cluster
  Minimizing Time to Query
    Picking the Execution Environment
    Formulating Polyglot Queries
    Joining Data Across Silos
  Defining Requirements
    Current Pain Point Analysis
    Operational Requirements
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Automatic Query Routing Pattern
    Unified Query Pattern
    Federated Query Pattern
  Summary

11. Data Transformation Service
  Journey Map
    Production Dashboard and ML Pipelines
    Data-Driven Storytelling
  Minimizing Time to Transform
    Transformation Implementation
    Transformation Execution
    Transformation Operations
  Defining Requirements
    Current State Questionnaire
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Implementation Pattern
    Execution Patterns
  Summary

12. Model Training Service
  Journey Map
    Model Prototyping
    Continuous Training
📄 Page 12
    Model Debugging
  Minimizing Time to Train
    Training Orchestration
    Tuning
    Continuous Training
  Defining Requirements
    Training Orchestration
    Tuning
    Continuous Training
    Nonfunctional Requirements
  Implementation Patterns
    Distributed Training Orchestrator Pattern
    Automated Tuning Pattern
    Data-Aware Continuous Training
  Summary

13. Continuous Integration Service
  Journey Map
    Collaborating on an ML Pipeline
    Integrating ETL Changes
    Validating Schema Changes
  Minimizing Time to Integrate
    Experiment Tracking
    Reproducible Deployment
    Testing Validation
  Defining Requirements
    Experiment Tracking Module
    Pipeline Packaging Module
    Testing Automation Module
  Implementation Patterns
    Programmable Tracking Pattern
    Reproducible Project Pattern
  Summary

14. A/B Testing Service
  Journey Map
  Minimizing Time to A/B Test
    Experiment Design
    Execution at Scale
    Experiment Optimization
  Implementation Patterns
    Experiment Specification Pattern
📄 Page 13
    Metrics Definition Pattern
    Automated Experiment Optimization
  Summary

Part IV. Self-Service Operationalize

15. Query Optimization Service
  Journey Map
    Avoiding Cluster Clogs
    Resolving Runtime Query Issues
    Speeding Up Applications
  Minimizing Time to Optimize
    Aggregating Statistics
    Analyzing Statistics
    Optimizing Jobs
  Defining Requirements
    Current Pain Points Questionnaire
    Interop Requirements
    Functionality Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Avoidance Pattern
    Operational Insights Pattern
    Automated Tuning Pattern
  Summary

16. Pipeline Orchestration Service
  Journey Map
    Invoke Exploratory Pipelines
    Run SLA-Bound Pipelines
  Minimizing Time to Orchestrate
    Defining Job Dependencies
    Distributed Execution
    Production Monitoring
  Defining Requirements
    Current Pain Points Questionnaire
    Operational Requirements
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Dependency Authoring Patterns
📄 Page 14
    Orchestration Observability Patterns
    Distributed Execution Pattern
  Summary

17. Model Deploy Service
  Journey Map
    Model Deployment in Production
    Model Maintenance and Upgrade
  Minimizing Time to Deploy
    Deployment Orchestration
    Performance Scaling
    Drift Monitoring
  Defining Requirements
    Orchestration
    Model Scaling and Performance
    Drift Verification
    Nonfunctional Requirements
  Implementation Patterns
    Universal Deployment Pattern
    Autoscaling Deployment Pattern
    Model Drift Tracking Pattern
  Summary

18. Quality Observability Service
  Journey Map
    Daily Data Quality Monitoring Reports
    Debugging Quality Issues
    Handling Low-Quality Data Records
  Minimizing Time to Insight Quality
    Verify the Accuracy of the Data
    Detect Quality Anomalies
    Prevent Data Quality Issues
  Defining Requirements
    Detection and Handling Data Quality Issues
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Accuracy Models Pattern
    Profiling-Based Anomaly Detection Pattern
    Avoidance Pattern
  Summary
📄 Page 15
19. Cost Management Service
  Journey Map
    Monitoring Cost Usage
    Continuous Cost Optimization
  Minimizing Time to Optimize Cost
    Expenditure Observability
    Matching Supply and Demand
    Continuous Cost Optimization
  Defining Requirements
    Pain Points Questionnaire
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Continuous Cost Monitoring Pattern
    Automated Scaling Pattern
    Cost Advisor Pattern
  Summary

Index
📄 Page 17
Preface

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
  Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
  Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
  Shows commands or other text that should be typed literally by the user.

Constant width italic
  Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.
📄 Page 18
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/ssdr-book.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “The Self-Service Data Roadmap by Sandeep Uttamchandani (O’Reilly). Copyright 2020 Sandeep Uttamchandani, 978-1-492-07525-7.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
📄 Page 19
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/ssdr.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly.
Follow us on Twitter: http://twitter.com/oreillymedia.
Watch us on YouTube: http://youtube.com/oreillymedia.