MANNING
Mona Khalil
Foreword by Barry McCardel
Hard and soft skills to accelerate your career
Effective Data Analysis
HARD AND SOFT SKILLS TO ACCELERATE YOUR CAREER
MONA KHALIL
FOREWORD BY BARRY MCCARDEL
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2025 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Elesha Hyde
Technical editor: Ryan Folks
Review editor: Radmila Ercegovac
Production editor: Andy Marinkovich
Copy editor: Andy Carroll
Proofreader: Mike Beady
Technical proofreader: Andrew Freed
Typesetter and cover designer: Marija Tudor

ISBN 9781633438415
Printed in the United States of America
To my mother, who encouraged and inspired all of my ambitions. Thank you for listening. Andrea Khalil, 1958–2023
brief contents

PART 1 ASKING QUESTIONS 1
1 ■ What does an analyst do? 3
2 ■ From question to deliverable 19
3 ■ Testing and evaluating hypotheses 47

PART 2 MEASUREMENT 77
4 ■ Statistics you (probably) learned: T-tests, ANOVAs, and correlations 79
5 ■ Statistics you (probably) missed: Non-parametrics and interpretation 112
6 ■ Are you measuring what you think you’re measuring? 145
7 ■ The art of metrics: Tracking performance for organizational success 177

PART 3 THE ANALYST’S TOOLBOX 215
8 ■ Navigating sensitive and protected data 217
9 ■ The world of statistical modeling 249
10 ■ Incorporating external data into analyses 294
11 ■ The magic of well-structured data 326
12 ■ Tools and tech for modern data analytics 360
contents

foreword x
preface xii
acknowledgments xiv
about this book xvi
about the author xix
about the cover illustration xx

PART 1 ASKING QUESTIONS 1

1 What does an analyst do? 3
1.1 What is analytics? 4
  Business intelligence 4 ■ Marketing analytics 6 ■ Financial analytics 8 ■ Product analytics 10 ■ How distinct are these fields? 10
1.2 The data analyst’s toolkit 11
  Spreadsheet tools 12 ■ Querying language 12 ■ Programming languages for statistics 14 ■ Data visualization tools 15 ■ Adding to your toolkit 16
1.3 Preparing for your role 16
  What to expect as an analyst 16 ■ What you will learn in this book 17
2 From question to deliverable 19
2.1 The lifecycle of an analytics project 20
  Questions and hypotheses 20 ■ Data sources 22 ■ Measures and methods of analysis 27 ■ Interpreting results 29 ■ Exercises 30
2.2 Communicating with stakeholders 30
  Guiding the interpretation of results 32 ■ Results that don’t support hypotheses 35 ■ Exercises 39
2.3 Reproducibility 39
  What reproducibility is 39 ■ Documenting work 40 ■ Exercises 45

3 Testing and evaluating hypotheses 47
3.1 Informing a hypothesis 48
  Collecting background information 49 ■ Constructing your hypothesis 53 ■ Quantifying your hypothesis 54 ■ Exercises 57
3.2 Methods of gathering evidence 57
  Descriptions 57 ■ Correlations 62 ■ Experiments 66 ■ Types of study design 69 ■ Exercises 73
3.3 Types of research programs 73
  Basic and applied research 73 ■ A/B testing 74 ■ Program evaluation 75

PART 2 MEASUREMENT 77

4 Statistics you (probably) learned: T-tests, ANOVAs, and correlations 79
4.1 The logic of summary statistics 80
  Summarizing properties of your data 81 ■ Recap 91 ■ Exercises 92
4.2 Making inferences: Group comparisons 93
  Parametric tests 94 ■ Exercises 105
4.3 Making inferences: Correlation and regression 106
  Correlation coefficients 106 ■ Regression modeling 108 ■ Reporting on correlations and regressions 109 ■ Exercises 110
5 Statistics you (probably) missed: Non-parametrics and interpretation 112
5.1 The landscape of statistics 113
  The evolution of statistical methods 114 ■ Choosing your approaches responsibly 116
5.2 Non-parametric statistics 117
  Comparisons between groups on continuous or ordinal data 118 ■ Exercises 129 ■ Comparing categorical data 130 ■ Recap 135
5.3 Responsible interpretation 136
  Statistical errors 136 ■ P-hacking 142

6 Are you measuring what you think you’re measuring? 145
6.1 A theory of measurement 146
  Translating ideas into concepts 147 ■ Understanding common measurement tools 149
6.2 Choosing a data collection method 151
  Types of measures 151 ■ Constructing self-report measures 156 ■ Interpreting available data 163 ■ Exercises 167
6.3 Reliability and validity 167
  Reliability 168 ■ Validity 173 ■ Exercises 175

7 The art of metrics: Tracking performance for organizational success 177
7.1 The role of metrics in decision-making 178
  Tracking performance 180 ■ Informing organizational strategy 181 ■ Promoting accountability 183 ■ Exercises 185
7.2 The key principles of metric design 185
  Using the SMART framework 186 ■ Establishing baselines and targets 191 ■ Exercises 200
7.3 Avoiding metric pitfalls 201
  Representation 201 ■ Visualization 207 ■ Exercises 212

PART 3 THE ANALYST’S TOOLBOX 215

8 Navigating sensitive and protected data 217
8.1 Consent in research 218
  A brief history lesson 219 ■ Informed consent 221 ■ Exercises 225
8.2 The current legal landscape 226
  Data protection regulations 226 ■ Bias in automated systems 230 ■ Exercises 235
8.3 Analyzing sensitive data 236
  Data minimization 236 ■ Anonymizing and pseudonymizing data 238 ■ Preventing deanonymization 244 ■ Exercises 247

9 The world of statistical modeling 249
9.1 The many faces of statistical modeling 250
  Classes of statistical models 252 ■ Model output and diagnostics 260 ■ Exercises 267
9.2 The modeling process 267
  Exploratory analysis 269 ■ Fitting a model 276 ■ Exercises 282
9.3 The statistical model and its value 282
  Explanatory models 283 ■ Predictive models 286 ■ Exercises 290

10 Incorporating external data into analyses 294
10.1 Using APIs 295
  Retrieving data: API vs. browser interface 296 ■ Determining the value of using an API 301 ■ Exercises 303
10.2 Web scraping 304
  Scraping the web for data 305 ■ Extracting the data we need 307 ■ Determining the value of web scraping 312 ■ Exercises 313
10.3 Tapping into public data sources 314
  When did public data become so popular? 315 ■ Types of public data sources 316 ■ Accessing public data 321 ■ Exercises 324

11 The magic of well-structured data 326
11.1 The analyst’s dilemma 327
11.2 Data tools at a glance 329
  Data gathering 330 ■ Data storage 331 ■ Data processing 334 ■ Exercises 337
11.3 Data management practices 338
  Data structure 338 ■ Data quality 342 ■ Using metadata 346 ■ Exercises 348
11.4 Data for analytics 349
  Analytics engineering 350 ■ Modeling data in star schemas 353 ■ Exercises 357

12 Tools and tech for modern data analytics 360
12.1 The evolution of data analytics 361
12.2 Analysis and reporting tools 362
  Types of tools 363 ■ Libraries of analyses 367 ■ Exercises 372
12.3 Self-service analytics 372
  Approaches to self-service 372 ■ Creating a strategy 374 ■ Exercises 375
12.4 Artificial intelligence 376
  Generative AI 377 ■ Future directions 382 ■ Final thoughts 382 ■ Exercises 383

references 385
index 389
foreword

Early in my career, I was conducting a complex analysis as part of a consulting project. I spent weeks collecting and cleaning data and putting together a model that pointed to a clear recommendation. (True confession: if you were confused as to the seemingly random pricing of Wi-Fi on a major US airline in the early 2010s, I was partly to blame.)

As I walked my manager through it, however, his feedback was somewhat unexpected: he thought I did a great job on my analysis, but he wanted me to change my methodology, explaining that another angle would “better fit the story the client wants to hear.”

What?! As a data analyst, I saw my job as a seeker of objective truth, out to understand the world through exacting measurement. Changing my approach because it fits a story someone wanted to hear felt . . . weird.

But looking back, I’m not sure it was wrong. The alternative approach was completely valid; it’s not like we were faking data. And just because it was what the client wanted to hear didn’t make it inaccurate; it was just another way to look at the world.

Reading through this new book by Mona Khalil, I found myself reflecting on lessons from a career in data and finding a new appreciation for the science (and art!) of effective data analysis. I believe practitioners of any experience and technicality can benefit from this book and avoid common pitfalls so they can make new, unique mistakes of their own.

Situations like the one I described are incredibly common in the world of data analysis. There’s rarely a straightforward “right” answer to the complex questions that arise in organizations. One often finds situations with incomplete data, varying assumptions, and unclear conclusions. Fusing objective measurement with subjective interpretation and thoughtful communication is a much larger part of the role of a data practitioner than most give credit for, and this took me years to fully understand.

That learning curve would have come much faster if I had access to this wonderful book. That’s not just because it’s full of helpful technical pointers, methodologies, and exercises (which it certainly is) but rather because it does such an effective job of breaking down how to think about the craft of data analysis, in all its nuance.

Mona takes the time to unpack topics like defining metrics and the theory of measurement. She breaks down how to think about results that don’t support a hypothesis, and how to present findings to stakeholders (including how to debug the kind of situation I found myself in so many years ago).

This toolkit is especially important as we enter the exciting and exotic world of AI. It turns out that LLMs can write code pretty well; I certainly wouldn’t recommend aspiring analysts to spend a lot of time memorizing arcane syntax, but I would encourage them to study how to properly interpret, contextualize, and communicate results, the kind of things that a model may never come to do as well as a human.

As a final note, I’ll add that Mona shares the one lesson that every aspiring data scientist needs to learn: “simpler” is often “better.”

BARRY MCCARDEL, CEO, HEX TECHNOLOGIES
preface

I started my data analytics career in 2016 after leaving a PhD program in psychology. In my first role, I was the sole data analyst at a school, responsible for helping the administration make informed decisions and allocate resources effectively. I had to set my own priorities, identify high-value projects, and learn how to communicate effectively with various stakeholders. Needless to say, my work involved a lot of trial and error to align with the rest of the organization. To grow my career, I had to actively seek guidance outside of the workplace and learn many lessons on my own.

Among data professionals, I hear of similar experiences time and time again. We enter the workforce with a set of technical skills (e.g., statistics, programming) but struggle to apply them to complex real-world scenarios, such as managing stakeholders and prioritizing work on our own. We often report to managers who don’t have data analytics experience, have few peers to collaborate with, and spend a large chunk of our time with data literacy education and advocating for the resources we need to perform our jobs effectively. It’s not easy.

I wrote this book as a resource for early- to mid-career analysts. It’s written with the type of guidance that a mentor might provide, structured in a way that I (and many colleagues I spoke to) would have considered a valuable reference in their day-to-day work. The world of data analytics can be overwhelming, and it can often seem like we need to figure things out on our own. This book is designed to go beyond the current resources on data analytics (e.g., Python, SQL, statistics) and guide you through the real-world application of these skills. It’s intended to provide clear, actionable advice regardless of your industry or focus.

In today’s world where organizations are inundated with data, the need for skilled analysts who can distill meaning from this data has never been greater. By mastering the comprehensive range of hard and soft skills covered in this book, you will be well-equipped to choose the right approach to solve complex problems, communicate your findings, and drive strategic decisions that add real value to your organization.

Writing this book has been an incredible journey. I had the opportunity to reflect on my career, learn from my past experiences, and reaffirm my belief in the value of learning and sharing knowledge. I hope this book inspires you to take your career to the next level, ask questions of your data, and become a leader at your organization who leaves a positive impact in their wake.

To the readers of this book, I wish you the best on your learning journey. While we’re in a dynamic and ever-changing field, the core of our work remains the same. Ask questions, take on new challenges, and never stop learning. Remember that our goal is not just to analyze data, but to drive meaningful change and improvements with our analyses.

Thank you for choosing this book as part of your learning journey. I hope it serves to empower you with knowledge and confidence in your career.
acknowledgments

I started writing this book at a crossroads in my life. I saw the opportunity to dive into a topic dear to my heart, hone my craft, and connect with so many professionals leveraging data in their day-to-day lives. There are so many people that I would like to thank for their support and contributions to this work.

First, I want to thank my dear friend Keshia, who spent countless hours with me on virtual writing sessions. I always appreciate how willing you were to set aside time to listen, share ideas, or simply work together on our respective goals. Thank you for being an amazing, supportive presence in my life.

Next, I’d like to thank my partner Srdjan. You’ve been my greatest advocate throughout the writing process. Thank you for watching our dogs so I can get work done, for setting aside time to write together, and for every ounce of encouragement you’ve given me during this process. I look forward to you completing your book as well!

Additionally, I want to extend all of my gratitude to my editor at Manning, Elesha. Thank you so much for working with me on this book. I deeply appreciate your guidance, your feedback, and your patience with each chapter, ensuring I covered each topic in an engaging and accessible manner. The final version of Effective Data Analysis is so much better thanks to your invaluable contributions.

A big thank you to my technical editor, Ryan Folks, whose feedback taught me so much. Ryan’s experience as a data scientist specializing in healthcare analytics within the Department of Anesthesiology at the University of Virginia made him the perfect sounding board during the writing process. Thanks also to Jason Richter, for making sure all of the book’s code was accurate and accessible to readers; Andrew Freed, for making sure every line of code ran perfectly and every detail was on point; and Michael Stephens at Manning, who believed in me when I pitched a completely different topic for a book than we initially discussed. You’re all incredible.

Finally, I want to thank everyone who took the time to review this book: to Alain Couniot, Andrej Abramušić, Ben McNamara, Cairo Cananéa, Carlos Pavia, David Shafer, Diogines Goldoni, Ed Lo, Helen Mary Labao-Barrameda, John Williams, Kristina Kasanicova, Laud Bentil, Maria Ana, Marlin Keys, Martin Czygan, Matthew Copple, Mattia Zoccarato, Murugan Lakshmanan, Nijil Chandran, Richard Vaughan, Sri Ram Macharla, Stefano Ongarello, Tony Dubitsky, Xiangbo Mao, Walter Alexander Mata López, Weronika Burman, and Werner Nindl, your suggestions helped make this a better book.
about this book

Effective Data Analysis was written to help you strategically apply the skills you’ve learned to the business problems you encounter in your work. It’s designed to help you go beyond the technical and programming components of your work and understand the depth and value you can bring to the teams you collaborate with. This book can help you succeed in job interviews and become familiar with the day-to-day job of an analyst, and you can keep it on your desk as an ongoing reference throughout your career.

Who should read this book?

This book is primarily written for aspiring and early- or mid-career data professionals in analytics or data science and analytics engineers who answer questions for stakeholders and inform business decisions. We enter the field from a wide variety of backgrounds, giving each of us different strengths and opportunities to grow. There are many resources available on the specific technical aspects of our field (SQL, Python, R, statistics, machine learning), but few resources cover the skills you need to be a strategic partner to your colleagues. This book brings together the depth and nuance you’ll need to know how and where to apply your skills to problems, creating a clear and efficient path to delivering value in your work.

How this book is organized: A roadmap

This book has 12 chapters organized into 3 parts.

Part 1 orients your analytical thinking around asking and answering powerful questions:

  Chapter 1 covers the areas of focus and specialization in analytics, as well as common tasks and responsibilities associated with each.
  Chapter 2 covers the process of crafting an analytical question, testing it, creating appropriate deliverables, and ensuring your work can be reproduced by other data professionals.
  Chapter 3 covers the construction of a hypothesis based on an analytical question and the process of gathering information and designing a research and analytical framework based on the hypothesis.

Part 2 covers the approaches in statistical testing and measurement that can be used to answer analytical questions:

  Chapter 4 breaks down the most common statistical tests used to answer questions. The underlying logic of each test and the limitations of their use are covered.
  Chapter 5 discusses aspects of statistics that aren’t included in traditional curricula, including their historical development, non-parametric alternatives to common tests, and best practices for using statistics responsibly.
  Chapter 6 covers techniques used to appropriately translate abstract phenomena into measurable constructs.
  Chapter 7 breaks down the step-by-step strategies needed to design effective organizational metrics and KPIs based on available measures.

Part 3 encompasses a range of skills and strategies needed to take your analytics career to the next level:

  Chapter 8 covers the steps needed to responsibly handle and analyze sensitive and protected data.
  Chapter 9 describes the process of statistical modeling for different stakeholder deliverables.
  Chapter 10 discusses the value you can bring to your analyses by obtaining third-party data from APIs, web scraping, and public data sources.
  Chapter 11 teaches data management strategies, including data engineering stacks and the recently emerging practice of analytics engineering.
  Chapter 12 reviews modern data analytics tools and resources that can be used to enhance your deliverable’s quality, speed of delivery, and capacity for enabling self-service.
About the code

This book contains many examples of source code in line with normal text. The source code is formatted in a fixed-width font like this to separate it from ordinary text. Code annotations accompany much of the source code, highlighting important concepts.

All of the code in this book is written in Python, and it assumes you have an intermediate working knowledge of Python for data analysis (e.g., the numpy and pandas libraries). The code and datasets from chapters 3 through 11 are available for download in an ipynb (Jupyter Notebook) format. Additional code for the chapter 4 and 5 exercises is available in separate Jupyter notebooks.

You can get executable snippets of code from the liveBook (online) version of this book at https://livebook.manning.com/book/effective-data-analysis. The complete code for the examples in the book is available for download from the Manning website at www.manning.com, and from GitHub at https://github.com/mona-kay/effective-data-analysis.

liveBook discussion forum

Purchase of Effective Data Analysis includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/effective-data-analysis/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website for as long as the book is in print.