Statistics Every Programmer Needs (Gary Sutton)

MANNING
Gary Sutton

Stationary vs. Non-Stationary Time Series Data

This panel illustrates the visual distinction between stationary and non-stationary time series. Stationary series exhibit constant mean and variance over time, whereas non-stationary series show trends, changing variance, or other structural shifts. Identifying these properties is essential before fitting time series models that assume stationarity.
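The distinction between the two kinds of series can be made concrete with a small simulation. What follows is a minimal sketch, not code from the book: it assumes NumPy is available, and the seed, the series length, and the names `noise`, `random_walk`, and `half_stats` are illustrative choices. White noise stands in for a stationary series, and its cumulative sum (a random walk) stands in for a non-stationary one.

```python
import numpy as np

# Illustrative data only: the seed and series length are arbitrary assumptions.
rng = np.random.default_rng(42)
n = 1000

noise = rng.normal(loc=0.0, scale=1.0, size=n)  # stationary: constant mean and variance
random_walk = np.cumsum(noise)                  # non-stationary: variance grows over time


def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())


print("white noise:", half_stats(noise))
print("random walk:", half_stats(random_walk))

# First differences of the random walk recover the original noise terms,
# which is why differencing is a common way to remove this kind of drift.
differenced = np.diff(random_walk)
print("differencing recovers the noise:", np.allclose(differenced, noise[1:]))
```

For the white noise, the two halves should report similar means and variances, while for the random walk they typically differ noticeably. Comparing halves is only an informal check and does not replace a formal stationarity test.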

Statistics Every Programmer Needs
GARY SUTTON
MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2025 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Ian Hough
Technical editor: Rohit Goswami
Review editor: Angelina Lazukić
Production editor: Keri Hales
Copy editor: Tiffany Taylor
Proofreader: Jason Everett
Typesetter and cover designer: Marija Tudor

ISBN 9781633436053
Printed in the United States of America

In memory of my Mom

brief contents

1 ■ Laying the groundwork
2 ■ Exploring probability and counting
3 ■ Exploring probability distributions and conditional probabilities
4 ■ Fitting a linear regression
5 ■ Fitting a logistic regression
6 ■ Fitting a decision tree and a random forest
7 ■ Fitting time series models
8 ■ Transforming data into decisions with linear programming
9 ■ Running Monte Carlo simulations
10 ■ Building and plotting a decision tree
11 ■ Predicting future states with Markov analysis
12 ■ Examining and testing naturally occurring number sequences
13 ■ Managing projects
14 ■ Visualizing quality control

contents

preface
acknowledgments
about this book
about the author
about the cover illustration

1 Laying the groundwork
1.1 Stats and quant
Understanding the basics ■ Why they matter ■ The broader effect ■ Diving deeper: Core concepts
1.2 Why Python?
Rich ecosystem ■ Ease of learning ■ Online support and community ■ Industry adoption ■ Versatility
1.3 Python IDEs
IDLE: A starting point ■ PyCharm: A professional tool ■ Other popular IDEs
1.4 Benefits and learning approach
From statistical measures to real-world application ■ Expanding beyond traditional techniques ■ A balanced approach to theory and practice
1.5 How this book works
Foundational learning with exploration and practice ■ Using Python for precision and efficiency ■ Adaptable learning for diverse skill levels
1.6 What this book does not cover

2 Exploring probability and counting
2.1 Basic probabilities
Probability types ■ Converting and measuring probabilities
2.2 Counting rules
Multiplication rule ■ Addition rule ■ Combinations and permutations
2.3 Continuous random variables
Examples ■ Probability density function ■ Cumulative distribution function
2.4 Discrete random variables
Examples ■ Probability mass function ■ Cumulative distribution function

3 Exploring probability distributions and conditional probabilities
3.1 Probability distributions
Normal distribution ■ Binomial distribution ■ Discrete uniform distribution ■ Poisson distribution
3.2 Probability problems
Complement rule for probability ■ Quick reference guide ■ Applied probability: Examples and solutions
3.3 Conditional probabilities
Examples ■ Conditional probabilities and independence ■ Intuitive approach to conditional probability ■ Formulaic approach to conditional probability

4 Fitting a linear regression
4.1 Primer on linear regression
Linear equation ■ Goodness of fit ■ Conditions for best fit
4.2 Simple linear regression
Importing and exploring the data ■ Fitting the model ■ Interpreting and evaluating the results ■ Testing model assumptions

5 Fitting a logistic regression
5.1 Logistic regression vs. linear regression
5.2 Multiple logistic regression
Importing and exploring the data ■ Fitting the model ■ Interpreting and evaluating the results ■ Calculating and evaluating classification metrics

6 Fitting a decision tree and a random forest
6.1 Understanding decision trees and random forests
6.2 Importing, wrangling, and exploring the data
Understanding the data ■ Wrangling the data ■ Exploring the data
6.3 Fitting a decision tree
Splitting the data ■ Fitting the model ■ Predicting responses ■ Evaluating the model ■ Plotting the decision tree ■ Interpreting and understanding decision trees ■ Advantages and disadvantages of decision trees
6.4 Fitting a random forest
Fitting the model ■ Predicting responses ■ Evaluating the model ■ Feature importance ■ Extracting random trees

7 Fitting time series models
7.1 Distinguishing forecasts from predictions
7.2 Importing and plotting the data
Fetching financial data ■ Understanding the data ■ Plotting the data
7.3 Fitting an ARIMA model
Autoregression (AR) component ■ Integration (I) component ■ Moving average (MA) component ■ Combining ARIMA components ■ Stationarity ■ Differencing ■ Stationarity and differencing applied ■ AR and MA components ■ Fitting the model ■ Evaluating model fit ■ Forecasting
7.4 Fitting exponential smoothing models
Model structure ■ Applicability ■ Mathematical properties ■ Types of exponential smoothing models ■ Choosing between ARIMA and exponential smoothing ■ SES and DES models ■ Holt–Winters model

8 Transforming data into decisions with linear programming
8.1 Problem formulation
The scenario ■ The challenge ■ The approach ■ Feature summaries
8.2 Developing the linear optimization framework
Explanation of linear equations and inequalities ■ Data definition ■ Objective function ■ Constraints ■ Decision variable bounds ■ Solving the linear programming problem ■ Result evaluation

9 Running Monte Carlo simulations
9.1 Applications and benefits of Monte Carlo simulations
9.2 Step-by-step process
9.3 Hands-on approach
Establishing a probability distribution (step 1) ■ Computing a cumulative probability distribution (step 2) ■ Establishing an interval of random numbers for each variable (step 3) ■ Generating random numbers (step 4) ■ Simulating a series of trials (step 5) ■ Analyzing the results (step 6)
9.4 Automating simulations on discrete data
Plotting and analyzing the results
9.5 Automating simulations on continuous data
Predicting stock prices with Monte Carlo simulations ■ Analyzing historical data (step 1) ■ Calculating log returns (step 2) ■ Computing statistical parameters (step 3) ■ Generating random daily returns (step 4) ■ Simulating prices (step 5) ■ Simulating multiple trials (step 6) ■ Analyzing the results (step 7)

10 Building and plotting a decision tree
10.1 Decision-making without probabilities
Maximax method ■ Maximin method ■ Minimax Regret method ■ Expected Value method
10.2 Decision trees
Creating the schema ■ Plotting the tree

11 Predicting future states with Markov analysis
11.1 Understanding the mechanics of Markov analysis
11.2 States and state probabilities
Understanding the vector of state probabilities for multistate systems ■ Matrix of transition probabilities
11.3 Equilibrium conditions
Predicting equilibrium conditions programmatically
11.4 Absorbing states
Obtaining the fundamental matrix ■ Predicting absorbing states ■ Predicting absorbing states programmatically

12 Examining and testing naturally occurring number sequences
12.1 Benford’s law explained
12.2 Naturally occurring number sequences
12.3 Uniform and random distributions
Uniform distribution ■ Random distribution ■ Plotted distributions
12.4 Examples
Street addresses ■ World population figures ■ Payment amounts
12.5 Validating Benford’s law
Chi-square test ■ Mean absolute deviation ■ Distortion factor and z-statistic ■ Mantissa statistics

13 Managing projects
13.1 Creating a work breakdown structure
13.2 Estimating activity times with PERT
13.3 Finding the critical path
Earliest times ■ Latest times ■ Slack ■ Finding the critical path programmatically
13.4 Estimating the probability of project completion
13.5 Crashing the project

14 Visualizing quality control
14.1 Quality control measures
Upper control limit and lower control limit ■ Mean and center line ■ Standard deviation ■ Range ■ Sample size ■ Proportion defective ■ Number of defective items ■ Number of defects ■ Defects per unit ■ Moving range ■ z-score ■ Process capability indices
14.2 Control charts for attributes
p-charts ■ np-charts ■ c-charts ■ g-charts
14.3 Control charts for variables
x-bar charts ■ r-charts ■ s-charts ■ I-MR charts ■ EWMA charts

index

preface

Data-driven decision-making has become a cornerstone of modern business, technology, and scientific research. Whether predicting financial trends, detecting potentially fraudulent activity, or managing large-scale projects, quantitative methods provide the foundation for solving complex problems with confidence. Yet, too often, critical decisions are made without using these powerful techniques, relying instead on intuition, tradition, or incomplete analysis. This needs to change. The ability to apply statistical reasoning and optimization methods is no longer a specialized skill: it is an essential competency for professionals in every data-driven field.

This book was born out of a need to bridge the gap between theoretical statistics and practical implementation, particularly for those who work with data but may not have a formal background in statistical modeling. My career has revolved around using statistical and analytical techniques to drive business intelligence and operational improvements. Over the years, I have seen firsthand how programmers, analysts, and professionals in various fields benefit from a deeper understanding of statistics, not just as a theoretical discipline but as a toolkit for solving real-world problems. Yet many resources either focus too heavily on mathematical derivations without application or provide code without sufficient explanation of the underlying principles. This book aims to strike a balance, offering both the “how” and the “why” behind each technique.

The idea for this book took shape as I noticed the increasing demand for statistical and machine learning techniques in business, finance, and engineering. Companies were hiring data scientists and analysts in record numbers, but many professionals found themselves needing to apply advanced methods without a structured way to learn them. More and more, I have seen practitioners who can write Python scripts and present results to leadership but lack a deep understanding of what is happening under the hood. This superficial knowledge can lead to misinterpretations, poor model assumptions, and flawed decision-making. Knowing how to apply statistical methods is important, but understanding when, why, and under what conditions they work is critical.

The book covers a range of topics essential for any data-driven professional, beginning with foundational probability theory and moving through regression analysis, decision trees, Monte Carlo simulations, and Markov chains. Later chapters explore project management and quality control, areas where quantitative methods play a crucial role in ensuring efficiency and reliability. Although Python is used throughout the book as a computational tool, this is not just a Python book; it is a guide to using quantitative methods effectively, providing reusable code alongside clear explanations to ensure that you understand the concepts behind the calculations.

A key focus of this book is demonstrating how these techniques are applied in practice. For instance, you will learn how to fit predictive models, optimize decisions using constrained optimization, simulate outcomes with Monte Carlo methods, and analyze patterns in naturally occurring number sequences. The book also emphasizes the importance of statistical rigor, showing when and how to validate results to avoid misleading conclusions.

Whether you are a programmer looking to enhance your statistical knowledge, a business professional making data-driven decisions, or a student seeking a structured way to approach quantitative methods, I hope this book serves as a valuable resource. By the time you finish, you will not only know how to apply these techniques but also have a deeper appreciation for the mathematical principles that underpin them. More importantly, you will gain the ability to make data-driven decisions with confidence, ensuring that complex problems are approached with the clarity and precision they deserve.

acknowledgments

This book, like any meaningful project, was not the work of one person alone. I’m grateful to the many collaborators, reviewers, and supporters who contributed their wisdom, time, and encouragement throughout this journey.

First and foremost, I’m deeply grateful to my development editor, Ian Hough, whose insight, candor, and steady guidance shaped this book at almost every stage. I was fortunate to work with him on my first book and even more so this time around.

I’m thankful to Marjan Bace, Manning’s publisher, and Andy Waldron, acquisitions editor, for giving me the opportunity to write a second book and apply the many lessons I learned from the first.

A special thank you to my technical editor, Rohit Goswami, a Software Engineer (II) at Quansight Labs and a Rannis-funded doctoral researcher at the Science Institute of the University of Iceland, whose sharp eye and deep expertise helped ensure that every technical detail was sound and every explanation accessible.

Thanks to Aleks Dragosavljevic, who coordinated the peer review process and no doubt handled countless other behind-the-scenes tasks that helped move this book forward.

Huge thanks to the production team (especially Azra Dedic, Angelina Lazukić, Marija Tudor, Keri Hales, Tiffany Taylor, and Jason Everett) for their expert handling of the many moving parts involved in bringing this book to life. Copyediting, typesetting, and proofreading a manuscript filled with code, equations, and plots is no small feat, and their attention to detail made all the difference.

I’m sincerely grateful to the peer reviewers who dedicated many hours to reading the manuscript and offering thoughtful, constructive feedback: Aastha Joshi, Aditi Godbole, Ajay Tanikonda, Akshay Phadke, Alireza Aghamohammadi, Anmolika Singh, Anupam Samanta, Ariel Andres, Arun Moorthy, Christian Sutton, Clemens Baader, Edgar Hassler, Eduardo Rienzi, Georg Sommer, Jatinder Singh, Ken W. Alger, Kevin Middleton, Krishna Gandhi, Mahima Bansod, Marius Radu, Monisha Athi Kesavan Premalatha, Pankaj Verma, Praveen Gupta Sanka, Purva Bangad, Steven Fernandez, Su Liu, Swati Tyagi, and Tony Dubitsky. Your suggestions helped make this a better book.

And finally, I want to thank my wife, Liane, who once again tolerated the wide variance in my mood swings while I wrote, and while I overthought what to write. Her patience remains unmatched.

about this book

This book is designed to provide you with a strong foundation in statistics and the practical application of quantitative techniques for real-world problem-solving. At its core, the book is about equipping you to make data-driven decisions in complex scenarios where resources are limited and choices carry significant consequences. By blending theoretical underpinnings with hands-on Python implementations, the text demystifies essential statistical and computational methods while helping you understand when and why to apply each technique. Through carefully constructed chapters on regression and other models, optimization, simulation, and more, the book ensures that you gain not only the tools to solve quantitative problems but also the intuition to select the right approach for the right challenge.

What sets this book apart is its commitment to a dual focus on theory and practice. Unlike texts that focus exclusively on statistical formulas or coding recipes, this book bridges the gap, explaining the “nuts and bolts” behind each method and providing reusable Python code that uses popular libraries like pandas, NumPy, Scikit-learn, statsmodels, SciPy, Matplotlib, and more. It caters to practitioners and college students with prior exposure to statistics who are eager to deepen their understanding of quantitative techniques and their applications. You will learn to automate tasks like running regressions or Monte Carlo simulations as you develop a critical understanding of the algorithms.

Who should read this book

This book is designed for professionals and students who want to develop a strong foundation in statistics and quantitative methods while using Python for implementation. Whether you’re an analyst, data scientist, project manager, or researcher, this book provides the essential techniques needed to make data-driven decisions, solve complex optimization problems, and implement predictive models.

A background in Python is helpful but not required. Although familiarity with Python will certainly make it easier to follow along with the code examples, the book does not assume advanced programming skills. Every code snippet is accompanied by explanations that clarify not only how to implement a technique but also why and how it works. If you’re new to Python, you can still benefit from the statistical and quantitative concepts presented, as long as you’re willing to experiment with the code and explore Python as you progress.

Some prior exposure to statistics and basic quantitative methods is beneficial. The book assumes a basic understanding of concepts like means, variances, probability, and simple regression, but it reinforces and builds on these foundations. If you’ve taken an introductory statistics course or have experience analyzing data, you’ll find this book a practical guide to applying and extending those skills.

This book is particularly well-suited for business analysts and data scientists looking to enhance their statistical modeling and decision-making skills with Python; engineers, researchers, and finance professionals who need quantitative methods to optimize processes, forecast trends, or assess risks; project managers and operations specialists who want to incorporate data-driven decision-making into their planning, resource allocation, and risk management strategies; students in business analytics, operations research, data science, and applied mathematics who seek a structured way to learn and apply statistical and quantitative techniques; and anyone transitioning into a data-driven role who wants to gain practical experience with statistical modeling, simulation techniques, and optimization methods.

The book covers a wide array of topics, including probability distributions, regression analysis, Monte Carlo simulations, Markov chains, decision trees, and constrained optimization. It also explores practical applications such as detecting fraud with Benford’s law, improving project scheduling with PERT and CPM, and assessing quality control using statistical charts.

For the most part, there is a one-to-one relationship between chapters and Python scripts, allowing you to download and execute code easily. Whether you’re following the book sequentially or jumping to a specific topic of interest, you’ll have access to well-structured Python scripts that support the concepts discussed. By the end of this book, you will not only have gained proficiency in implementing quantitative techniques with Python but also developed a deeper understanding of the mathematical principles behind them, ensuring that your analyses are both technically sound and practically impactful.

How this book is organized: A road map

This book is structured into 14 chapters, each designed to provide a focused exploration of a specific quantitative technique. For the most part, each technique is introduced, explained, and applied within a single chapter, ensuring that concepts are
