Statistics for Machine Learning
Implement Statistical Methods used in Machine Learning using Python
Himanshu Singh
www.bpbonline.com
FIRST EDITION 2021
Copyright © BPB Publications, India
ISBN: 978-93-88511-97-1
All Rights Reserved. No part of this publication may be reproduced, distributed or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored and executed in a computer system but cannot be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.
LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The information contained in this book is true and correct to the best of the author's and publisher's knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book. All trademarks referred to in the book are acknowledged as properties of their respective owners, but BPB Publications cannot guarantee the accuracy of this information.
Distributors:
BPB PUBLICATIONS 20, Ansari Road, Darya Ganj New Delhi-110002 Ph: 23254990/23254991 MICRO MEDIA Shop No. 5, Mahendra Chambers, 150 DN Rd. Next to Capital Cinema, V.T. (C.S.T.) Station, MUMBAI-400 001 Ph: 22078296/22078297 DECCAN AGENCIES 4-3-329, Bank Street, Hyderabad-500195 Ph: 24756967/24756400
BPB BOOK CENTRE 376 Old Lajpat Rai Market, Delhi-110006 Ph: 23861747 Published by Manish Jain for BPB Publications, 20 Ansari Road, Darya Ganj, New Delhi-110002 and Printed by him at Repro India Ltd, Mumbai www.bpbonline.com
Dedicated to My Dad, who never stopped believing in me, even though he never expressed it.
About the Author Himanshu Singh is currently an AI technology lead and senior NLP developer at Legato Health Technologies (An Anthem Company). Himanshu has a total of 7 years of experience, mostly in the domain of Natural Language Processing. He has written five books in the machine learning domain and is a guest faculty for machine learning and data science. Himanshu is an avid blogger and loves to read and write fiction short stories in his free time.
About the Reviewer Aravind Kota is currently working as a data scientist. He has more than 3 years of experience in the field of data science, specializing in image and text analytics and statistical operations with Python. He shares his knowledge in this field through blogs so that readers can understand these concepts and use them in further experiments.
Acknowledgements First and foremost, I would like to thank my team. It is because of them that I got the opportunities to explore different problem statements, which has enabled me to write this book. I would especially like to thank Aravind, Bhavani, and Yunis sir. I would also like to thank my students. Because of them, I came across all the doubts that they faced while understanding statistics. This, in turn, gave me ideas to approach this book in such a way that it clears the doubts of its readers. Last but not least, I would like to thank my wife, Shikha. She has been a constant source of motivation for me, and without her, I would have never been able to finish the book.
Preface This book can be considered a prerequisite for starting the machine learning journey in detail. Machine learning itself depends on the concepts of statistics and mathematics. Statistical concepts are used in various areas of machine learning, such as data exploration, finding the efficiency and efficacy of variables as well as models, and making visualizations. This book is designed so that a reader can go through all the required concepts of statistics and then move on to understanding machine learning algorithms.
The book can be seen as having three sections. The first section starts with the basics of statistics. It covers preliminary concepts like mean, median, and mode, and moves on to concepts related to probability, random variables, and the like. The second section covers the more complex parts of statistics, including advanced concepts like statistical tests, parametric and non-parametric tests, and their applications in Python. Finally, the last section explains how to use various data science packages in Python and introduces readers to machine learning and some of its algorithms.
Downloading the coloured images:
Please follow the link to download the coloured images of the book: https://rebrand.ly/vqukb
Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and to provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To let us maintain the quality and help us reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at: errata@bpbonline.com
Your support, suggestions and feedback are highly appreciated by the BPB Publications family.
Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at business@bpbonline.com for more details. You can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
BPB is searching for authors like you
If you're interested in becoming an author for BPB, please visit www.bpbonline.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
The code bundle for the book is also hosted on GitHub. In case there's an update to the code, it will be updated on the existing GitHub repository. We also have other code bundles from our rich catalog of books and videos available. Check them out!
PIRACY
If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at business@bpbonline.com with a link to the material.
REVIEWS
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at BPB can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
Table of Contents

1. Introduction to Statistics
    Structure
    Objectives
    Population and Sample
    Introduction to Random Variables
    Discrete Random Variables
    Continuous Random Variables
    Other variables
    Numerical variables
    Categorical Variables
    Introduction to Descriptive Statistics
    Visualizations
    Conclusion

2. Descriptive Statistics
    Structure
    Objective
    Measures of Central Tendency
    Mean (Arithmetic)
    Median
    Mode
    Unimodal data
    Bimodal data
    Multimodal data
    Measures of dispersion
    Range
    Quartile
    Standard Deviation
    Standard Deviation vs. Variance
    The Strength of the relationship between variables
    Dependent variables
    Independent variables
    Covariance
    Correlation
    Conclusion

3. Random Variables
    Structure
    Objective
    Random Variables
    Discrete Random Variables
    Continuous Random Variables
    Joint Distributions
    Independent Random Variables
    Marginal and Conditional Distributions
    Definition of Mathematical Expectation
    Properties of Mathematical Expectation
    Chebyshev’s Inequality
    Law of large numbers
    Conclusion

4. Probability
    Structure
    Objective
    Introduction
    Properties of probability
    Intersection of sets
    Union of sets
    The complement of a set
    Null set
    Subset/superset
    Some other terminologies
    Mutually exclusive events
    Mutually exhaustive events
    Commutative laws
    Associative laws
    Distributive laws
    De Morgan’s law
    Conditional probability
    Dependent and independent events
    Bayes’s theorem
    Probability distributions
    Binomial distribution
    Geometric distribution
    Poisson distribution
    Normal distribution
    Conclusion

5. Parameter Estimation
    Structure
    Objective
    Parameter estimation
    Point estimate – The mathematics way
    Sampling distributions
    Central Limit Theorem
    Estimators having bias component
    The variance of a point estimate
    Standard Error of Estimator
    Mean Squared Error of Estimator
    Methods to Determine Point Estimates
    Method of Moments
    Maximum Likelihood Method
    Confidence Intervals
    Conclusion

6. Hypothesis Testing
    Structure
    Objective
    Hypothesis
    Hypothesis Testing
    Confidence Interval
    Types of Hypothesis
    Null Hypothesis
    Alternative Hypothesis
    P-Value
    Steps in hypothesis testing
    Use Case
    Z-test
    T-test
    One-sample T-test
    Two-sample T-test
    Paired T-test
    Chi-Square test
    Test of Goodness of Fit
    Independence test
    Conclusion

7. Analysis of Variance
    Structure
    Objective
    Introduction to ANOVA
    One-way ANOVA test
    Calculation of Mean Square due to Error
    Calculation of Mean Square due to Treatment
    Decision Rule
    Tukey test
    Two-way ANOVA
    Main Effects
    Interaction Effects
    Multivariate Analysis of Variance (MANOVA)
    Wilks’ Lambda test
    Lawley Hotelling Trace
    Pillai’s Trace
    Roy’s Largest Root
    Conclusion

8. Regression
    Structure
    Objective
    Simple Linear Regression
    Finding the Values of and
    Standard Error
    Confidence Intervals
    Unimportant Variable
    Accuracy of Prediction
    Data Pre-processing
    Multiple Linear Regression
    Polynomial Regression
    Subset Selection Method
    Ridge Regression
    Lasso Regression
    ElasticNet Regression
    Logistic Regression
    Estimation of Parameters
    Understanding Residuals
    Patterns of Residuals
    Multicollinearity
    Conclusion

9. Data Analysis Using Python
    Structure
    Objectives
    Pandas
    Importing and Reading a CSV Sheet
    Basic Exploration of Data
    Converting a Python Data Structure to Data Frame
    Numerical Description of a Data Frame
    Adding Conditions in Pandas
    Extending Extractions – loc and iloc
    Understanding the iloc() Function
    Understanding the loc() Function
    Tackling Null Values
    Concatenating Data Frames
    Merging Data Frames
    Left Join
    Right Join
    Outer Join
    Inner Join