📄 Page
2
Practical Data Science with Jupyter

Explore Data Cleaning, Pre-processing, Data Wrangling, Feature Engineering and Machine Learning using Python and Jupyter

Prateek Gupta

www.bpbonline.com
📄 Page
3
FIRST EDITION 2019
SECOND EDITION 2021

Copyright © BPB Publications, India
ISBN: 978-93-89898-064

All Rights Reserved. No part of this publication may be reproduced, distributed or transmitted in any form or by any means or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored and executed in a computer system, but cannot be reproduced by means of publication, photocopy, recording, or by any electronic or mechanical means.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY

The information contained in this book is true and correct to the best of the author’s and publisher’s knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book.

All trademarks referred to in the book are acknowledged as properties of their respective owners, but BPB Publications cannot guarantee the accuracy of this information.
📄 Page
4
Distributors:

BPB PUBLICATIONS
20, Ansari Road, Darya Ganj
New Delhi-110002
Ph: 23254990/23254991

MICRO MEDIA
Shop No. 5, Mahendra Chambers, 150 DN Rd.
Next to Capital Cinema, V.T. (C.S.T.) Station, MUMBAI-400 001
Ph: 22078296/22078297

DECCAN AGENCIES
4-3-329, Bank Street, Hyderabad-500195
📄 Page
5
Ph: 24756967/24756400

BPB BOOK CENTRE
376 Old Lajpat Rai Market, Delhi-110006
Ph: 23861747

Published by Manish Jain for BPB Publications, 20 Ansari Road, Darya Ganj, New Delhi-110002 and Printed by him at Repro India Ltd, Mumbai

www.bpbonline.com
📄 Page
6
Dedicated to

All aspiring data scientists
who have chosen to solve this world’s problems with data
📄 Page
7
About the Author

Prateek Gupta is a seasoned Data Science professional with nine years of experience in finding patterns and applying advanced statistical methods and algorithms to uncover hidden insights. His data-driven solutions maximize revenue and profitability and ensure efficient operations management. He has worked with several multinational IT giants like HCL, Zensar, and Sapient.

He is a self-starter and committed data enthusiast with expertise in the fishing, winery, and e-commerce domains. He has helped various clients with his machine learning expertise in automatic product categorization, sentiment analysis, customer segmentation, product recommendation engines, and object detection and recognition models.

He is a firm believer in “Hard work triumphs talent when talent doesn’t work hard”.

His main interests are cutting-edge research papers on machine learning and applications of natural language processing with computer vision techniques. In his leisure time, he enjoys sharing knowledge through his blog and motivating young minds to enter the exciting world of Data Science.

His Blog: http://dsbyprateekg.blogspot.com/
His LinkedIn Profile: www.linkedin.com/in/prateek-gupta-64203354
📄 Page
8
Acknowledgement

I would like to thank some of the brilliant knowledge-sharing minds, Jason Brownlee, Ph.D., Adrian Rosebrock, Ph.D., and Andrew Ng, from whom I have learned and am still learning many concepts. I would also like to thank the open data science community, Kaggle, and various data science bloggers for making data science and machine learning knowledge available to everyone.

I would also like to express my gratitude to almighty God, my parents, my wife Pragya, and my brother Anubhav, for being incredibly supportive throughout my life and during the writing of this book.

Finally, I would like to thank the entire BPB Publications team, who made this book possible. Many thanks to Manish Jain, Nrip Jain, and Varun Jain for giving me the opportunity to write my second book.
📄 Page
9
Preface

Today, Data Science has become an indispensable part of every organization, and employers are willing to pay top dollar to hire skilled professionals. As the needs of industry change rapidly, data continues to grow and evolve, thereby increasing the demand for data scientists. However, one question continuously haunts every company: are there enough highly skilled individuals who can analyze how much data will be available, where it will come from, and which advancements in analytical techniques will yield more significant insights? If you have picked up this book, you must have already come across these topics through talks or blogs from several experts and leaders in the industry.

To become an expert in any field, everyone must start learning from some point. This book is designed with that perspective in mind, to serve as your starting point in the field of data science. When I started my career in this field, I had little luck finding a compact guide that I could use to learn the concepts of data science, practice examples, and revise them when faced with similar problems. I soon realized that Data Science is a very vast domain, and fitting all of that knowledge into one small book is all but impossible. Therefore, I decided to gather my experience in the form of this book, where you’ll gain the essential knowledge and skill set required to become a data scientist, without wasting your valuable time finding material scattered across the internet.
📄 Page
10
I have arranged the chapters of this book as a chain. In the first chapter, you will become familiar with data and the new data science skill set. The second chapter is all about setting up the tools of the trade, with which you can practice the examples discussed in the book. In chapters three to six, you will learn the Python data structures that you will use in your day-to-day data science projects. In the seventh chapter, you will learn how to interact with different databases from Python. The eighth chapter will teach you the most used statistical concepts in data analysis. By the ninth chapter, you will be all set to start your journey of becoming a data scientist by learning how to read, load, and understand different types of data in a Jupyter notebook for analysis. The tenth and eleventh chapters will guide you through different data cleaning and visualization techniques.

From the twelfth chapter onwards, you will have to combine the knowledge acquired from previous chapters to pre-process data for real-world use cases. In chapters thirteen and fourteen, you will learn about supervised and unsupervised machine learning problems and how to solve them. Chapters fifteen and sixteen cover time series data and teach you how to handle it. After covering the key concepts, I have included four different case studies, where you will apply all the knowledge acquired and practice solving real-world problems. The last three chapters of this book will make you an industry-ready data scientist. Following best practices when structuring your project and using a GitHub repository, along with your Data Science concepts, will keep you from feeling out of your depth when working with other software engineering teams.
📄 Page
11
The book you are holding is my humble effort not only to cover the fundamentals of Data Science using Python, but also to save your time by focusing on minimal theory and more practical examples. These practical examples include real-world datasets and real problems, which will make you confident in tackling similar or related data problems. I hope you find this book valuable, and that it enables you to extend your data science knowledge as a practitioner in a short time.
📄 Page
12
Downloading the coloured images:

Please follow the link to download the Coloured Images of the book: https://rebrand.ly/75823

Errata

We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and to provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To let us maintain the quality and help us reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at: errata@bpbonline.com

Your support, suggestions and feedback are highly appreciated by the BPB Publications’ Family.
📄 Page
13
Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at business@bpbonline.com for more details. At you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
📄 Page
14
BPB is searching for authors like you

If you're interested in becoming an author for BPB, please visit www.bpbonline.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

The code bundle for the book is also hosted on GitHub at In case there's an update to the code, it will be updated on the existing GitHub repository. We also have other code bundles from our rich catalog of books and videos available at Check them out!

PIRACY

If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at business@bpbonline.com with a link to the material.

If you are interested in becoming an author
📄 Page
15
If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit

REVIEWS

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at BPB can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about BPB, please visit
📄 Page
16
Table of Contents

1. Data Science Fundamentals
    Structure
    Objective
    What is data?
    Structured data
    Unstructured data
    Semi-structured data
    What is data science?
    What does a data scientist do?
    Real-world use cases of data science
    Why Python for data science?
    Conclusion
2. Installing Software and System Setup
    Structure
    Objective
    System requirements
    Downloading Anaconda
    Installing the Anaconda on Windows
    Installing the Anaconda in Linux
    How to install a new Python library in Anaconda?
    Open your notebook – Jupyter
    Know your notebook
    Conclusion
3. Lists and Dictionaries
    Structure
📄 Page
17
    Objective
    What is a list?
    How to create a list?
    Different list manipulation operations
    Difference between Lists and Tuples
    What is a Dictionary?
    How to create a dictionary?
    Some operations with dictionary
    Conclusion
4. Package, Function, and Loop
    Structure
    Objective
    The help() function in Python
    How to import a Python package?
    How to create and call a function?
    Passing parameter in a function
    Default parameter in a function
    How to use unknown parameters in a function?
    A global and local variable in a function
    What is a Lambda function?
    Understanding main in Python
    while and for loop in Python
    Conclusion
5. NumPy Foundation
    Structure
    Objective
    Importing a NumPy package
    Why use NumPy array over list?
📄 Page
18
    NumPy array attributes
    Creating NumPy arrays
    Accessing an element of a NumPy array
    Slicing in NumPy array
    Array concatenation
    Conclusion
6. Pandas and DataFrame
    Structure
    Objective
    Importing Pandas
    Pandas data structures
    Series
    DataFrame
    .loc[] and .iloc[]
    Some Useful DataFrame Functions
    Handling missing values in DataFrame
    Conclusion
7. Interacting with Databases
    Structure
    Objective
    What is SQLAlchemy?
    Installing SQLAlchemy package
    How to use SQLAlchemy?
    SQLAlchemy engine configuration
    Creating a table in a database
    Inserting data in a table
    Update a record
    How to join two tables
📄 Page
19
    Inner join
    Left join
    Right join
    Conclusion
8. Thinking Statistically in Data Science
    Structure
    Objective
    Statistics in data science
    Types of statistical data/variables
    Mean, median, and mode
    Basics of probability
    Statistical distributions
    Poisson distribution
    Binomial distribution
    Normal distribution
    Pearson correlation coefficient
    Probability Density Function (PDF)
    Real-world example
    Statistical inference and hypothesis testing
    Conclusion
9. How to Import Data in Python?
    Structure
    Objective
    Importing text data
    Importing CSV data
    Importing Excel data
    Importing JSON data
    Importing pickled data
📄 Page
20
    Importing a compressed data
    Conclusion
10. Cleaning of Imported Data
    Structure
    Objective
    Know your data
    Analyzing missing values
    Dropping missing values
    Automatically fill missing values
    How to scale and normalize data?
    How to parse dates?
    How to apply character encoding?
    Cleaning inconsistent data
    Conclusion
11. Data Visualization
    Structure
    Objective
    Bar chart
    Line chart
    Histograms
    Scatter plot
    Stacked plot
    Box plot
    Conclusion
12. Data Pre-processing
    Structure
    Objective