Data Science Fundamentals and Practical Approaches Understand Why Data Science Is the Next (Dr. Gypsy Nandi, Dr. Rupam Kumar Sharma)

Data Science Fundamentals and Practical Approaches Understand Why Data Science is the Next by Dr. Gypsy Nandi Dr. Rupam Kumar Sharma FIRST EDITION 2020 Copyright © BPB Publications, India ISBN: 978-93-89845-662 All Rights Reserved. No part of this publication may be reproduced or distributed in any form or by any means or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored and executed in a computer system but cannot be reproduced by means of publication. LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY The information contained in this book is true and correct to the best of the author’s and publisher’s knowledge. The author has made every effort to ensure the accuracy of these publications, but cannot be held responsible for any loss or damage arising from any information in this book. All trademarks referred to in the book are acknowledged as properties of their respective owners. Distributors: BPB PUBLICATIONS 20, Ansari Road, Darya Ganj New Delhi-110002 Ph: 23254990/23254991 MICRO MEDIA Shop No. 5, Mahendra Chambers, 150 DN Rd. Next to Capital Cinema, V.T. (C.S.T.) Station, MUMBAI-400 001 Ph: 22078296/22078297 DECCAN AGENCIES 4-3-329, Bank Street, Hyderabad-500195 Ph: 24756967/24756400

BPB BOOK CENTRE 376 Old Lajpat Rai Market, Delhi-110006 Ph: 23861747 Published by Manish Jain for BPB Publications, 20 Ansari Road, Darya Ganj, New Delhi-110002 and Printed by him at Repro India Ltd, Mumbai Dedicated to All the Enthusiastic Learners Who Thrive on Exploring and Learning for Knowledge Building About the Authors Dr. Gypsy Nandi is an Assistant Professor (Sr.) in the Department of Computer Applications, Assam Don Bosco University, India. Her areas of interest include Data Science, Social Network Mining, and Machine Learning. She completed her Ph.D. in the field of ‘Social Network Analysis and Mining.’ Her research scholars currently work mainly in the field of Data Science. She has several research publications in reputed journals and book series. She is also actively involved in mentoring students for consultancy-based competitions and startup funds. She has won the national-level Smart India Hackathon and the state-level Ideathon many times and has successfully carried out two sanctioned consultancy-based government projects funded by AICTE and UNDP. She has also co-authored a book on “Soft Computing – Fundamentals and Practical Approaches,” published by Studium Press. With more than 15 years of teaching experience, she is actively involved in research related to Data Science and Social Media Analytics. Dr. Rupam Kumar Sharma is an Assistant Professor (Sr.) in the Department of Computer Applications, Assam Don Bosco University, India. His areas of interest include Machine Learning, Data Analytics, Network, and Cyber Security. He has several research publications in reputed SCI and Scopus journals. He has also delivered lectures and trained hundreds of trainees and students across different institutes in the field of security and Android app development. He has also co-authored a book on “Soft Computing – Fundamentals and Practical Approaches,” published by Studium Press. 
He is a good mentor to students, and many of his students are working in reputed research and corporate institutes abroad. He has also received research grants from DSIR (Department of Scientific and Industrial Research). With more than ten years of teaching experience, he is a passionate teacher and welcomes community and group work with young minds. Acknowledgement We thank the Almighty, whose blessings have enabled us to remain determined and focused and to overcome with courage every hurdle that came our way. From the bottom of our hearts, we are grateful for the moral support that our family members have extended to us at all times and continue to extend. We also acknowledge the respect and perseverance in learning shown by our students, which motivated us to write this book so that knowledge can reach every aspiring student globally. Lastly, we thank BPB Publications for all the support and cooperation extended throughout to complete the project on time. Preface This book introduces the fundamental concepts of different tools and techniques related to Data Science that are widely used in a variety of

applications such as statistical analytics, business data analytics, social media analytics, and big data analytics. Topics covered in the book include fundamentals of Data Science, data preprocessing, data plotting and visualization, statistical data analysis, machine learning for data analysis, time-series analysis, deep learning for Data Science, social media analytics, business analytics, and big data analytics. Each chapter describes the fundamentals together with how various data analysis techniques can be implemented using different tools and libraries of the Python programming language. Readers with previous knowledge of Python programming will find it easy to understand the program examples presented in the chapters. Each chapter contains numerous examples and illustrative output that explain the important concepts. An appropriate number of questions is presented at the end of each chapter for self-assessment of conceptual understanding. The references presented at the end of every chapter will help readers explore a given topic further. Over the ten chapters in this book, you will learn the following: Chapter 1 discusses the various aspects of Data Science, such as the importance of and need for Data Science in today’s market. The main discussions covered in this chapter are the complete life cycle of data analytics, the various types of data analytics, and the major tools required for data analysis. The role of SQL in Data Science is covered, and the pros and cons of studying Data Science are explained to help readers understand the current demand for this area of study. All the topics discussed in this chapter are crucial for exploring Data Science and for building a career in Data Science, which is expected to remain one of the most in-demand careers for the next few decades. Chapter 2 gives a brief introduction to data preprocessing and its need for data analytics. 
The various data types and the possible error types that occur on data are also discussed in detail. Various standard data preprocessing operations are elaborately explained, starting from data cleaning, data integration, data transformation, data reduction, and data discretization. For each data preprocessing method, simple examples are provided, and the corresponding Python code is given to demonstrate how the preprocessing operation works. Chapter 3 gives a detailed explanation of the importance of data plotting and data visualization in data analytics. The importance and use of various data visualization graphs have been discussed. Several basic, specialized, and advanced visualization tools and the corresponding libraries used for each tool have been covered in detail. For each visualization tool, illustrative examples with well-explained Python code and corresponding output are also provided. This chapter covers an important concept of Data Science as the findings in a visualization graph may be subtle, yet it can create a profound impact on a data analyst to interpret the information easily. Chapter 4 introduces the importance of statistics in data analysis. The chapter starts with the role of statistics in data analysis, and then it elaborately discusses the two main kinds of statistics commonly used for data analysis, namely the descriptive statistics and the inferential statistics. The chapter also introduces probability theory and explains the various concepts related to probability theory. The later part of the chapter discusses Bayesian probability and provides illustrations and Python code for the same. Chapter 5 discusses in detail the concepts of machine learning used for data analysis. The chapter initially introduces the concept of machine learning and its primary role in data analysis. 
The various machine learning techniques that have evolved and established their presence in data analytics, such as supervised learning, unsupervised learning, and reinforcement learning, have been discussed in detail in the chapter. Chapter 6 starts with an overview of time-series analysis and also discusses the important characteristics of time-series data. The latest standard time-series models are discussed along with the mathematical foundations and corresponding related illustrative examples. Each of the time-series modeling techniques is also explained using appropriate Python programs to correlate the applicability of the theory to applications. Chapter 7 gives an overview of Deep learning relating to Data Science. The prerequisite concepts of deep learning are elaborately discussed here. The chapter also broadly explains the two main deep learning techniques that are widely used for data analysis, namely the Convolutional Neural Network and the AutoEncoder. Each of the deep learning techniques is explained using appropriate Python programs to correlate the applicability of the theory to applications. Chapter 8 gives a detailed overview of social media analytics and emphasizes the impact of data and analytics on social media. Social media data is huge and highly dynamic and unstructured. The chapter discusses how such data on social media can be extracted for further analysis. It also explains the various important analyses carried out for social media data such as social network analysis, text analysis, and trend analysis. Chapter 9 introduces to readers an overview of business analytics and its role in making decisions for businesses. 
The main discussions covered in detail in this chapter are the various standard applications of business analytics, namely financial analytics, market analytics, operational analytics, customer analytics, and employee analytics. The chapter also includes Python code to explain some of the practical approaches used by the business analytics team for analysis of business-related data. Chapter 10 gives a detailed overview of big data and its various characteristics. The architectural framework of big data is explained in the next section. The chapter also discusses the Hadoop system in detail by analyzing the Hadoop ecosystem, the Hadoop Distributed File System, and the MapReduce framework. The uses of Snakebite, Pig, and Spark are also explained in detail in this chapter. The installation of Hadoop, another important aspect of Big Data analytics for any reader to learn, is also covered in detail in this last chapter. Errata

We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and to provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect on and correct any human errors that occurred during the publishing process. To help us maintain quality and reach out to any readers who might be having difficulties due to unforeseen errors, please write to us at: errata@bpbonline.com Your support, suggestions and feedback are highly appreciated by the BPB Publications’ Family. Table of Contents 1. Fundamentals of Data Science Structure Objectives 1.1. Introduction to data science 1.2. Why learn data science? 1.3. Data analytics lifecycle 1.3.1. Data discovery 1.3.2. Data preparation 1.3.3. Model planning 1.3.4. Model building 1.3.5. Communicate results 1.3.6. Operationalization 1.4. Types of data analysis 1.4.1. Descriptive analysis 1.4.2. Diagnostic analysis 1.4.3. Predictive analysis 1.4.4. Prescriptive analysis 1.5. Types of jobs in Data Analytics 1.5.1. Data analyst 1.5.2. Data scientist 1.5.3. Data engineer 1.5.4. Database administrator 1.5.5. Data architect 1.5.6. Analytics manager 1.6. Data science tools 1.6.1. Python programming 1.6.2. R programming 1.6.3. SAS

1.6.4. Tableau Public 1.6.5. Microsoft Excel 1.6.6. RapidMiner 1.6.7. Knime 1.6.8. Apache Spark 1.7. Fundamental areas of study in data science 1.7.1. Machine learning 1.7.2. Deep learning 1.7.3. Natural Language Processing (NLP) 1.7.4. Statistical data analysis 1.7.5. Knowledge discovery and data mining 1.7.6. Text mining 1.7.7. Recommender systems 1.7.8. Data visualization 1.7.9. Computer vision 1.7.10. Spatial data management 1.8. Role of SQL in data science 1.9. Pros and cons of data science 1.10. Conclusion References Points to remember Exercises 2. Data Preprocessing Structure Objectives 2.1. Introduction to data preprocessing 2.2. Data types and forms 2.3. Possible data error types 2.4. Various data preprocessing operations 2.4.1. Data cleaning 2.4.1.1. Filling missing values 2.4.1.2. Smoothing noisy data 2.4.1.3. Detecting and removing outliers 2.4.2. Data integration 2.4.3. Data transformation

2.4.3.1. Rescaling data 2.4.3.2. Normalizing data 2.4.3.3. Binarizing data 2.4.3.4. Standardizing data 2.4.3.5. Label encoding 2.4.3.6. One hot encoding 2.4.4. Data reduction 2.4.4.1. Dimensionality reduction 2.4.4.2. Data cube aggregation 2.4.4.3. Numerosity reduction 2.4.5. Data Discretization 2.5. Conclusion References Points to remember Exercises 3. Data Plotting and Visualization Structure Objectives 3.1. Introduction to data visualization 3.2. Visual encoding 3.3. Data visualization software 3.4. Data visualization libraries 3.4.1. The matplotlib library 3.4.2. The seaborn library 3.4.3. The ggplot library 3.4.4. The Bokeh library 3.4.5. The plotly library 3.4.6. The pygal library 3.4.7. The geoplotlib library 3.4.8. The Gleam library 3.4.9. The missingno library 3.4.10. The Leather library 3.5. Basic data visualization tools 3.5.1. Histograms

3.5.2. Bar charts/graphs 3.5.3. Scatter plots 3.5.4. Line charts 3.5.5. Area plots 3.5.6. Pie charts 3.5.7. Donut charts 3.6. Specialized data visualization tools 3.6.1. Boxplots 3.6.2. Bubble plots 3.6.3. Violin plots 3.6.4. Heat map 3.6.5. Dendrogram 3.6.6. Radar chart 3.6.7. Venn diagram 3.6.8. Treemap chart 3.6.9. Parallel coordinates 3.6.10. 3D scatter plots 3.7. Advanced data visualization tools 3.7.1. Wordclouds 3.7.2. Chord diagram 3.7.3. Waffle charts 3.8. Visualization of geospatial data 3.8.1. Libraries used for geospatial data 3.8.2. Tools used for geospatial data 3.8.2.1. Choropleth map 3.8.2.2. Bubble map 3.8.2.3. Connection map 3.9. Data visualization types 3.10. Conclusion References Points to remember Exercises 4. Statistical Data Analysis Structure Objective

4.1. Role of statistics in data science 4.2. Kinds of statistics 4.2.1. Descriptive statistics 4.2.1.1. Measures of frequency 4.2.1.2. Measures of central tendency 4.2.1.3. Measures of dispersion 4.2.1.4. Measures of position 4.2.2. Inferential statistics 4.2.2.1. Hypothesis testing 4.2.2.1.1. Parametric hypothesis tests One sample parametric tests Two samples parametric tests 4.2.2.1.2. Non-parametric hypothesis tests One sample non-parametric tests Two paired samples non-parametric tests Two independent samples non-parametric tests 4.2.2.2. Estimation of parameter values 4.3. Probability theory 4.3.1. Random variables 4.3.2. Independence 4.3.3. Sample space 4.3.4. Odds and risks 4.3.5. Expected values 4.3.6. Standard errors 4.3.7. Monte Carlo simulation 4.3.8. Four perspectives on probability 4.3.8.1. The classical approach 4.3.8.2. The empirical approach 4.3.8.3. The subjective approach 4.3.8.4. The axiomatic approach 4.3.9. Bayesian probability 4.3.10. Probability distribution 4.3.10.1. Discrete probability distribution 4.3.10.2. Continuous probability distribution

4.4. Conclusion References Points to remember Exercises 5. Machine Learning for Data Science Structure Objectives 5.1. Overview of machine learning 5.2. Supervised machine learning 5.2.1. Regression methods 5.2.1.1. Linear regression 5.2.1.2. Polynomial regression 5.2.1.3. Logistic regression 5.2.2. Classification methods 5.2.2.1. KNN classification 5.2.2.2. Support Vector Machine (SVM) Classification Lagrange Multiplier Support Vector Machine (SVM) 5.2.2.3. Decision tree classification 5.2.2.4. Random Forest classification 5.2.2.5. Naive Bayes classification 5.3. Unsupervised machine learning 5.3.1. Clustering methods 5.3.1.1. Fuzzy c-means clustering 5.3.1.2. Principal Component Analysis (PCA) clustering Covariance Correlation Eigenvalue & Eigenvector Principal Component Analysis Correspondence Analysis (CA/DCA) Chi-square test Scaling of coordinates 5.3.1.3. Singular Value Decomposition (SVD) clustering 5.3.1.4. Agglomerative hierarchical clustering

Ward’s Linkage Python implementation of Agglomerative Clustering 5.3.2. Association Analysis 5.3.2.1. Apriori algorithm 5.3.2.2. FP-Growth Analysis 5.3.3. Hidden Markov Model 5.3.3.1. Markov chain 5.3.3.2. Transition Matrix 5.3.3.3. Trajectory Probability Likelihood Computation (Forward Algorithm) Decoding (Viterbi Algorithm) HMM Training (Forward-Backward Algorithm) 5.4. Reinforcement learning 5.4.1. Reinforcement function 5.4.2. Delayed reward 5.4.3. Parallel execution of reinforcement agents 5.4.4. Value function 5.4.5. Approximating value function 5.4.6. Value iteration 5.4.7. Residual gradient algorithms 5.4.8. Q-Learning 5.4.9. Implementation of RL using Python and GYM framework Reinforcement learning into the problem 5.5. Conclusion References Points To Remember Exercises 6. Time-Series Analysis Structure Objectives 6.1. Overview of time-series analysis 6.2. Components of time-series 6.3. Time-series forecasting models 6.3.1. Time-series forecasting using stochastic models 6.3.1.1. The Autoregressive Moving Average model

6.3.1.2. The Autoregressive Integrated Moving Average model 6.3.1.3. The Seasonal Autoregressive Integrated Moving Average model 6.3.2. Time-series forecasting using Support Vector Machines Based models 6.3.2.1. Overview of SVM 6.3.2.2. Forecasting using SVR 6.3.3. Time-series forecasting using Artificial Neural Network 6.3.3.1. Overview of ANN 6.3.3.2. Multilayer perceptron 6.3.3.3. Python code for implementing MLP Regressor 6.4. Conclusion References Points To Remember Exercises 7. Deep Learning for Data Science Structure Objectives 7.1. Introduction to TensorFlow 7.2. Pytorch 7.3. Deep learning primitives 7.4. Convolutional Neural Network (CNN) 7.4.1. Softmax 7.4.2. ReLU 7.4.3. The sigmoid or logistic activation function 7.4.4. What is convolution and why convolution? Why is it necessary to perform convolution? 7.4.5. Role of adding bias term to the convolution operation 7.4.6. Pooling 7.4.7. CNN block architecture 7.5. TensorFlow and CNN 7.6. CNN and data analysis 7.7. AutoEncoder 7.7.1. Convolutional Autoencoder (CAE) Convolution encode TensorFlow and autoencoder

7.7.2. Sparse autoencoder Regularization (why and what?) KL-Divergence (Kullback-Leibler Divergence) Sparsity constraint Sparse autoencoder in Pytorch Denoising autoencoder 7.8. Conclusion References Points To Remember Exercises 8. Social Media Analytics Structure Objectives 8.1. Overview of social media analytics 8.1.1. Data capturing 8.1.2. Data understanding 8.1.3. Data presentation 8.2. Seven layers of social media analytics 8.3. Social media analytics cycle 8.4. Key social media analytics methods 8.4.1. Social network analysis 8.4.1.1. Link prediction 8.4.1.2. Community detection 8.4.1.3. Influence maximization 8.4.1.4. Expert finding 8.4.1.5. Prediction of trust and distrust among individuals 8.4.2. Text analytics/mining 8.4.2.1. Text categorization 8.4.2.2. Document or text summarization 8.4.2.3. Sentiment analysis 8.4.3. Trend analytics 8.5. Accessing social media data 8.6. Challenges to social media analytics 8.7. Conclusion References

Points to remember Exercises 9. Business Analytics Structure Objectives 9.1. An overview of business analytics 9.2. The business analytics lifecycle 9.2.1. Define the business needs 9.2.2. Explore the data 9.2.3. Analyze the data 9.2.4. Predict future business-related outcomes 9.2.5. Optimize business solutions 9.2.6. Make decisions and measure the outcomes 9.2.7. Update business strategies based on generated results 9.3. Basic tools used in business analytics 9.4. Main applications in business analytics 9.4.1. Financial Analytics 9.4.2. Market analytics 9.4.3. Operational analytics 9.4.4. Customer analytics 9.4.4.1. Customer segmentation analytics 9.4.4.2. Customer lifetime value analytics 9.4.5. Employee analytics 9.4.5.1. Employee churn analytics 9.4.5.2. Employee Turnover Analytics 9.5. Challenges faced in business analytics 9.6. Conclusion References Points to Remember Exercises 10. Big Data Analytics Structure Objectives 10.1. An overview of Big Data

10.2. Hadoop 10.3. HDFS (Hadoop Distributed File System) Installing Hadoop in a single node cluster in the Ubuntu system 10.4. Interacting with HDFS 10.5. Interacting with HDFS from Python applications 10.5.1. Snakebite Creating a directory Python CLI client MapReduce and Python 10.5.2. Pig Working with the interactive mode of Pig Pig functions Pig and Python Python UDF 10.5.3. Spark Resilient Distributed Dataset (RDD) RDD operations 10.6. Conclusion References Points to remember Exercises

CHAPTER 1 Fundamentals of Data Science

“The goal is to turn data into information, and information into insight” - Carly Fiorina

Data, in today’s technology-driven world, is vital in decision making. The rate at which data is generated each day is tremendous. Companies increasingly use data to understand their customers better. Data science and data analytics yield meaningful insights that help companies identify possible areas of growth, streamline costs, find better product opportunities, and make effective decisions. Data analysis can have an impact in every sector, be it healthcare, medicine, the stock market, or academic institutions. Data will very likely keep growing for the next few decades, and IT jobs are steadily expanding to deal with the bulk of Big Data, whose analysis has been recognized as the need of the hour. This chapter discusses in detail data science, one of the most in-demand careers of the 21st century. The world of data science ranges from simple tasks, such as estimating the sales of products in the coming year and viewing product trends in the market, to complex tasks, such as predicting disease with a complex neural network model or classifying and recommending products based on fuzzy logic theory. 
Josh Wills, the Director of Data Engineering at Slack, has defined a data scientist as a person who is better at statistics than any software engineer and better at software engineering than any statistician. Thus, a data scientist plays a pivotal role in data analysis, a highly sought-after area of study that is being explored at a rapidly growing pace to gain hidden insights for better decision making.

Structure

The next few sections in this chapter will discuss the following topics:

Introduction to data science
Why learn data science?
Data analytics lifecycle
Types of data analysis
Types of jobs in data analytics
Data science tools
Fundamental areas of study in data science
Role of SQL in data science
Pros and cons of data science
Conclusion
References
Points to remember
Exercises

Objectives

After studying this chapter, you should be able to:
Understand the concept of and need for data science.
Discuss the various phases in the data analytics lifecycle.
Learn the various types of data analytics and the important tools applied in data science.
Analyze the fundamental areas of study in data science.

1.1. Introduction to data science

Data science is the task of scrutinizing and processing raw data to reach a meaningful conclusion. Data is mined and classified to detect and study behavioral data and patterns, and the techniques used for this may vary according to the requirements. All data that is available for analysis can be classified into four types: nominal data, ordinal data, interval data, and ratio data. A useful acronym for these four types is NOIR (Nominal Ordinal Interval Ratio), which means black in French. A detailed description of each of these types of data is provided in Chapter 2: Data Preprocessing.

For data collection, there are two major sources of data: primary and secondary. Primary data is data that has never been collected before and can be gathered in a variety of ways, such as participatory or non-participatory observation, conducting interviews, and collecting data through questionnaires or schedules. Secondary data, on the other hand, is data that has already been gathered and can be accessed and used easily by other users. Secondary data can come from existing case studies, government reports, newspapers, journals, books, and also from many popular dedicated websites that provide several datasets.
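Returning to the four NOIR data types introduced at the start of this section, the following is a minimal sketch of how the measurement scale can limit which summaries are meaningful. It uses only the Python standard library; all variable names and sample values are hypothetical and chosen purely for illustration, and Chapter 2 remains the reference for the full description of each type.

```python
from statistics import mean, mode

# Nominal: labels with no order, so only counting is meaningful (e.g. the mode).
blood_group = ["A", "O", "B", "O", "AB"]           # hypothetical sample
most_common_group = mode(blood_group)

# Ordinal: categories with an order, so a median category is meaningful.
levels = ["low", "medium", "high"]                  # assumed ordering
satisfaction = ["low", "high", "medium", "high", "medium"]
ranks = sorted(levels.index(s) for s in satisfaction)
median_satisfaction = levels[ranks[len(ranks) // 2]]

# Interval: differences are meaningful, but there is no true zero
# (20 degrees Celsius is not "twice as hot" as 10 degrees Celsius).
temperature_c = [21.5, 19.0, 23.5, 22.0]
mean_temperature = mean(temperature_c)
temperature_range = max(temperature_c) - min(temperature_c)

# Ratio: a true zero exists, so ratios are meaningful as well.
weight_kg = [61.0, 72.5, 80.0, 55.5]
heaviest_to_lightest = max(weight_kg) / min(weight_kg)

print(most_common_group, median_satisfaction)
print(mean_temperature, temperature_range, round(heaviest_to_lightest, 2))
```

The sketch deliberately computes a different summary for each scale; applying a mean to the nominal list, for example, would not be meaningful even if the labels were encoded as numbers.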
A few popular websites for downloading datasets include the UCI Machine Learning Repository, the Kaggle datasets, IMDB datasets, and the Stanford Large Network Dataset Collection. Though there are clear benefits to using readily available secondary data, its authenticity and validity must be verified before use.

It is said that we are all data analysts, to varying degrees, in our everyday lives. We analyze the need for and working principle of an electronic gadget before purchasing it, or we predict the demand for a particular course over the next few years in terms of job prospects before enrolling our children in it. We do not need to be experts in analytics to do analysis. In recent years, however, major business sectors and companies have felt a strong need for complex data analysis to discover historical patterns and improve future business performance.

1.2. Why learn data science?

There has been a revolutionary change in the behavioral pattern of customers in online purchases, stock market investment, advertising products to other customers, and so on. Each of these activities requires an in-depth analysis of existing relevant data, which makes data science a promising field of study in today’s fast-growing data-driven world.

A few of the industry verticals where data science has gained prominence and is used for operational and strategic decision making are discussed below:
Ecommerce: Ecommerce sites rely heavily on data science to maximize revenue and profitability. These sites analyze the shopping and purchasing behavior of customers and accordingly recommend products to encourage more online purchases.
Finance: The finance market is an emerging field in the data industry. The financial analytics market takes care of risk analysis, fraud detection, shareholders’ upcoming share status, working capital management, and so on.
Retail: Retail industries maintain a 360-degree view of customers and their feedback reviews. The retail analytics market analyzes customers’ purchasing trends and demands in order to stock products that customers like. Retail industries use data science for optimal pricing, personalized offers, better marketing strategies, market basket analysis, stock management, and so on.
Healthcare: The healthcare sector now relies heavily on analytics of patient data to predict diseases and health issues. Healthcare analytics supports data-driven quality of care, improved patient care, classification of patients’ symptoms, prediction of health deficiencies, and so on.
Education: The sources of data in education are vast, including student-centric data, enrollment in various courses, scholarship and fee details, examination results, and so on. Education analytics plays a major role in academic institutions for better admission outcomes, helping students achieve successful examination results, and all-round student performance.
Human Resource (HR): HR analytics involves HR-related data that can be used for building strong leadership, employee acquisition, employee retention, workforce optimization, and performance management.
Sports: Nowadays, sports analytics is often used in international tournaments to analyze the performance of players, predict scores, prevent injuries, and estimate the likelihood of a particular team winning or losing a match.

Data science is now used in every prominent domain, a few of which have been addressed above. Other sectors worth mentioning are telecom, sales, supply chain management, risk monitoring, manufacturing, and IT companies. Amid growing competition, businesses no longer treat data science as optional; they hire data analysts and data scientists to work through massive data, uncover what is hidden in it, and generate reports that support profit-making decisions. Recent trends in the job market also show that data analysts, data scientists, and data engineers are in high demand in IT companies, and this demand is expected to continue for the next decade. Hence, a career as a data analyst, data scientist, or data engineer can uplift your job profile.

1.3. Data analytics lifecycle

While the terms data science and data analytics are often used interchangeably, the two differ in scope. Data science is an umbrella term that covers a large variety of fields, whereas data analytics is more focused and can be considered a subset of data science. Hence, to understand data science thoroughly, let us first understand the various phases in the data analytics lifecycle. Data analytics involves six main phases that are carried out in a cycle: data discovery, data preparation, model planning, model building, communication of results, and operationalization.
Figure 1.1 illustrates the six phases of the data analytics lifecycle, which are followed one after another to complete one cycle. Note that movement between these phases can be both forward and backward, and the phases are iterative. The lifecycle provides a framework for carrying out each phase well, from the creation of the project until its completion. This framework was built by a large team of data scientists with much care and experimentation. The key stakeholders in data science projects are business analysts, data engineers, database administrators, project managers, executive project sponsors, and data scientists.
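The six phases, and the forward and backward movement between them described above, can be sketched as a small ordered enumeration. This is only an illustrative model under stated assumptions: the names `Phase`, `next_phase`, and `previous_phase` are hypothetical, and wrapping from the last phase back to the first is an assumption made here to represent the lifecycle being iterative.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six lifecycle phases, numbered in the order given in the text."""
    DATA_DISCOVERY = 1
    DATA_PREPARATION = 2
    MODEL_PLANNING = 3
    MODEL_BUILDING = 4
    COMMUNICATE_RESULTS = 5
    OPERATIONALIZATION = 6


def next_phase(current: Phase) -> Phase:
    """Move forward one phase; after the last phase a new cycle starts (assumption)."""
    return Phase(current % len(Phase) + 1)


def previous_phase(current: Phase) -> Phase:
    """Step back one phase, e.g. to redo preparation; the first phase stays put."""
    return Phase(max(current - 1, 1))


# Walk forward three phases, then step back once to revisit model planning.
walk = [Phase.DATA_DISCOVERY]
for _ in range(3):
    walk.append(next_phase(walk[-1]))
walk.append(previous_phase(walk[-1]))

print([phase.name for phase in walk])
```

An integer-backed enumeration is used so that the ordering of phases is explicit; a real project would track far more state per phase than this sketch does.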

Figure 1.1: The Data Analytics Life Cycle

Let us now briefly discuss the six phases of the data analytics lifecycle followed in any data science project:

1.3.1. Data discovery

In this first phase of data analytics, the stakeholders perform the following tasks: examine business trends, study cases of similar data analytics projects, and study the domain of the business industry. The entire team assesses the in-house resources, the in-house infrastructure, the total time involved, and the technology requirements. Once all these assessments and evaluations are completed, the stakeholders start formulating the initial hypotheses for resolving the business challenges in terms of the current market scenario.

1.3.2. Data preparation

In the second phase, data is prepared by transforming it from its legacy-system form into a form suitable for analytics, using a sandbox platform. A sandbox is a scalable platform commonly used by data scientists for data preprocessing. It offers substantial CPU power, high-capacity storage, and high I/O capacity. The IBM Netezza 1000 is one such data sandbox platform used by IBM for handling data marts. The stakeholders in this phase are mostly involved in preprocessing data for preliminary results using a standard sandbox platform.

1.3.3. Model planning

The third phase of the lifecycle is model planning, where the data analytics team plans the methods to be adopted and the workflow to be followed during the next phase of model building. At this stage, the division of work is decided so that the workload of each team member is clearly defined. The data prepared in the previous phase is further explored to understand the various features and their relationships, and feature selection is performed before applying the data to the model.

1.3.4. Model building

The next phase of the lifecycle is model building, in which the team develops datasets for training and testing as well as for production purposes. The model is also executed, based on the planning made in the previous phase. The kind of environment needed to execute the model is decided and prepared, so that a more robust environment can be provisioned if required.

1.3.5. Communicate results

Phase five of the lifecycle checks the results of the project to find whether it is a success or a failure. The results are scrutinized by the entire team along with its stakeholders to draw inferences on the key findings and summarize the entire work done. The business value is also quantified, and an elaborate narrative on the key findings is prepared and discussed among the various stakeholders.

1.3.6. Operationalization

In phase six, a final report is prepared by the team along with the briefings, source code, and related documents. The last phase also involves running a pilot project to implement the model and test it in a real-time environment. As data analytics helps build models that lead to better

decision making, it, in turn, adds values to individuals, customers, business sectors and other organizations. While proceeding through theses six phases, the various stakeholders that can be involved in the planning, implementation, and decision-making are data analysts, business intelligence analysts, database administrators, data engineers, executive project sponsors, project managers, and data scientists. All these stakeholders are rigorously involved in the proper planning and completion of the project, keeping in note the various crucial factors to be considered for the success of the project. 1.4. Types of data analysis There are many different ways to analyze data. Some forms are more complex than others based on which data analysis has been broadly divided into four types, namely descriptive analysis, diagnostic analysis, predictive analysis, and prescriptive analysis. Figure 1.2 demonstrates the level of complexity of each of these four types of data analysis. Figure 1.2: Four types of data analysis based on the level of complexity Let us briefly discuss each of the four types of data analysis and find how each of these types differs from one another: 1.4.1. Descriptive analysis Descriptive analysis is the simplest and the most common type of data analysis used by companies and other sectors. This type of data analysis is mostly used in businesses to generate monthly revenue reports, sales leads, and key performance indicators (KPI) dashboards. It describes the main aspects of the data being analyzed. The data dealt with are large in volume and often include the entire population. The results or reports generated are based on data that are already available. The main emphasis in the descriptive analysis is given on ‘ what has happened?’ by analyzing valuable information found from the available past data. 
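As a minimal sketch of what such a descriptive summary might look like, the following Python snippet computes a few summary statistics per player. The player names and scores are hypothetical values invented for illustration, and the choice of statistics (total, mean, median, best) is an assumption rather than a prescribed report format.

```python
from statistics import mean, median

# Hypothetical runs scored per match; these are NOT real player records.
scores = {
    "Player A": [45, 72, 10, 88, 31],
    "Player B": [5, 60, 60, 22, 48],
}

def describe(values):
    """Summarize what has already happened in a list of past scores."""
    return {
        "matches": len(values),
        "total": sum(values),
        "mean": mean(values),
        "median": median(values),
        "best": max(values),
    }

# One descriptive summary per player, built only from data already available.
summary = {name: describe(runs) for name, runs in scores.items()}

for name, stats in summary.items():
    print(name, stats)
```

The summary only restates past data in a compact form; it makes no attempt to explain or forecast anything.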
For example, with descriptive analysis, a data analyst will be able to generate the statistical results of the performance of the cricket players of team India. For generating such results, the data may need to be integrated from multiple data sources to gain meaningful insights through statistical analysis.

1.4.2. Diagnostic analysis

Diagnostic analysis differs from descriptive analysis in that it emphasizes not only ‘what has happened?’ but also ‘why did it happen?’ This type of data analysis tries to gain a deeper understanding of the reasons behind the patterns found in past data. Here, business intelligence comes into play by digging down to find the root cause of the pattern or nature of the data obtained. For example, with diagnostic analysis, a data analyst will be able to find why the performance of each player of the Indian cricket team has risen (or degraded) over the past six months.

Diagnostic analysis deals with the critical aspect of finding the reason behind a particular change in a phenomenon. This is undoubtedly a major task in the field of data analysis, as an analyst has to be critical and accurate enough to find the reason behind a particular occurrence in order to make a gain or profit in various fields. For this purpose, an analyst often combines machine learning techniques with business intelligence for a deeper understanding of a given problem.

1.4.3. Predictive analysis

Predictive analysis, as the name suggests, deals with the prediction of the future based on the available current and past data. The main emphasis in predictive analysis is on ‘what is likely to happen?’, answered by utilizing previous data to estimate the future outcome. For example, with predictive analysis, a data analyst will be able to predict the performance of each player of the Indian cricket team for the upcoming international cricket world cup. Such predictions can help the Board of Control for Cricket in India (BCCI) decide on the selection of players for the upcoming international cricket tournament.

Predictive analysis is applied in many domains such as risk management, sales forecasting, weather forecasting, and prediction of the performance of each team. Though descriptive and diagnostic analyses are more common, data analysts are also widely hired by companies to predict future trends in businesses and other marketing sectors. In most cases, prediction is made by dividing the available dataset into a training set and a testing set, and a machine learning algorithm is applied to check the accuracy of prediction. If the accuracy is found to be at a satisfactory level, the algorithm is then used to predict future data. However, it is important to remember that the predicted solution is an approximate forecast and may differ from the actual result, as one hundred percent accuracy is not guaranteed.

1.4.4. Prescriptive analysis

The final type of data analysis, which is the highest in terms of complexity, is called prescriptive analysis. In this type of data analysis, the insights gained from the other three types of data analyses are combined to determine the kind of action to be taken in a given situation. Prescriptive analysis prescribes what steps need to be taken to avoid a future problem. It involves a high degree of responsibility, time, and complexity to reach informed decision-making. Thus, prescriptive analysis makes recommendations based on the forecasting done in predictive analysis.

To summarize the four main types of data analytics, it should be remembered that descriptive analysis mainly explains what has happened to date, diagnostic analysis emphasizes finding why it has happened in a particular way, predictive analysis makes a forecast of what might happen in the near future, while prescriptive analysis emphasizes recommending actions based on the forecast.
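The train-and-test procedure described under predictive analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset is synthetic, the helper names (`train_test_split`, `fit_threshold`, `accuracy`) are hypothetical, and the "model" is a deliberately simple threshold rule standing in for whatever machine learning algorithm a real project would use.

```python
import random

# Synthetic, hypothetical data: (feature value, label), where the label is 1
# when the feature is at least 30 and 0 otherwise. Not real measurements.
rows = [(x, 1 if x >= 30 else 0) for x in range(10, 50)]

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a fraction of the rows for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_rows):
    """Learn a cut-off: the midpoint between the mean feature of each class."""
    pos = [x for x, y in train_rows if y == 1]
    neg = [x for x, y in train_rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, eval_rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for x, y in eval_rows if (1 if x >= threshold else 0) == y)
    return correct / len(eval_rows)

train, test = train_test_split(rows)
threshold = fit_threshold(train)          # learned from the training set only
test_accuracy = accuracy(threshold, test) # checked on rows the model never saw

print(f"threshold={threshold:.1f}, test accuracy={test_accuracy:.0%}")
```

Only if the accuracy on the held-out rows is judged satisfactory would the learned rule be applied to new data, and even then the forecast remains approximate.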
All these types of analyses are usually carried out by a data analyst or data scientist to deal with the given data and produce a meaningful outcome based on the type of analysis required.

1.5. Types of jobs in data analytics

The key stakeholders in any data analysis project include the data analyst, the data scientist, the data engineer, the database administrator, and the analytics manager. Each stakeholder has a clear role to play in a business problem: understanding the essentials of the problem, proper planning, implementing the project, analyzing its outcomes, resolving the bottlenecks visible in those outcomes, and generating reports that draw inferences about the success of the project. Figure 1.3 shows some of the key stakeholders involved in any data analytics-based project.

Figure 1.3: Some of the key stakeholders in Data Analytics projects

Though the team may be large and involve many other stakeholders such as analytics specialists, business intelligence consultants, chief creative officers, ETL developers, project sponsors, and many more, the most prominent contributors to the project are few and play a pivotal role in bringing success to it. The leader of any business project clearly defines the role of each stakeholder and the estimated timeline of each piece of work assigned. Let us briefly discuss six such main stakeholders involved in business analytics, namely the data analyst, the data scientist, the data engineer, the database administrator, the data architect, and the analytics manager.

1.5.1. Data analyst

The main role of a data analyst is to extract data and interpret the information obtained from it in order to analyze the outcome of a given business problem. In this process, the analyst also discovers the bottlenecks found in the results and provides possible solutions for them. Information is extracted from existing data using one or more standard methodologies such as data cleaning, data transformation, data visualization, and data modeling. Using these methodologies, a data analyst is able to make careful data-driven decisions.
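Two of the methodologies named above, data cleaning and data transformation, can be illustrated with a short Python sketch. The records, field names, and helper functions (`clean`, `transform`) are hypothetical and chosen only for illustration; real projects typically rely on dedicated data-handling libraries and far richer cleaning rules.

```python
# Hypothetical raw records as they might arrive from a source system.
raw_records = [
    {"player": " A ", "runs": "45", "balls": "30"},  # stray whitespace
    {"player": "B", "runs": "", "balls": "12"},      # missing value
    {"player": "C", "runs": "72", "balls": "60"},
    {"player": "A", "runs": "45", "balls": "30"},    # duplicate once trimmed
]

def clean(records):
    """Trim text, drop rows with missing fields, and remove exact duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        row = {key: value.strip() for key, value in rec.items()}
        if any(value == "" for value in row.values()):
            continue  # incomplete row: skipped in this simple sketch
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # already kept an identical row
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned

def transform(records):
    """Convert numeric strings to integers and derive a strike rate."""
    result = []
    for row in records:
        runs, balls = int(row["runs"]), int(row["balls"])
        result.append({
            "player": row["player"],
            "runs": runs,
            "balls": balls,
            "strike_rate": round(100 * runs / balls, 1),  # runs per 100 balls
        })
    return result

tidy = transform(clean(raw_records))
for row in tidy:
    print(row)
```

Dropping incomplete rows is only one possible cleaning policy; depending on the problem, an analyst might instead fill in or flag missing values.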
