Author: Kedeisha Bryan & Maaike van Putten
Becoming a Data Analyst

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Early Access Publication: Becoming a Data Analyst
Early Access Production Reference: B21107

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN: 978-1-80512-641-6
www.packt.com
Table of Contents
1. Becoming a Data Analyst: A beginner’s guide to kickstarting your data analysis journey
2. 1 Understanding the Business Context of Data Analysis
   I. Join our book community on Discord
   II. A Data Analyst’s Role in the Data Analytics Lifecycle
      i. Business Understanding
      ii. Data Inspection
      iii. Data Pre-processing & Preparation
      iv. Exploratory Data Analysis
      v. Data Validation
      vi. Explanatory Data Analysis
   III. Summary
3. 2 Introduction to SQL
   I. Join our book community on Discord
   II. SQL and its use cases
      i. Brief History of SQL
      ii. SQL and data analysis
   III. Different Databases
      i. Relational vs. Non-Relational Databases
      ii. Popular DBMS’s
   IV. SQL Terminology
      i. Query
      ii. Statement
      iii. Clause
      iv. Keyword
      v. View
   V. Setting up your environment
      i. Choosing a DBMS
      ii. Installing necessary software
      iii. Creating a sample database
   VI. Writing Basic SQL Queries
      i. SELECT Statement
      ii. Structure of a Query
      iii. INSERT Statements
      iv. UPDATE Statements
      v. DELETE Statements
      vi. SQL basic rules and syntax
   VII. Filtering and organizing data with clauses
      i. WHERE Clause
      ii. ORDER BY Clause
      iii. DISTINCT Clause
      iv. LIMIT Clause
   VIII. Using operators and functions
      i. Comparison operators
      ii. Logical operators (AND, OR)
      iii. LIKE operator
      iv. Arithmetic operators
      v. Functions for calculations
      vi. Functions for text manipulation
      vii. Date functions
   IX. Summary
4. 3 Joining Tables in SQL
   I. Join our book community on Discord
   II. Table relations
      i. Implementing Relationships in SQL
      ii. SQL Joins
      iii. Best Practices for Using JOIN in SQL
   III. Summary
5. 4 Creating Business Metrics with Aggregations
   I. Join our book community on Discord
   II. Aggregations in Business Metrics
      i. Aggregations in SQL to Analyze Data
      ii. GROUP BY Clause
      iii. HAVING clause
      iv. Best Practices for Aggregations
   III. Summary
6. 5 Advanced SQL
   I. Join our book community on Discord
   II. Working with subqueries
      i. Types of subqueries
      ii. Non-code explanation of subquery
      iii. Using a basic subquery on our library database
      iv. Subquery vs joining tables
      v. Rules of subquery usage
      vi. More advanced subquery usage on our library database
   III. Common Table Expressions
      i. Use cases for CTEs
      ii. Examples with the library database
   IV. Window functions: A panoramic view of your data
      i. Example with the Library Database
   V. Understanding date time manipulation
      i. Date and time functions
      ii. Examples with the library database
   VI. Understanding text manipulation
      i. Text functions
      ii. Text functions in the library database
   VII. Best practices: bringing it all together
      i. Write readable SQL Code
      ii. Be careful with NULL values
      iii. Use subqueries and CTEs wisely
      iv. Think about performance
      v. Test your queries
   VIII. Summary
7. 6 SQL for Data Analysis Case Study
   I. Join our book community on Discord
   II. Setting up the database
   III. Performing data analysis with SQL
   IV. Exploring the data
      i. General data insights
   V. Analyzing the data
      i. Examining the clothing category
      ii. Determining the number of customers
      iii. Researching the top payment methods
      iv. Gathering customer feedback
      v. Exploring the relationship between ratings and sales
      vi. Finding the percentage of products with reviews
      vii. Effectiveness of discounts
      viii. Identifying the top customers
      ix. Top-selling clothing products
      x. Top 5 high-performing products
      xi. Most popular product by country
      xii. Researching the performance of delivery
      xiii. Future projections with linear regression
   VI. Summary
8. 7 Fundamental Statistical Concepts
   I. Join our book community on Discord
   II. Descriptive statistics
      i. Levels of measurement
      ii. Measures of central tendency
      iii. Measures of variability
   III. Inferential statistics
      i. Probability theory
      ii. Probability distributions
      iii. Correlation vs causation
   IV. Summary
9. 8 Testing Hypotheses
   I. Join our book community on Discord
   II. Technical requirements
   III. Introduction to Hypothesis Testing
      i. Role of Hypothesis Testing in Data Analysis
      ii. Null and Alternative Hypothesis
      iii. Step by Step Guide to Performing Hypothesis Testing
   IV. One Sample t-Test
   V. Conditions for Performing a One-Sample T-Test
      i. Case Study: Average Exam Scores
   VI. Two Sample t-Test
      i. Case Study: Comparing Exam Scores Between Two Schools
   VII. Chi Square Test
      i. Case Study: Effect of Tutoring on Passing Rates
   VIII. Analysis of Variance (ANOVA)
      i. Case Study: Comparing Exam Scores Among Three Schools
   IX. Summary
10. 9 Business Statistics Case Study
   I. Join our book community on Discord
   II. Technical requirements
   III. Case Study Overview
      i. Learning Objectives:
      ii. Questions:
      iii. Solutions:
   IV. Additional Topics to Explore
      i. Text Analytics
      ii. Big Data
      iii. Time Series Analysis
      iv. Predictive Analytics
      v. Prescriptive Analytics & Optimization
      vi. Database Management
   V. Where to practice
   VI. Summary
11. 10 Data analysis and programming
   I. Join our book community on Discord
   II. The role of programming and our case
   III. Different programming languages
      i. Python
      ii. R
      iii. SQL
      iv. Julia
      v. MATLAB
   IV. Working with the Command Line Interface (CLI)
      i. Command Line Interface (CLI) vs Graphical User Interface (GUI)
      ii. Accessing the CLI
      iii. Typical CLI tasks
      iv. Using the CLI for programming
   V. Setting up your system for Python programming
      i. Check if Python is installed
      ii. MacOS
      iii. Linux
      iv. Windows
      v. Browser (cloud-based)
      vi. Testing the Python setup
   VI. Python use cases for CleanAndGreen
      i. Data Cleaning and Preparation
      ii. Data Visualization
      iii. Statistical Modeling
      iv. Predictive Modeling/Machine Learning
      v. General remarks on Python
   VII. Summary
12. 11 Introduction to Python
   I. Join our book community on Discord
   II. Understanding the Python Syntax
      i. Print Statements
      ii. Comments
      iii. Variables
      iv. Operations on variables
      v. Operators and Expressions
   III. Exploring Data Types in Python
      i. Strings
      ii. Integers
      iii. Floats
      iv. Booleans
      v. Type Conversion
   IV. Indexing and Slicing in Python
   V. Unpacking Data Structures
      i. Lists
      ii. Dictionaries
      iii. Sets
      iv. Tuples
   VI. Mastering Control Flow Structures
      i. Conditional Statements in Python
      ii. Looping in Python
   VII. Functions in Python
      i. Creating Your Own Functions
      ii. Python Built-In Functions
   VIII. Summary
13. 12 Analyzing data with NumPy & Pandas
   I. Join our book community on Discord
   II. Introduction to NumPy
      i. Installing and Importing NumPy
      ii. Basic NumPy Operations
   III. Statistical and Mathematical Operations
      i. Mathematical Operations with NumPy Arrays
   IV. Multi-dimensional Arrays
      i. Creating Multi-dimensional Arrays
      ii. Accessing elements in Multi-dimensional Arrays
      iii. Reading Data from a CSV File
   V. Introduction to Pandas
      i. Series and DataFrame
      ii. Loading Data with Pandas
      iii. Data Analysis with Pandas
      iv. Data Analysis
   VI. Summary
14. 13 Introduction to Exploratory Data Analysis
   I. Join our book community on Discord
   II. The Importance of EDA
      i. The EDA Process
      ii. Tools and Techniques
   III. Univariate Analysis
      i. Analyzing Continuous Variables
      ii. Analyzing Categorical Variables
   IV. Bivariate Analysis
      i. Understanding bivariate analysis
      ii. Correlation vs Causation
      iii. Visualizing relationships between two continuous variables
   V. Multivariate analysis
      i. Heatmaps
      ii. Pair plots
   VI. Summary
15. 14 Data Cleaning
   I. Join our book community on Discord
   II. Technical requirements
   III. Importance of data cleaning
      i. Impact on data quality
      ii. Relevance to business decisions
   IV. Common data cleaning challenges
      i. Inconsistent formats
      ii. Misspellings and Inaccuracies
      iii. Duplicate records
   V. Dealing with missing values
      i. Causes of missing values
      ii. Strategies for handling missing values
      iii. Types of missing data
   VI. Dealing with duplicate values
      i. Causes of duplicate data
      ii. Identification and removal
   VII. Dealing with outliers
      i. Types of outliers
      ii. Impact on analysis
      iii. Techniques for identifying and handling outliers
   VIII. Cleaning and transforming data
      i. Handling inconsistencies
      ii. Converting categorical data
      iii. Normalizing numerical features
   IX. Data validation
      i. Validation methods
   X. Summary
16. 17 Exploratory Data Analysis Case Study
   I. Join our book community on Discord
   II. Technical Requirements
   III. E-commerce Sales Optimization Case Study
      i. Time Series Analysis
      ii. Customer Segmentation
      iii. Product Analysis
      iv. Payment and Returns
      v. Case Study Answers
   IV. Summary
Becoming a Data Analyst: A beginner’s guide to kickstarting your data analysis journey

Welcome to Packt Early Access. We’re giving you an exclusive preview of this book before it goes on sale. It can take many months to write a book, but our authors have cutting-edge information to share with you today. Early Access gives you an insight into the latest developments by making chapter drafts available. The chapters may be a little rough around the edges right now, but our authors will update them over time.

You can dip in and out of this book or follow along from start to finish; Early Access is designed to be flexible. We hope you enjoy getting to know more about the process of writing a Packt book.

1. Chapter 1: Understanding the business context of data analysis
2. Chapter 2: Introduction to SQL
3. Chapter 3: Joining Tables
4. Chapter 4: Creating Business Metrics with Aggregations
5. Chapter 5: Advanced SQL
6. Chapter 6: SQL for Data Analysis Case Study
7. Chapter 7: Fundamental statistics concepts
8. Chapter 8: Testing hypotheses
9. Chapter 9: Business Statistics Case Study
10. Chapter 10: Data analysis and programming
11. Chapter 11: Introduction to Python
12. Chapter 12: Analyzing data in NumPy & Pandas
13. Chapter 13: Introduction to Exploratory Data Analysis
14. Chapter 14: Data cleaning
15. Chapter 15: Univariate Analysis
16. Chapter 16: Bivariate Analysis
17. Chapter 17: Exploratory Data Analysis Case Study
18. Chapter 18: Introduction to Data Visualization
19. Chapter 19: Choosing the right Visualization
1 Understanding the Business Context of Data Analysis
Join our book community on Discord
https://packt.link/EarlyAccessCommunity

According to a 2017 research paper by David Becker, some of the leading causes of failed data projects are failures in project management. These include incorrect business objectives, improper project scope, incorrect project structure, and poor communication.

There is only so much you can control regarding the success of a project, but it’s important to know how to manage your portion of it. In this chapter, we will cover the roles and responsibilities of a data analyst in each phase of a data project, helpful tools for project success, and the typical technical tools a data analyst can be expected to use.

A Data Analyst’s Role in the Data Analytics Lifecycle

The data analytics lifecycle is derived from the Cross-Industry Standard Process for Data Mining, also known as CRISP-DM. The major phases of CRISP-DM are business understanding, data understanding, data preparation, modeling, evaluation, and deployment. As our focus will not be on creating models, the data analytics lifecycle used here is similar, except that exploratory data analysis, data validation, and presentation are substituted for the modeling-focused phases.

Business Understanding
The first and probably most important phase is business understanding, as your work here sets the direction for the rest of the project. Mistakes here result in unnecessary work or in a solution that does not solve the intended problem. This phase has four main areas:

1. Defining business objectives: Here you define the actual goal of the data project. The basis for every project is to provide a solution that solves a problem or improves a process.

An important concept is the difference between symptoms and the root cause. Symptoms are the visible signs or effects of an issue in a system, and they trigger the investigation of an issue or the need for a solution. The root cause is the underlying issue that all the symptoms stem from. Unlike symptoms, root causes are often not visible or apparent without a thorough investigation. Addressing the symptoms only provides temporary fixes, while addressing the root cause resolves the problem more permanently.

When speaking with stakeholders, you will often find that they spend most of their time talking about the symptoms. Many times, what they say is the problem really isn’t the problem. As a data analyst, you must know how to ask the right questions to sift through the symptoms and identify the root cause and the correct business problem.

Tools for success: Five Whys and Fishbone diagrams. The Five Whys is a quick and effective technique for uncovering a root cause: you begin with a problem statement and then ask “why” five times, or as many times as needed, to arrive at the underlying root cause. Below is a diagram that visually depicts the process.
The Fishbone Diagram, as pictured below, is a visual aid to identify and organize the possible causes and effects of a problem. This is also a great method for prioritizing the different root problems to solve.
2. Gather relevant information: Once you have determined your business objective, your next task is to gather the data and any additional information regarding the project. A typical task is identifying your data sources. These will be internal sources and may include external data as well. Data sources can include company databases, reports, or third-party research, as well as additional interviews with stakeholders, web scraping, or surveys.

Tools for success: An important skill for a data analyst is the ability to navigate an organization to source all the information and data you need. This is done through interviews and emails. It’s also important to know that you will work with other people who have different roles and responsibilities within the project. A great way to keep all this information organized is a RACI matrix, pictured below. A RACI matrix helps you keep track of the responsibilities of others in your project and of whom you consulted and informed. This is a great way to ensure a high level of communication in a project.
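
As a rough illustration, a RACI matrix can be kept as a simple lookup table, where the letters stand for Responsible, Accountable, Consulted, and Informed. The sketch below is only one assumed way to record it in Python: the task names, role names, and assignments are hypothetical, and the check that each task has exactly one Accountable role reflects a common convention rather than anything stated in this chapter.

```python
# Hypothetical RACI matrix for a small dashboard project.
# Task names, role names, and letter assignments are illustrative only.
RACI_CODES = {
    "R": "Responsible",
    "A": "Accountable",
    "C": "Consulted",
    "I": "Informed",
}

raci_matrix = {
    "Define business objective": {"Analyst": "R", "Sponsor": "A", "Data engineer": "I"},
    "Identify data sources": {"Analyst": "R", "Sponsor": "A", "Data engineer": "C"},
    "Build initial dataset": {"Analyst": "R", "Sponsor": "I", "Data engineer": "R"},
}


def tasks_without_single_accountable(matrix):
    """Return tasks that do not have exactly one Accountable ("A") role.

    Assumed convention: every task has exactly one Accountable role.
    """
    return [
        task
        for task, roles in matrix.items()
        if list(roles.values()).count("A") != 1
    ]


def duties_for(matrix, role):
    """Map each task a role is involved in to the spelled-out RACI code."""
    return {
        task: RACI_CODES[roles[role]]
        for task, roles in matrix.items()
        if role in roles
    }


if __name__ == "__main__":
    print(tasks_without_single_accountable(raci_matrix))
    print(duties_for(raci_matrix, "Data engineer"))
```

In this made-up example the check flags "Build initial dataset", which has two Responsible roles and no Accountable one. That is the kind of gap you would raise with stakeholders before the work starts.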
3. Determine key performance indicators (KPIs) and metrics: KPIs and metrics are used to measure and evaluate a business process. A metric is a measure used to track and monitor a business process. Examples include website traffic and profit margin. A KPI is a measurable value that is tied to business goals and strategy. Examples include net promoter score, conversion rate, and customer lifetime value. While the two are very similar in nature, KPIs are more closely aligned with business goals, while metrics are used to track any business activity.

When speaking with stakeholders, you should also define the critical success factors: the essential activities that must go well for the objective to be achieved. Based on those success factors, you will develop your KPIs and metrics. These measures will be part of the essential data used for business decisions.

4. Clarify scope of work: A project can go in many directions, as there can be many problems to investigate. To avoid scope creep, where a project undergoes uncontrolled changes over time, you want to establish clear goals and priorities.

Tools for success: A Scope of Work (SOW) or project charter is an excellent tool for ensuring a clear scope for your project. It is a document that summarizes all the information gathered in the previous steps, including the project overview, your tasks, expected results, timeline, and expected deliverables. The expected deliverable is one of the most important elements, as it defines the work you will present at the end of the project. Deliverables can include a dashboard, an automated process to maintain the dashboard, a simple report, or a PowerPoint presentation.

Data Inspection

Once the business objective has been established, a data analyst sets out to understand the data. This phase introduces more technical work, beginning with the initial data collection. There are three major areas:

1. Collect initial data: You have already identified your data sources. Here you explore your company’s databases and reports, or extract external data through web scraping, to build your initial dataset.
2. Determine data availability: Identify how often the data is gathered or updated, and how you will be able to access it for future use.
3. Explore data and characteristics: Take a first look to identify important variables, data types, and the format of your data. You will also determine whether you need to gather more data or enrich the data, and identify the initial data that will be used for analysis.

Tools for success: To collect the initial data, a data analyst typically uses SQL to explore a database. If data needs to be scraped from the web or even from a PDF file, programming languages such as Python or R can be used.

Data Pre-processing & Preparation

Most likely, your data will be messy. In this phase, you correct, format, and transform your data for analysis. Here we provide a brief overview, as more detail will be presented in Chapter 14. The major elements of this phase can include:

Data validation: You want to ensure that your data follows the business logic or rules. You can investigate this by exploring the ranges or formatting of your variables. Mistakes can occur, especially when data is entered manually.
In essence, you verify whether your data makes sense. For example, a customer table containing an age of 150 indicates an error.

Missing value treatment: Missing data is a common issue in data cleaning. There are multiple methods for handling it, including deletion, ignoring the missing data, assigning a missing-data indicator column, and imputation. More details on these methods are provided later in the book.

Removing duplicates: Duplicates lead to misleading numbers and incorrect conclusions.

Outlier treatment: Here you identify outliers and determine how you will treat them. An outlier is a data point that differs significantly from most of the other data points. Outliers usually result from natural variation in a process or from errors. Treatment methods include ignoring them, imputation, deletion, or transformation.

Data normalization: When data moves through different tools and phases of the data pipeline, data types may be converted unintentionally. Here you fix the formatting, convert units of measurement, or standardize categorical data.

Feature engineering: Here you transform variables to better represent the data. This can involve binning, aggregation, or combining variables.

Exploratory Data Analysis

After successfully cleaning the data, it is time to explore it. Here we provide a brief overview, as more detail will be presented in Chapters 7, 13, 15, 16, and 17. In this phase, a data analyst conducts univariate, bivariate, and multivariate analyses. These analyses identify patterns, trends, and other information to uncover insights that support the business objective.

Tools for success: For EDA, a data analyst can use tools such as SQL, Excel, or business intelligence tools such as Tableau, Power BI, or Looker. Knowledge of descriptive statistics is necessary to understand how to summarize and measure your data. To provide more advanced analysis for decision making, hypothesis testing is helpful for building and testing assumptions.
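
To make the pre-processing steps and the descriptive summary described above more concrete, here is a minimal sketch that uses only Python’s standard library. Everything in it is an assumption made for illustration: the customer records, the field names, the plausible age range of 0 to 120, the choice of median imputation, and the 1.5 × IQR rule for flagging outliers (computed with the `inclusive` quantile method). It is not the book’s own method, which is covered in later chapters.

```python
from statistics import mean, median, quantiles

# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": 150, "spend": 95.0},   # breaks the age rule
    {"id": 3, "age": 41, "spend": None},    # missing value
    {"id": 3, "age": 41, "spend": None},    # exact duplicate of the row above
    {"id": 4, "age": 29, "spend": 110.0},
    {"id": 5, "age": 52, "spend": 105.0},
    {"id": 6, "age": 38, "spend": 2500.0},  # candidate outlier
]


def is_valid_age(age, low=0, high=120):
    """Business rule (assumed bounds): age must fall within a plausible range."""
    return age is not None and low <= age <= high


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows, field):
    """Replace missing values in `field` with the median of the known values."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


valid = [r for r in customers if is_valid_age(r["age"])]  # data validation
unique = drop_duplicates(valid)                           # removing duplicates
filled = impute_median(unique, "spend")                   # missing value treatment
spend = [r["spend"] for r in filled]
outliers = iqr_outliers(spend)                            # outlier identification
typical_spend = [v for v in spend if v not in outliers]

print("rows kept:", len(filled))
print("outliers:", outliers)
print("mean and median spend without outliers:", mean(typical_spend), median(typical_spend))
```

The sketch removes duplicates before imputing so that a repeated record does not influence the fill value. Whether a flagged outlier should be deleted, imputed, transformed, or kept remains a judgment call that depends on the business context.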
Hypothesis testing can also enhance the credibility of the findings that support your recommendations.

Data Validation