
Author: Andrew Oleksy

A Step-by-Step Guide with Visual Illustrations and Examples. The Data Science field is expected to continue growing rapidly over the next several years, and Data Scientist is consistently rated as a top career. Data Science with R gives you the necessary theoretical background to start your Data Science journey and shows you how to apply the R programming language through practical examples in order to extract valuable knowledge from data. Professor Andrew Oleksy guides you through all the important concepts of data science, including the R programming language, Data Mining, Clustering, Classification and Prediction, the Hadoop framework and more.

Publisher: Andrew Oleksy
Publish Year: 2018
Language: English
Pages: 201
File Format: PDF
File Size: 7.8 MB
Text Preview (First 20 pages)

DATA SCIENCE WITH R: BY ANDREW OLEKSY

Copyright © 2018 by Andrew Oleksy. All Rights Reserved. No part of this publication may be reproduced, distributed or transmitted in any form or by any means, including photocopying, recording or other electronic or mechanical methods or by any information storage or retrieval system without the prior written permission of the publisher, except in the case of very brief quotations embodied in critical reviews and certain other non-commercial uses permitted by copyright law.
TABLE OF CONTENTS

Chapter 1: Introduction to Data Mining
  Summary
  Prerequisite Knowledge
  Introduction to Data Mining
  1.1 Data Science
  1.2 Knowledge Discovery in Databases (KDD)
    1.2.1 Data Collection
    1.2.2 Preprocessing
    1.2.3 Transformation
    1.2.4 Data Mining
    1.2.5 Interpretation and Evaluation
  1.3 Model Types
  1.4 Examples and Counterexamples
  1.5 Classification of Data Mining Methods
    1.5.1 Classification
    1.5.2 Regression
    1.5.3 Clustering
    1.5.4 Extraction and Association Analysis
    1.5.5 Visualization
    1.5.6 Anomaly Detection
  1.6 Applications
    1.6.1 Medicine
    1.6.2 Finance
    1.6.3 Telecommunications
  1.7 Challenges
  1.8 The R Programming Language
  1.9 Basic Concepts, Definitions and Notations
  1.10 Tool Installation

Chapter 2: Introduction to R
  Summary
  Prerequisite Knowledge
  Introduction to R
  2.1 Data Types
    2.1.1 Definition and Object Classes
    2.1.2 Vectors and Lists
    2.1.3 Matrix
    2.1.4 Factors and Nominal Data
    2.1.5 Missing Values
    2.1.6 Data Frames
  2.2 Basic Tasks
    2.2.1 Reading Data from File
    2.2.2 Sequence Creation
    2.2.3 Reference to Subsets
    2.2.4 Vectorization
  2.3 Control Structures
    2.3.1 Conditional Statement: if-else
    2.3.2 Loops: for, repeat and while
    2.3.3 Next and break Statements
  2.4 Functions
  2.5 Scoping Rules
  2.6 Iterated Functions
    2.6.1 lapply
    2.6.2 sapply
    2.6.3 split
    2.6.4 tapply
  2.7 Help from the Console and Package Installation

Chapter 3: Types, Quality and Data Preprocessing
  Summary
  Prerequisite Knowledge
  Types, Quality and Data Preprocessing
  3.1 Categories and Types of Variables
  3.2 Preprocessing Processes
    3.2.1 Data Cleansing
      3.2.1.1 Missing Values
      3.2.1.2 Data with Noise
        Example – Data Smoothing Using Binning Methods
      3.2.1.3 Inconsistent Data
    3.2.2 Data Unification
    3.2.3 Data Transformation and Discretization
      3.2.3.1 Data Transformation
        Example – Data Regularization
      3.2.3.2 Data Discretization
        Example – Entropy-Based Discretization
    3.2.4 Data Reduction
      3.2.4.1 Dimension Reduction
      3.2.4.2 Data Compression
  3.3 dplyr and tidyr Packages
    3.3.1 dplyr
    3.3.2 tidyr

Chapter 4: Summary Statistics and Visualization
  Summary
  Prerequisite Knowledge
  Summary Statistics and Visualization
  4.1 Measures of Position
    4.1.1 Mean Value
    4.1.2 Median
  4.2 Measures of Dispersion
    4.2.1 Minimum Value, Maximum Value, Range
    4.2.2 Percentile Values
    4.2.3 Interquartile Range
    4.2.4 Variance
    4.2.5 Standard Deviation
    4.2.6 Coefficient of Variation
  4.3 Visualization of Qualitative Data
    4.3.1 Frequency Table
    4.3.2 Bar Charts
    4.3.3 Pie Chart
    4.3.4 Contingency Matrix
    4.3.5 Stacked Bar Charts and Grouped Bar Charts
  4.4 Visualization of Quantitative Data
    4.4.1 Frequency Table
    4.4.2 Histograms
    4.4.3 Frequency Polygon
    4.4.4 Boxplot

Chapter 5: Classification and Prediction
  Summary
  Prerequisite Knowledge
  5.1 Classification
    5.1.2 Decision Trees
      5.1.2.1 Description
      5.1.2.2 Decision Tree Creation – ID3 Algorithm
      5.1.2.3 Decision Tree Creation – Gini Index
  5.2 Prediction
    5.2.1 Difference between Classification and Prediction
    5.2.2 Linear Regression
      5.2.2.1 Description, Definitions and Notations
      5.2.2.2 Cost Function
      5.2.2.3 Gradient Descent Algorithm
      5.2.2.4 Gradient Descent in Linear Regression
      5.2.2.5 Learning Parameter
  5.3 Overfitting and Regularization
    5.3.1 Overfitting
    5.3.2 Model Regularization
    5.3.3 Linear Regression with Normalization

Chapter 6: Clustering
  Summary
  Prerequisite Knowledge
  Clustering
  6.1 Unsupervised Learning
  6.2 Concept of Cluster
  6.3 k-means Algorithm
    6.3.1 Algorithm Description
    6.3.2 Random Centroids Initialization
    6.3.3 Choosing the Number of Clusters
    6.3.4 Applying k-means in R
  6.4 Hierarchical Clustering Algorithms
    6.4.1 Distance Measurements Between Clusters
    6.4.2 Agglomerative Algorithms
    6.4.3 Divisive Algorithms
    6.4.4 Applying Hierarchical Clustering in R
  6.5 DBSCAN Algorithm
    6.5.1 Basic Concepts
    6.5.2 Algorithm Description
    6.5.3 Algorithm Complexity
    6.5.4 Advantages
    6.5.5 Disadvantages

Chapter 7: Mining of Frequent Itemsets and Association Rules
  Summary
  Prerequisite Knowledge
  Mining of Frequent Itemsets and Association Rules
  7.1 Introduction
  7.2 Theoretical Background
  7.3 Apriori Algorithm
  7.4 Frequent Itemsets Types
  7.5 Positive and Negative Border of Frequent Itemsets
  7.6 Association Rules Mining
  7.7 Alternative Methods for Large Itemsets Generation
    7.7.1 Sampling Algorithm
    7.7.2 Partitioning Algorithm
  7.8 FP-Growth Algorithm
  7.9 The arules Package

Chapter 8: Computational Methods for Big Data Analysis (Hadoop and MapReduce)
  Summary
  Prerequisite Knowledge
  8.1 Introduction
  8.2 Advantages of Hadoop’s Distributed File System
  8.3 Hadoop Users
  8.4 Hadoop Architecture
    8.4.1 Hadoop Distributed File System (HDFS)
    8.4.2 HDFS Architecture
    8.4.3 HDFS – Low Performance Areas
      8.4.3.1 Low Data Access Time
      8.4.3.2 Multiple Small Files
      8.4.3.3 Multiple Data Recording Nodes, Arbitrary File Modifications
    8.4.4 Basic HDFS Concepts
      8.4.4.1 Blocks
      8.4.4.2 Namenodes and Datanodes
      8.4.4.3 HDFS Federation
      8.4.4.4 HDFS High Availability
    8.4.5 Data Flow – Data Reading
    8.4.6 Network Topology in Hadoop
    8.4.7 File Writing
    8.4.8 Copies Placement
    8.4.9 Consistency Model
  8.5 The Hadoop Cluster Architecture
  8.6 Hadoop Java API
  8.7 Lists, Loops, Generic Classes and Methods
    8.7.1 Generic Classes and Methods
    8.7.2 The Class Object

One Last Thing...
CHAPTER 1: INTRODUCTION TO DATA MINING

SUMMARY

The aim of this chapter is to introduce Data Mining and Knowledge Discovery in Databases. First, some basic concepts are defined, along with the reasons why this scientific field was created and has expanded rapidly. Second, we review some practical examples and counterexamples of Data Mining. Additionally, the most important fields on which Data Mining is based are presented. Finally, we will find out what the R programming language is and what its general philosophy is. We will also show how to install all the necessary tools, explain how to consult the R language manual for help, and define the language's basic types and functions, along with their functionality.

PREREQUISITE KNOWLEDGE

No previous knowledge is needed for this chapter.
INTRODUCTION TO DATA MINING

The evolution of technology helped the internet expand at lightning speed. Over time, internet access became available to more and more people. This led to the development of millions of websites and to the use of databases for storing their data. The creation of the first commercial and social web pages created the need to store and manage large amounts of data. Today, the amount of available data is huge and grows exponentially every day. The need to minimize the cost of collecting and storing these data was one of the biggest reasons for the growth of this scientific field. The huge amount of data stored in databases and data warehouses cannot be utilized as is. In order to draw useful conclusions, some actions are required to structure the data. In this chapter we will see which fundamental stages are needed in order to extract valuable and usable information from data.

1.1 DATA SCIENCE

Data Science is a new term, which came to replace earlier terms like Knowledge Discovery in Databases and Data Mining. All three terms can be used to describe a semi-automated process whose main purpose is to analyze a huge volume of data about a specific problem in order to discover patterns, using techniques from scientific fields like Statistics, Machine Learning and Pattern Recognition. These patterns, found in multiple forms such as associations, anomalies, clusters, classes etc., constitute structures or instances which appear in the data and are statistically significant. One of the most important aspects of Data Science has to do with finding or recognizing these patterns (recognizing meaning that the patterns were not expected in advance) and evaluating them. A pattern should show signs of organization across this structure. These patterns, also known as models, can most of the time be tracked with the use of measurable features or attributes extracted from the data. Data Science is a young science, which appeared at the end of the 1980s and started
growing gradually. During this era, Relational Databases were at their zenith and served the data storage needs of companies and organizations, with the purpose of organizing and managing data better so that the mass queries needed for day-to-day operations could be completed faster. These Database Management Systems (DBMS) followed the so-called OLTP (OnLine Transaction Processing) model, designed for processing transactions. Such tools allowed the user to find answers to questions he already knew how to ask, or to create some reports. The need for better utilization of the data produced by these systems – the systems that supported the daily needs of a company – led to the development of OLAP (OnLine Analytical Processing) tools. With OLAP tools it became easier to answer more advanced queries, allowing bigger and Multidimensional Databases (MDB) to work faster and provide data visualization. OLAP tools could also be called data exploration tools, due to their visualization of the data. These tools allowed users (sales managers, marketing managers etc.) to recognize new patterns, but the discovery had to be made by the user. For example, a user could perform queries about the total revenue generated by the stores of a particular company within a country, in order to find the stores with the lowest revenue. The automation of pattern recognition came through methodologies and tools created by the field of Data Science. With these solutions, pattern recognition is guided by the final goal. For example, if a user wanted a report about the stores with the lowest revenue last month, he could ask the system to find various useful insights about store revenue. The growth of Data Science came gradually and was directly associated with the capability of collecting and storing huge amounts of data of different types, made possible by the rapid expansion of the fast web infrastructures on which commercial applications could rely. One of the first companies to embrace this advancement was Amazon, which started by selling books and other products and then created a user-friendly related-products recommendation system. This system was built and adjusted based on user interactions, using a method called Collaborative
Filtering. This system became the foundation of Recommender Systems. The unconditional generation of data on a 24-hour basis accompanies a huge range of human activities: shopping cart data, medical records, social media posts, banking and stock market operations and so on. These data come in a wide variety of types (images, videos, real-time data, DNA sequences etc.) and with different acquisition times. If some of these data are not analyzed immediately, it may be difficult to store and process them later; this situation gave rise to a new scientific field known as Big Data. Data Science's goal is to address the needs created by this new environment and provide solutions for the scalable and efficient processing of out-of-core data. Methods and tools for this purpose have already been developed, such as Hadoop, MapReduce, Hive, MongoDB and GraphDB. The two main goals of practical Data Science are to create models which can be used both for predicting and for describing data. Prediction is about using some variables, or parts of a database, to estimate an unknown or future value of another attribute. Description focuses on finding comprehensible patterns which can describe the data, such as finding clusters or groups of objects with similar attributes.
1.2 KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Knowledge Discovery in Databases consists of five steps. It is about revealing or creating useful knowledge through data analysis, and it refers to the whole process, from data collection to utilizing the outcome at a practical level. The basic stages of Knowledge Discovery in Databases are:

1. Data Collection
2. Preprocessing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation

The term "Knowledge Discovery in Databases" is often used interchangeably with the term "Data Mining", although the latter is just one of its steps. The basic goal of Data Mining (DM) is the extraction of unknown and potentially valuable information or patterns from data. One could argue that the term is used loosely, since no data extraction is actually performed; rather, preprocessed and (possibly) transformed data are used for extracting useful information, which is then used for solving a problem. Below you will find a brief description of each stage of KDD.

1.2.1 DATA COLLECTION

The first step of KDD is the collection and storage of data. Data collection is performed either automatically (e.g. using sensors) or manually (e.g. via a questionnaire). Malfunctioning sensors or unanswered questions on a
questionnaire can lead to incomplete data. Such problems, which may occur during data collection, are dealt with in the next step.

1.2.2 PREPROCESSING

The second and most important step of KDD is preprocessing, whose goal is to cleanse the data: handling defective, erroneous or missing values. Preprocessing might require up to 60% of the total effort, since there is no point discussing results if the data are not clean and in the right form. In Chapter 3 we will examine in detail the processes that preprocessing consists of and when each of them should be used.

1.2.3 TRANSFORMATION

Data Transformation is the third step of KDD. Essentially, this step is about converting the data into a common format that allows us to work with them later. It is mostly used for smoothing data and removing noise, for data aggregation, for normalization, or for creating new features based on existing ones. Special forms of transformation are discretization and compression. A small R sketch at the end of this section gives a taste of both the preprocessing and the transformation steps.

1.2.4 DATA MINING

In this step of KDD an algorithm is used for model generation. The clean and transformed data are now ready to be used by an algorithm in order to create a model, usually for classification or prediction. We want to use this model, created from already known data, to get an answer about the value of a target attribute (variable) for new, unknown data.

1.2.5 INTERPRETATION AND EVALUATION

The last step of KDD is to interpret and evaluate the results (not the model) produced by the whole process.
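Here is the small sketch mentioned above. The data frame, its column names and its values are invented purely for illustration and are not taken from the book; the sketch fills a missing value with the mean of the observed values (a simple preprocessing step) and then applies a min-max normalization (a simple transformation step).

```r
# Toy dataset with a missing value (illustrative only)
patients <- data.frame(
  age    = c(25, 47, NA, 52, 31),
  dosage = c(10, 20, 15, 40, 25)
)

# Preprocessing: replace the missing age with the mean of the observed ages
patients$age[is.na(patients$age)] <- mean(patients$age, na.rm = TRUE)

# Transformation: rescale dosage to the [0, 1] range (min-max normalization)
patients$dosage_scaled <- (patients$dosage - min(patients$dosage)) /
                          (max(patients$dosage) - min(patients$dosage))

patients
```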
1.3 MODEL TYPES

The models produced by the Data Mining step fall into two types: predictive models and descriptive models. The goal of a predictive model is to predict the values of a specific feature of interest, possibly based on the behavior of other attributes; for example, a prediction could be based on data from different days of each week. A descriptive model finds patterns or relations hidden inside the data and examines their attributes in order to provide an explanation for their behavior.
1.4 EXAMPLES AND COUNTEREXAMPLES

It is hard for some people to understand and distinguish what KDD and DM are. That is why we will look at some practical examples and counterexamples, in order to make clear what DM is and what it is not. Some examples of DM are the following:

- After 9/11, Bill Clinton announced that, after examining lots of databases, FBI agents discovered that 5 of the perpetrators were registered in these databases. One of them owned 30 credit cards with a negative balance of $250,000 and had lived in the US for less than two years.
- Telecommunication companies reward not only clients who spend lots of money but also clients known as "guides". These guides often convince friends, relatives, coworkers and others to follow them when they change provider. So, telecom companies need to find these clients and keep them loyal, providing them with higher discounts and more services.
- By using recorded temperatures from the summer seasons of the previous 15 years, we try to predict the temperatures for the summer seasons of the next 15 years.

Data mining is neither the simple processing of queries nor small-scale statistical programs. Some counterexamples are the following:

- Finding a phone number in a phonebook
- Finding information about Paris on the internet
- Finding the average of exam grades
- Searching for the medical records of a patient with a particular disease, in order to further analyze his medical record
1.5 CLASSIFICATION OF DATA MINING METHODS

There is a wide range of data mining methods. Depending on the type of data and the type of knowledge extracted, they are classified into different categories. Some basic methods of Data Mining are presented below. At this point we should mention that in Data Science learning is accomplished by using data, whereas in other forms of learning a teacher or a specialist transfers knowledge from one person to another.

1.5.1 CLASSIFICATION

Classification is a predictive method. Its goal is to create a model (a classifier) based on current data. Essentially, it is the learning of a function that maps an object (usually represented as a vector of values of its characteristic features) to a value of a categorical variable, also known as a class. Learning is a behavior of intelligent systems studied by scientific fields like Machine Learning and Artificial Intelligence, which is why these fields study many of the same problems, without this meaning that there are no other topics studied individually by each field. Classification is often associated with prediction. In classification, the outcome we want to predict is the class of the samples, and a class can only take discrete values from a finite set. In contrast, in prediction with methods like regression, the target variable can be any real number.

1.5.2 REGRESSION

Regression is a process similar to classification, whose goal is learning (or training) a function that maps an object to a real-valued variable. It is also a predictive method: by using some independent variables, its goal is to predict the values of a dependent variable.
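Before the worked example that follows, here is a minimal sketch of what fitting such a regression model looks like in R; the built-in mtcars dataset is used purely for illustration and is not an example from the book. lm() learns a linear function from an independent variable (car weight, wt) to a dependent variable (fuel consumption, mpg), and predict() estimates the target for new, unseen data.

```r
# Built-in example data: car weight (wt) as the independent variable,
# fuel consumption (mpg) as the dependent variable we want to predict
data(mtcars)

# Learn a linear function mapping weight to mpg
model <- lm(mpg ~ wt, data = mtcars)

summary(model)$coefficients   # intercept and slope of the fitted line

# Predict mpg for new cars weighing 2.5 and 3.5 (thousands of lbs)
predict(model, newdata = data.frame(wt = c(2.5, 3.5)))
```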
A classic example of linear regression involves two variables: the square meters of a house and its selling price in thousands of dollars. Linear regression fits a line to the samples of the dataset (shown as red X's in the accompanying figure). The line is determined by a distance function, or cost function, whose value we want to minimize. Once we have the optimal line (the line that minimizes the value of the cost function), we can answer fairly accurately questions like: "What is the selling price of a 150-square-meter house?".

1.5.3 CLUSTERING

Clustering is a descriptive method. Given a dataset, the goal of clustering is to create clusters, that is, groups of objects with the same or similar features. In clustering the goal is to find a finite number of clusters that describe the data. As an example, consider the result of clustering medical data, where three clusters are created based on "dosage" and "influence duration".
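A minimal sketch of how such a clustering could be produced in R with the built-in kmeans() function follows; the two-column data frame is an invented stand-in for the "dosage" and "influence duration" measurements, since the actual dataset is not reproduced here.

```r
set.seed(42)  # make the random centroid initialization reproducible

# Invented medical measurements: dosage and influence duration
med <- data.frame(
  dosage   = c(2, 3, 2.5, 8, 9, 8.5, 15, 16, 14.5),
  duration = c(10, 12, 11, 30, 33, 31, 55, 60, 58)
)

# Ask k-means for three clusters, as in the example above
fit <- kmeans(med, centers = 3)

fit$cluster   # cluster assignment of each sample
fit$centers   # coordinates of the three cluster centroids
```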
1.5.4 EXTRACTION AND ASSOCIATION ANALYSIS

Association Rules Mining is considered one of the most important data mining processes. It has attracted a lot of attention, since association rules provide a concise way to express potentially useful information in a form that is easy for end users to understand. Association rules reveal hidden relationships between the features of a dataset. These associations are presented in the form A → B, where A and B are sets of features of the data we analyze. An association rule A → B predicts the appearance of the features of set B given that the features of set A are present. A classic practical example of association rules concerns the analysis of shopping carts in a supermarket, where the data consist of client transactions. In this scenario, some transactions could be {bread, milk}, {bread, diapers, beer, eggs}, {milk, diapers, beer, soda}, {bread, milk, diapers, beer} and {bread, milk, diapers, soda}. Some association rules from these transactions could be {diapers} → {beer}, {beer, bread} → {milk} and {milk, bread} → {eggs, soda}. For example, the last rule says that whoever buys milk and bread is quite likely to also buy eggs and soda. By extracting valuable conclusions through association rules, the marketing department of the supermarket could place its products on the shelves more profitably, create better marketing campaigns and manage its resources efficiently.
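The transactions above can be mined in R with the arules package, to which the book returns in Chapter 7. The following is a minimal sketch, not the book's own example; the support and confidence thresholds are arbitrary values chosen for illustration.

```r
# install.packages("arules")  # if not already installed
library(arules)

# The five example transactions from the text
baskets <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "soda"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "soda")
)

# Convert to the transactions format expected by arules
trans <- as(baskets, "transactions")

# Mine association rules with minimum support 40% and confidence 60%
rules <- apriori(trans, parameter = list(support = 0.4, confidence = 0.6))

inspect(rules)   # e.g. {diapers} => {beer} should appear among the rules
```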