Author: Vickler, Andy
R Programming: Data Analysis and Statistics is a beginner-friendly book. It is written in an accessible way, and deals with the basics as well as more complex problems.
R Programming: Data Analysis and Statistics
Table of Contents

Introduction
Chapter 1: R Programming
  Getting Started With R
  R Basic Interaction
Chapter 2: Reproducible Analysis in R
  Literate Programming & Integration of Docs and Workflow
  YAML Language
  Running Markdown R Code
Chapter 3: R Programming Data Manipulation
  dplyr Functions
  tidyr Data Tidying
  Exercises
  Importing Data
Chapter 4: R Data Visualization
  Data Graphics
  ggplot2
Chapter 5: Large Datasets
  Data Analysis and Statistics
  Exercises: How to Subset a Large Dataset
Chapter 6: R Supervised Learning
  Machine Learning Overview
  Approach
  Examples: JMP; SPSS (R Statistical Software); Rapid Miner; Oracle Business Intelligence Enterprise / Oracle Data Miner; Citrine
  Exercises: Polynomials; Classification Measures; Analyzing Breast Cancer
Chapter 7: Unsupervised Learning
  Semantic Tensor Rotation
  Human Brain Clustering
  Probabilistic Rules
  Association Rules
  Exercises: HouseVotes84; Data Rescaling
  Project 1: Import the Data for the Analysis; Model Fitting
Chapter 8: Expressions
  Expressions: Arithmetic; Boolean
  Data Types: numeric; integer; complex; logical; non-logical; character
  Data Structures
  Control Structures: Looping; Functions; K Smallest Element
Chapter 9: R Programming - Advanced
  Vectors
  Vectorizing Functions
  The apply Function
  Advanced Functions: Infix Operators
  Functional Programming
  Function Operations: Functions as Input and Output
  Exercises: apply_if; Power; Factorial Again
Chapter 10: Object-oriented Programming
  Objects & Polymorphic Functions
  Data Structures: Bayesian Model Classes
  Polymorphic Functions
  Exercises: Conservation and Efficiency
Chapter 11: Building an R Package
  R Package
  Writing Functions
  Testing
  Package Names
  .Rbuildignore
  Linking to Other R Packages
  Requirements
  Adding Data
  Exercises
Chapter 12: Testing
  Unit Testing
  Automating Testing
  Test Structures
  Random Numbers in Tests
  Consistency
  Exercises
Chapter 13: Version Control
  Version Control
  Repositories
  GitHub
  Forking
  Exercises
Chapter 14: R Profiling
  Profiling
  R Optimization
  Bayesian Linear Regression
  Model Matrices
  Exercises: Fitting General Models; BLM Testing
Conclusion
Introduction

R is the second most popular programming language used in data science. It brings data analysis and statistical methods to the masses, letting you produce sophisticated graphics quickly, perform calculations, and crunch numbers. Millions of developers worldwide use the R programming language, making it one of the most widely used tools in computer science. It is free and open source, so developers can customize the way their programs work.

Most statistical analysis tools are challenging to use and demand a particular degree of knowledge or experience; using them effectively can require something like a bachelor's degree in mathematics. R, by contrast, is an easy-to-use tool well suited to beginners who have not yet developed that expertise, and it is ideal for analysts who want to save time with a more effective tool. The statistical commands in R do the same jobs as those in most other statistical environments; only the syntax differs. Because R makes it so easy to perform calculations and run programs, many of the world's investment banks now use it.

R was initially developed by Ross Ihaka and Robert Gentleman at the University of Auckland in 1993. Since its inception, it has been used for research projects in science and technology departments across the globe. Over time, many other programmers have volunteered their time to the R project, improving its functionality even further and making it more user-friendly. At the time of writing, the most recent release was R 3.4.1, published on June 30, 2017. The R Foundation is the organization behind R, and these days various industry leaders and companies sponsor its projects. Because R is free and open source, its source code can be viewed and modified by anyone, and the R development team can use feedback and usage statistics to improve its functionality.
Many companies around the world use R for their data analysis needs, and its popularity has produced thousands of packages that can be loaded onto your system. Data science has grown into a major industry on this platform as businesses struggle to analyze the vast amounts of information generated by new technologies such as digital cameras, Internet-connected sensors, and social media platforms where individuals publicly share their thoughts and actions in real time.

In this book, we will explain how to use R for data analysis easily and effectively by showing how to perform simple calculations and data manipulations with R. As a result, we will be able to analyze datasets from fields as diverse as economics, biology, health, marketing, and psychology. If you are completely new to R or to the subject of statistics, do not worry if some parts of the text seem incomprehensible at first; even researchers in these fields find it difficult at times. This book is meant to provide a gentle introduction for readers without formal computer science or mathematics training.

If you have used R before, you will know that it contains an extensive collection of functions that can be called from the command line or through an interactive interface. You can also save your work to a file and load it back into the same environment. In this way, you can revise your code multiple times to improve its performance and accuracy. R offers a variety of functions for manipulating and analyzing data in a very intuitive manner: for instance, you can search for a specific value in your dataset with just one line of code, perform calculations without spelling out every intermediate step, or plot charts using vectors and labels.
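As a taste of what those one-liners look like, here is a minimal sketch; the data and names are invented purely for illustration:

```r
# A small vector of invented measurements
heights <- c(162, 175, 168, 181, 175)

# Search for a specific value in one line
which(heights == 175)

# Perform a calculation without spelling out every step
mean(heights)

# Plot the data with axis labels
plot(heights, type = "b", xlab = "Person", ylab = "Height (cm)")
```

Each of these is a single expression typed at the R prompt; the plot is written to the active graphics device.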
Chapter 1: R Programming

R is a programming language, comparable to general-purpose languages such as C++ and Java, but designed for statistical analysis and data science. It can also be used as a scripting language for automated statistical modeling. Because R is open source and free, it is a very accessible tool for data analysis: it can easily be downloaded onto a computer, and all of its source code is available on the Internet. However, as with any programming language, there are syntax rules that one must follow to write correct R code.

Furthermore, some statistical models require data preparation steps (e.g., imputation of missing values) before the data can be used in the model. Although R has many functions that can perform these tasks automatically, the results are not always accurate and should be checked before being used as part of a statistical model. For example, many statistical workflows impute missing data automatically by default, but the imputed values can introduce bias into the results if they have not been adjusted for possible underlying non-sampling errors.

Getting Started With R

R is a powerful tool for data analysis; however, it can be time-consuming and frustrating to use without the proper knowledge. Luckily, many resources are available for learning how to use R for data analysis and statistics correctly. These include books and the official documentation, and learning tools such as RStudio and DataCamp can also help users learn to use R for statistical analysis.

RStudio is a user-friendly interface for creating and managing R projects. It includes many powerful tools and functions used in R projects, such as code completion, syntax highlighting, and an integrated debugger. This can make data manipulation more efficient because typing each function in full is no longer necessary. It also provides an easy-to-use interface for creating, editing, and managing R datasets for analysis.
DataCamp is a website designed to help users learn statistics through interactive lessons on statistical concepts. The site has interactive exercises that let the user complete tasks without worrying about the programming side of statistics, translating commands into R code for automatic execution. The platform lets users write code, run models and plots, and view data all in one place, and it has many features that make writing code easier, such as syntax highlighting and auto-completion.

Learning the basics of R through an online course is another great way to get started with this programming language, and the other resources mentioned above can be very helpful in this regard. For example, DataCamp has a course designed to teach statistical programming in R. The course takes a step-by-step approach, starting with installing R on one's computer and writing the first command, then progressing to more advanced topics such as writing functions and using packages. The big difference between R and most other languages is that R focuses on statistics rather than on software design or architecture; ultimately, you may find the learning curve steeper because of how R is laid out. This chapter will illustrate the R programming language with some examples of data analysis tasks to get you started on your projects.
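To give a sense of that first step, here is what an initial interaction with the R console might look like:

```r
# Create a numeric vector with the c() (combine) function
x <- c(1, 2, 3, 4, 5)

# A few first commands
sum(x)      # add the elements
mean(x)     # their average
length(x)   # how many there are
```

Typing each line at the prompt prints its result immediately, which is how most people explore R at first.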
R Basic Interaction

The R language is a popular programming language for statistical analysis, and it has many built-in commands we can use to help us run our data analysis. These commands are called functions, and users can also write their own functions to create customized algorithms. This section goes over some of the most basic functions in the R language so that we can better understand how they work and how they can aid us in our research. Some of the most common data analysis commands include:

summary(): generates a statistical summary of the data, including the mean, median, and quartiles, among other things.
mean() and median(): determine the average or median of your data set.
sd(): computes the standard deviation.
t.test(): carries out Student's t-test.
cor(): computes the correlation coefficient between variables.
range(): returns a vector holding your data set's minimum and maximum values.
hist(): draws a histogram.
plot(): produces a graphical display of your data.
table(): builds a contingency table of counts; for a one-way ANOVA table, fit a model with aov() and print it with summary() or anova(), which shows the F values and p values of the analysis.
boxplot(): draws a box plot summarizing the distribution of each variable, using the default graphics but with the ability to add annotations with custom text.
ggplot2: a very popular add-on graphics package offering a user-friendly, layered approach to creating plots in R.
grid(): adds grid lines to your figures, which can help when reading values off a plot.
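A quick sketch of these commands on simulated data (the numbers are invented, and the seed just makes the simulation repeatable):

```r
set.seed(42)                        # reproducible simulation
x <- rnorm(100, mean = 10, sd = 2)  # 100 draws from a normal distribution
y <- 2 * x + rnorm(100)             # a second, correlated variable

summary(x)                          # statistical summary
mean(x); median(x); sd(x); range(x)
cor(x, y)                           # correlation coefficient
t.test(x, mu = 10)                  # one-sample t-test against mean 10
hist(x)                             # histogram, drawn to the graphics device
```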
Beyond the base "stats" functions, add-on packages extend the toolbox: for example, "car" provides regression diagnostics, "survival" fits Cox's proportional hazards model for time-to-event data, "coda" analyzes the output of Bayesian (MCMC) simulations, and so on.

Difference between analysis and data science

Before starting with some practical R data analysis, let's look at the difference between data analysis and data science. Analysis is the first step we need to take before data science. Data science is about extracting insights from our analysis to make better decisions based on what we learned from our findings. Data scientists look for meaningful relationships and trends between different variables, while statisticians concentrate on p-values, confidence intervals, null hypothesis testing, and so on.
A good statistician will also be a good data scientist, because he or she has a very good understanding of all these statistical tests.

Data Visualization in R

There are many ways to visualize our data, and R has many built-in functions to help us with this task. One of the most popular graphing techniques is the scatterplot, a graph that plots two numerical variables against each other: the x-axis carries the independent variable and the y-axis the dependent variable. You can use the plot() function to plot data in R, passing a single column of data to plot one variable or two columns (x and y) to plot one against the other, optionally with different colors.

Data Input and Output

R can take data from many different sources, including text files of comma-separated values (CSV files, which Excel can export). We can use the read.csv() or read.table() command to read data from these file types into our workspace. The result is a data frame, which organizes data in rows and columns and is a very popular way of organizing and storing data.

Basic Statistics in R

We can get much good information out of our data just by knowing how to use the basic statistics functions that come built into R. The summary() command helps us get a quick statistical summary of the data. We can also use the mean() and median() commands to find the average or median of a data set. The standard deviation (sd()) is another very useful statistical tool that measures the variability in our data. Student's t-test (t.test()) is handy for checking whether there is a significant difference between two means; one-sample and paired-samples t-tests are built into R. The correlation coefficient (cor()) is another very useful statistic that measures the relationship between two variables.
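These ideas can be sketched with R's built-in mtcars data set; the CSV file below is created on the fly purely for illustration:

```r
# Scatterplot: car weight (x, independent) against fuel economy (y, dependent)
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Round-trip through a CSV file
csv <- tempfile(fileext = ".csv")
write.csv(mtcars, csv, row.names = FALSE)
df <- read.csv(csv)   # df is a data frame: rows and columns

# Correlating several variables at once returns a correlation matrix
cor(df[, c("mpg", "wt", "hp")])
```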
Correlating more than one variable at once

Data researchers very often need the correlations among more than two variables at once. For example, imagine that you have three variables (Y1, Y2, and Y3) and want every pairwise correlation among them. The cor() function returns a value between -1 and +1 for each pair, depending on whether the two variables are negatively or positively correlated, and when given a data frame or matrix it computes all the pairs at once:

1. Combine the variables into a data frame with data.frame(Y1, Y2, Y3).
2. Call cor() on that data frame; the result is a correlation matrix with one row and one column per variable.
3. Read off the correlation of any pair from the corresponding cell; the diagonal is always 1, since each variable is perfectly correlated with itself.

How to do Multiple Regression in R?

Multiple regression is when two or more independent variables predict one dependent variable. It is commonly used when our dependent variable is continuous (e.g., income or weight). The lm() command can perform multiple regression in R.

How to do Logistic Regression in R?

Logistic regression is a type of regression used to study the relationship between a categorical dependent variable and one or more independent variables. It predicts probabilities between 0 and 1, where 1 means the event is certain to happen and 0 means it certainly will not. The glm() command, with family = binomial, can do logistic regression in R.

How to do a Chi-Squared Test in R?

The chi-squared test can test the null hypothesis against an alternative hypothesis, and it is a very popular technique for testing independence between two categorical variables. We can use the chisq.test() command to run it in R.

How to do Pearson's Correlation Coefficient in R?

Pearson's correlation coefficient is a common statistical tool that measures how strongly two variables are linearly related. We can use the command cor() to output a correlation coefficient between variables Y1 and Y2 from our data, or the command cor.test() to obtain the coefficient together with a significance test.

How to run a stepwise multiple linear regression in R?

Stepwise regression selects variables, based on our prior knowledge or a statistical criterion, until a final set of variables is chosen to predict our dependent variable. We can use the command step() to carry out this process.

How to run a partial correlation in R?
The partial correlation coefficient measures the part of the relationship between two variables that is not explained by the other variables in our data. This can be very useful because it lets us see which other variables may be driving a relationship and which are not. A simple way to compute it is to regress each of the two variables on the controlling variables with lm(), take the residuals with resid(), and correlate those residuals with cor(); dedicated packages such as ppcor also provide it directly.

Categorical versus Numerical data: what should you use?

There are two main types of data, numerical and categorical. Numerical data has a natural order among its values, for example: {1, 2, 3}. Categorical data instead takes its values from a set of labels with no inherent numeric meaning, for example: {A, B, C}. Categorical data whose categories do have a natural order is called ordinal data. The best way to decide whether to treat your data as categorical or numerical is to think about how your data will be analyzed. If your values lie on a continuous scale (e.g., height) or are measured in meaningful intervals (e.g., temperature), treat them as numerical, because you can then perform statistical analysis on them directly with the summary() command and the various distribution functions that come with R. If, however, your data consists of categories (e.g., city names), treat it as categorical (a factor in R), because most numerical statistics make no sense for it until it has been suitably encoded.
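The commands from the questions above can be sketched together on the built-in mtcars data; the variable choices here are purely illustrative:

```r
# Multiple regression: predict fuel economy from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Logistic regression: transmission type (0/1) from weight
logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Chi-squared test of independence between two categorical variables
# (factor() marks a column as categorical)
tbl <- table(factor(mtcars$am, labels = c("automatic", "manual")),
             factor(mtcars$cyl))
chisq.test(tbl)

# Pearson correlation with a significance test
cor.test(mtcars$wt, mtcars$mpg)

# Partial correlation of mpg and wt, controlling for hp, via residuals
r_mpg <- resid(lm(mpg ~ hp, data = mtcars))
r_wt  <- resid(lm(wt  ~ hp, data = mtcars))
cor(r_mpg, r_wt)
```

The chi-squared test may warn about small expected counts on a data set this size; with real data you would want larger cells.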
Chapter 2: Reproducible Analysis in R

The tools and techniques of data analysis and statistics have changed profoundly over the last decade. Thanks to many free software packages, researchers have been able to produce increasingly reproducible research. Reproducible research is essential because it provides a measure of trustworthiness that other studies cannot. R ships with a set of software tools, and many more are distributed as add-on "packages", that can be used to analyze data and produce statistical reports. For instance, there are packages for data manipulation, data visualization, analysis of datasets, and statistical modeling and simulation, as well as packages dedicated to specific tasks such as spatial analysis. However, because packages are developed independently of each other and provide different functions, it is possible to find redundant functionality. Therefore, package developers need to work together to create integrated solutions that present a single interface to users.

A well-defined set of instructions is necessary to reproduce the results shown in a published scientific report. Reproducible research requires this information to ensure that others can replicate and verify the results obtained. It also encourages communication between scientists by providing common ground: access to the same information and tools, whether those are programs available via the Internet (packages), hard copies, or electronic copies of code. In particular, R packages provide a convenient and flexible way of sharing methods and results. Creating an R package begins by installing the latest R version on a suitable computer and choosing a directory in which the package will be created. This directory contains the source code, written using a text editor (such as Sublime Text or Notepad++).
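Concretely, a minimal package skeleton can be created from R itself; the package name "mypkg" and all file contents below are placeholders:

```r
# A minimal R package layout, created in a temporary directory
pkg_dir <- file.path(tempdir(), "mypkg")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_dir, "man"), showWarnings = FALSE)

# The DESCRIPTION file holds the package metadata
writeLines(c("Package: mypkg",
             "Title: An Example Package",
             "Version: 0.0.1",
             "Author: A. Author",
             "License: GPL-3",
             "Description: A sketch of a minimal package layout."),
           file.path(pkg_dir, "DESCRIPTION"))

# Code lives in R/, documentation files in man/
writeLines('hello <- function() "Hello from mypkg"',
           file.path(pkg_dir, "R", "hello.R"))

list.files(pkg_dir, recursive = TRUE)
```

Helpers such as utils::package.skeleton() can generate this layout automatically from existing objects.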
To create an R package, make a directory for it and add a file called "DESCRIPTION" that contains information about the package, such as its title, author, and license. In addition, three main steps need to be completed. First, one or more R script files are placed in a subdirectory named "R"; these contain the actual code written by the author. Second, a directory named "man" is created, holding the documentation files. Finally, a "README" file should be written in the package's top-level directory, in which the author includes information about the purpose of the package and how to install and use it.

Literate Programming & Integration of Docs and Workflow

Data science has become a dominant force in the modern world. One of the main benefits of this movement is that it pushes us to document our workflows and package them into executable documents or scripts. This is what we call literate programming. With R and literate programming, we can easily create rich reports with graphics and statistical summaries without having to copy and paste code between different files as we go along.

R is an environment with many more features than simply running programs. One of these features is custom data types, which allow us to define our own types, and a rich collection of packages that can be imported into the environment (even on Windows). Among the most important built-in structures is the data frame, which can be used to create tables and graphs. The idea of literate programming was brought to R by Sweave, an extension that lets us write documents (originally LaTeX) and embed code chunks where necessary: Sweave takes the code and the document and produces a typeset report. Sweave has since been superseded by knitr, an enhanced reimplementation with many new features, including support for Markdown documents rendered to HTML.

To drive R from another programming language, interface packages exist; for example, on Windows the rcom package provides a COM interface through which other applications can pass data into an R session, run code, and retrieve results. Once you have these requirements in place, you can start writing your documents. You first write a code chunk in R and then place the Markdown text that describes what the code does around it. As you write your document, the rich text editor in RStudio lets you preview the output of your code while you type, which helps you see whether there are any issues or mistakes in your code before you run it. For example, suppose we want to graph simulated data drawn from a standard normal distribution across several samples. You can use the following R code to generate this data set.
set.seed(1)
# Create a matrix holding 10 simulated samples of 100 standard normal draws
data_mat <- replicate(10, rnorm(100))
# Draw a box plot of each simulated sample
boxplot(as.data.frame(data_mat), names = paste("Sample", 1:10))

We now have the code written in R together with the text that describes it. When the document is rendered, the code runs and its output appears in place; in this case, it creates a box plot of all 10 data sets. After the complete text description, you can write another chunk of code manually or let knitr format the results for your report. For example, let's say we want a table showing summary statistics for each of the 10 samples. We can compute the statistics into a data frame and pass it to the kable() function from knitr, which renders a data frame as a labeled table with a column heading for each statistic:

# Summary statistics for each simulated sample
data_grid <- data.frame(sample = paste("Sample", 1:10),
                        mean   = apply(data_mat, 2, mean),
                        sd     = apply(data_mat, 2, sd))
knitr::kable(data_grid)

Once you have run this code and created the table, you can save the rendered document as a .pdf or .html file. You can then send this document to your client, or share the source so that he or she can open it in RStudio and see exactly what you have described.

YAML Language

YAML promotes "humane", readable data description. The name originally stood for "Yet Another Markup Language" and was later re-glossed as "YAML Ain't Markup Language" to stress that it is a data format rather than a document markup language. It was created with the primary design goal of making readability a top priority; by prioritizing readability, it makes writing configuration more pleasant and less like work. In R Markdown documents, YAML is the language of the metadata header at the top of the file. A string may be written plainly or as a sequence of characters surrounded by single or double quotes. Strings are useful for storing text such as names, titles, and addresses, and quoting can also force other data, such as numbers, to be read as text. YAML text is Unicode, and any valid UTF-8 text can appear in a YAML document without error; for pure-ASCII content the encoded bytes are identical to ASCII, because UTF-8 is ASCII-compatible. Numbers, too, can be written in several notations, including octal and hexadecimal.
In YAML's core schema, an integer may be written in decimal (42), octal (0o52), or hexadecimal (0x2A) notation; all three spellings denote the same number. A value with a decimal point, such as 1.2, is read as a floating-point number rather than an integer, so 1 and 1.0 are different types even though they are numerically equal. Hexadecimal notation counts 0x0 through 0xf before carrying to 0x10, just as octal counts 0o0 through 0o7 before carrying to 0o10. In short, the octal and hexadecimal notations behave in exactly the same way; they simply use bases eight and sixteen in place of base ten.
A value's type in YAML is determined by how it is written: unquoted 42 is an integer, 4.2 is a float, true is a boolean, and anything quoted is a string; an explicit tag can override the default resolution when needed. Beyond scalar values, YAML provides two collection types for structuring data. A sequence is an ordered list of values, written either with a leading dash per item or inline as [a, b, c]. A mapping is a set of key-value pairs, written as "key: value" lines or inline as {key: value}. Sequences and mappings nest freely, so a list of mappings is a common idiom:

  countries:
    - name: Sweden
      score: 7.3
    - name: USA
      score: 6.9

A YAML document is data, not a program: nothing in it is executed or mutated when it is read, which makes documents safe to share and easy to reason about. This applies to sequences and mappings as much as to scalars.
Because YAML is a data format, it does not run anything itself; parser libraries, often implemented in C or C++, read YAML text into the native data structures of a host language. In R this role is played by the yaml package: yaml::read_yaml() parses a YAML file (and yaml::yaml.load() a string) into nested R lists and vectors, while yaml::as.yaml() converts R objects back into YAML text. The scalar types described above (strings, integers, floats, and booleans) arrive in R as character, integer or numeric, and logical values, and sequences and mappings become vectors and named lists. One limitation to keep in mind is that YAML deliberately supports no computation: there is no arithmetic, pointer manipulation, or bit twiddling in a YAML document, so any processing of the data must happen in the host language after parsing.
The following example shows one way that YALM could implement pointer access: @str = string(path(“/etc/shadow”)) While (readline(object)) { #until found gap size object.offset = object.size – object.offset print((object.offset) + “/” + _text) print((object.offset + object.size) + “/” + _text) If (object.offset > object.size) { break } If (offset > 0 && offset < object.size && strcmp(readline(object), “</string>”)) { #increment pointer by one size object.offset = object.size – 1 print((object.offset) + “/” + _text) } else { #no more data break } } else { print(“No data found.”) } The language specification for an object contains many more examples to help programmers write complete classes, including how to define new types. It also includes many detailed descriptions of the methods in the class itself.
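The chapter's table of contents labels this section "YAML Language," and in practice the place most R users meet YAML is the metadata header at the top of an R Markdown document. A minimal sketch of reading such a header from R, assuming the `yaml` package from CRAN is installed (the title and author values are invented for the example):

```r
library(yaml)  # CRAN package for parsing YAML text into R objects

# A typical R Markdown metadata header, held here as a plain YAML string
header <- "
title: My Analysis
author: A. Vickler
output: html_document
"

# yaml.load() turns the YAML text into a named R list
meta <- yaml.load(header)

meta$title   # "My Analysis"
meta$output  # "html_document"
```

Because `yaml.load()` returns ordinary named lists, the parsed metadata can be inspected and modified with the usual R list operations.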
Running Markdown R Code

Markdown documents can now contain R code chunks. To use this capability you need the rmarkdown package, which relies on knitr to execute the chunks; a MathJax plug-in is only needed if the rendered pages contain embedded math. With it, you can write Markdown documents that include R code snippets and render them in a web browser, embedding interactive plots, charts, and similar output directly in blog posts on sites like WordPress or Medium, without maintaining separate Markdown files on a web server that supports Markdown rendering.

Markdown is a lightweight markup language that provides a convenient way to write structured documents in plain text. Its major components are lists, headings, and indented text. In the example document discussed here, the first item in the list is a bullet called "The Happiest Countries," followed by a numbered list of countries; the second item is "The USA," followed by an unnumbered block containing two items; and the third item is "Sweden," which is formatted as a code block.

Because Markdown has this simple, regular structure, it is straightforward to write programs that parse and render Markdown documents, which makes it useful for automatically creating web pages from Markdown documents that contain programming code. The R Markdown package contains a parser and a renderer that convert a Markdown document into R code. In its simplest form, the input to the parser is a document written in Markdown format. The parser splits the input into its pieces: the items inside lists, code chunks (if there are any), and blocks of embedded math (if there are any). The output from the parser is a list of the inputs to each render function.
Each render function performs a particular action on its input and returns a new output (in this case, an R object with graphics embedded in it). Graphics can be produced from Markdown documents in various ways: the render() function applies the graphics-rendering capabilities of the R graphics system to generate a PNG or PDF file, and PDF output can be inspected with an external viewer such as xpdf, which is useful when you want to check the images included in your documents. Ultimately, the rendering process produces R code and a set of output files. In this example, the plot is embedded in a .Rmd document. The .Rmd file can then be compiled with knitr to produce a stand-alone HTML file that includes plots and other content written in Markdown syntax. You can also use the roxygen2 package to write inline R documentation, and render document vignettes with rmarkdown::render().
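The workflow above, writing a Markdown document with an embedded R chunk and then compiling it, can be sketched in a few lines of base R. The file name demo.Rmd is just an illustration, and the final rendering step needs the rmarkdown package plus pandoc installed, so it is shown commented out:

```r
# Build a minimal R Markdown document: YAML header, prose, and one R chunk
doc <- c(
  "---",
  "title: Demo",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text.",
  "",
  "```{r}",
  "plot(pressure)  # this chunk is executed and its plot embedded in the output",
  "```"
)

# Write the document to disk as a .Rmd file
writeLines(doc, "demo.Rmd")

# Compile it to a stand-alone HTML file (requires rmarkdown and pandoc):
# rmarkdown::render("demo.Rmd")   # produces demo.html next to demo.Rmd
```

The header between the two `---` lines is the YAML metadata block described earlier in the chapter; everything after it is ordinary Markdown plus executable chunks.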
Chapter 3: R Programming Data Manipulation

Data manipulation is a process used by statisticians and data analysts to organize and analyze data. It can be done in various ways, including with programming languages that manipulate the data through scripting. It is usually done using programs such as R or SAS, but some analysts also use spreadsheets or other applications. Data manipulation helps us understand what we are looking at through a more thorough analysis of a particular dataset. There are two main approaches:

1. "Automated" data manipulation
2. "Manual" data manipulation

Automated data manipulation can serve many purposes, such as identifying outliers, comparing the sizes of datasets, or identifying relationships using the Pearson product-moment or Spearman correlation coefficients. The two main methods used in automated data manipulation are:

1. "Projecting" data. Projecting involves taking all of your available data, putting it through a function (or set of functions), and outputting the results. The output is often a new dataset, and the original dataset is discarded once the process is complete.

2. "Filtering" data. Filtering involves combining all of your available data into a new dataset and removing any points that fall outside certain acceptable ranges. For example, you can determine how many points lie within specific intervals on your x-axis (say, using the value 500 as an interval boundary). Filtering may be used, for example, to identify outliers or unusual values in a dataset.

Data projected by automatic programs is one set of numbers processed with a certain set of instructions, ready to be used as input to another procedure that has its own set of instructions to perform some operation on it (for example, filtering).
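The "projecting" and "filtering" operations described above can be sketched in a few lines of R on a made-up dataset. The three-standard-deviation cut-off is an arbitrary illustrative choice, not something the chapter prescribes:

```r
set.seed(1)
# 100 ordinary points around 50, plus one obvious outlier at 500
x <- c(rnorm(100, mean = 50, sd = 10), 500)

# "Projecting": push the data through a function and keep only the output;
# scale() returns a new dataset of z-scores derived from the original
projected <- scale(x)

# "Filtering": keep only the points inside an acceptable range
lower <- mean(x) - 3 * sd(x)
upper <- mean(x) + 3 * sd(x)
filtered <- x[x > lower & x < upper]  # drops the 500 outlier

length(filtered)  # 100 points survive the filter
```

Here the projected dataset replaces the original, exactly as described above, while the filter removes the single point that lies outside the acceptable range.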
Formatting

Manual data manipulation is generally easier to use, but it requires some practice with programming and scripting languages. The benefit of this method is that you can also adjust the data during your analysis, which cannot be done with automatic methods. Some manual data manipulation can be done using the R language's data frames. The best way to format this data is to create a new dataset containing only your new variables, and to save the new variables in the same format as the original data. Data manipulation also involves formatting, usually to ensure that columns align correctly and look neat. This can be done in several ways, but the most common in R is to use the paste() function, together with format() for padding values to a common width. Such formatting helps arrange data in tables, rows, or columns and makes them more visually appealing. Automated data manipulation requires very little formatting and is often done directly in a spreadsheet or similar program; the downside is that setting it up takes more technical knowledge, and you cannot adjust the results along the way as you can with manual methods.
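To make the column-alignment point concrete, here is a small sketch using paste() together with base R's format(); the animal names and weights are invented for the example:

```r
animals <- c("cat", "elephant", "ox")
weights <- c(4.2, 5400, 700)

# format() pads each value to a common width: character vectors are
# left-justified, numeric vectors are right-justified, so the columns line up
lines <- paste(format(animals, width = 10), format(weights, width = 8))

cat(lines, sep = "\n")
```

Every element of `lines` has the same number of characters, so printing them one per row yields neatly aligned columns without any manual spacing.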