Learn R

Introduction
Downloads and Installation
Basic Commands
R Functions
Packages
Practice


Before you begin, download the practice dataset.

Introduction

What is R?

Very few researchers compute statistics by hand. Most make computers do their statistics for them. Also, some statistical routines are impossible to do without computers.

Several programs are available for data analysis, including SPSS, Excel, SAS, Matlab, Minitab, and R. Some of these programs (such as SPSS, Excel, and Minitab) have a familiar, easy-to-use interface: you simply point and click in order to analyze your data.

However, ease of use often comes at the expense of flexibility. Some programs (such as Excel or Minitab) only allow the user to do basic statistical analysis and data manipulation. The “big dawgs” like to use programs that are much more flexible, such as SAS and R.

Programming Statistics

While Excel and SPSS have point and click interfaces, R does not. Instead, the user must program their statistical analysis. For example, performing a regression analysis in Excel is a simple matter of pointing and clicking. R, on the other hand, must be told through a computer language how to perform the statistical analysis:

lm(y~x, data=testData, subset=(group==1))

Don’t worry about interpreting the code just yet, but you can see that things become a little more complicated when the user must code their statistical analysis rather than pointing and clicking. However, with that added complication comes added flexibility.

Rather than explaining what R is, it’s probably better for you to dive in. In the next chapter, we will show you how to perform some rudimentary R operations to give you a feel for how to “speak” the R language. After that, we’ll dive in to some more advanced R programming.

What This Tutorial Is and Is Not

This tutorial does not provide comprehensive documentation of the R language. Instead, it’s a brief introduction to the language to help you get your “feet wet.” I have learned several programs in my lifetime, and I’ve found the best way to learn them is to have a short introduction, then spend the rest of the time messing around with it. So, with that, let’s wet those feet of yours!

Installation

R is available for download for free from CRAN (mac download | pc download). To download it, first visit the website, then click on the link that says, “Download R For…" Since I’m working on a Mac, I clicked on the second link.

[[insert image]]

Next you will be taken to a screen where you can actually download R. Click on the top-most link (make sure you have an operating system that meets the minimum requirements). If your OS doesn’t meet the minimum requirements, then choose a file that does.

[[insert image]]

Then you wait! The download shouldn’t take too long. After it finishes, open the file and allow R to install on your machine. Congrats! You just installed R!

The R Interface

When you first open R, it will look like the figure below. You could start using R at this point by typing in commands at the point where the > is located. Let’s go ahead and try that.

[[insert image]]

The first thing that we’ll do is create a variable called hello and give it the value of 107. To do that, we would do the following:

hello = 107
hello
## [1] 107

Notice how it returned the value “107” back to you. All you did was create something (we’ll call it a variable) called “hello” and assign it a value of 107. Now let’s try something else. Let’s create a variable called weight:

 weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
 weight
##  [1] 143 137 137 131 129 125 125 124 124 120

Suppose that this represents your weekly weight over a 10-week period. (Kudos to you! You’re learning R and losing weight!) So here’s what we did: we created a variable called weight, but rather than holding a single value, it actually contains 10 values. This is called a vector. That’s why we put “c(…)” there: c stands for “concatenate,” which is an overly technical way to say, “I put a bunch of things into a single object.”

Since weight now contains several numbers, we can start doing statistics on it. For example, we may compute its mean.

mean(weight)
## [1] 129.5

(Don’t forget to hit enter.) Or we could compute the standard deviation

sd(weight)
## [1] 7.367

or the median

median(weight)
## [1] 127

Or we could plot it

plot(weight)

At this point, hopefully your results matched mine. If so, good job! You’ve just done your first bit of R programming.
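If you’re curious what mean, sd, and median are doing behind the scenes, here is a minimal sketch that reproduces the three numbers above by hand. The variable names ending in ByHand are my own inventions for this illustration, not built-in R functions.

```r
weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
n = length(weight)                     # how many values we have (10)

# mean: add everything up, then divide by the number of values
meanByHand = sum(weight) / n

# standard deviation: square root of the summed squared deviations over n - 1
sdByHand = sqrt(sum((weight - meanByHand)^2) / (n - 1))

# median: sort the values, then average the two middle ones (n is even here)
sorted = sort(weight)
medianByHand = (sorted[n/2] + sorted[n/2 + 1]) / 2

# these should agree with mean(weight), sd(weight), and median(weight)
c(meanByHand, sdByHand, medianByHand)
```

Note that the median shortcut above only works when the number of values is even, as it is here; the built-in median handles both cases for you.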

Now go ahead and close the plot window (after spending several minutes admiring it!). Next, let’s suppose that you should have put the value 135 in the third position, not 137. Oops. To correct it, there are several things you could do. Here’s one way: push the up arrow button six times until R shows the following:

weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)

Then you could click on the third 137, change it to 135, then hit enter. Go ahead and do it. In this example, that was easy because you noticed the error only 6 steps into your code. But suppose you had done 100 things between creating the variable and noticing the error. You’d have to push the up-arrow button 100 times! So, let me show you a much better way of keeping track of your code and keeping a better record of what you’re doing.

The first thing we’ll do is hit command-N (if you’re on a PC, you’ll have to go to the file menu, then click on “New script“). Notice that a new window pops up.

[[insert image]]

The good thing about this second window is that it allows you to write code, save it, edit it, then selectively run it. Let me give you an example. On the right window (the one that you just opened), write the following code just as you did before:

 weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
 mean(weight)
 sd(weight)
 median(weight)
 plot(weight)

Then go ahead and hit command-S (or control-S) to save the document. Save it somewhere you’ll remember. I saved mine in my documents folder and called it “weightloss." When you save a script (which is what this window is called), then you can access and run it later. Now that it’s saved, let’s go ahead and have R run the script. To do so, click the mouse anywhere on the first line and hit command-enter (on a Mac) or control-R (on a PC). Notice that on the left window it has the following:

 weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)

So, what just happened? What you did was create your code in the right window, then send that line to the left window. To run all the code, click on the right window, hit command-A (or control-A) to highlight all your code, then hit command-enter (or control-R). Notice how all of your code is run in one go, one line after another.

The good thing about this method (i.e., the method of having two windows) is that you have one window where you can edit/write your code and the other window is dedicated to running the code. That way, if you need to make an edit, you don’t have to keep pushing the up arrow every time you need to change something. Instead, you edit it in the right window, then execute the code (by pushing command-enter or control-R).
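A saved script can also be run all at once from the left window with the source function. Here’s a small sketch of the idea. So that it runs anywhere, it writes a throwaway script to a temporary file first; the names scriptPath and weightMean are just placeholders I made up, and with your own script you would pass the path to the file you saved instead.

```r
# create a throwaway script file (stand-in for the script you saved)
scriptPath = tempfile(fileext = ".R")
writeLines(c(
  "weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)",
  "weightMean = mean(weight)"
), scriptPath)

# run every line of the script, top to bottom
source(scriptPath)

# the variables created by the script are now available
weightMean
```

This does the same job as highlighting everything and hitting command-enter, except that source stays quiet unless the script explicitly prints something.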

Basic Commands

Scalars, Vectors, and Matrices

Let me start by informing you how much you know. You already know what a scalar and a vector are. Remember the variable \(hello\)? That was a scalar. A scalar is a variable that contains only one value (in that case it was 107). The variable “weight,” on the other hand, was a vector: it contained multiple values. There is still one other type of variable: a matrix. Aside from being a deceptive reality that people like Neo and Morpheus battle to overthrow, a matrix also has special meaning in R. When you think of a matrix, think of a spreadsheet. If you’re unfamiliar with what a spreadsheet is, think of a table with columns and rows. A vector doesn’t have columns and rows; it’s just a sequence of numbers. Matrices, on the other hand, do.

Let’s try to get a visual of what a matrix looks like. In the file that you created (the one that I called “weightloss”), write the following after the line that plots the data

weightData = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
matrixWeightData = matrix(weightData, nrow=5, ncol=2)
matrixWeightData

After that, highlight the lines that you just wrote, then hit command-enter. On the left window, you should see the following

 weightData = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
 matrixWeightData = matrix(weightData, nrow=5, ncol=2)
 matrixWeightData
##      [,1] [,2]
## [1,]  143  125
## [2,]  137  125
## [3,]  137  124
## [4,]  131  124
## [5,]  129  120

Before you get overwhelmed by what I just did, let me talk in R. Here’s what R heard, “Alright R, I want you to create a new vector called weightData. Give it the values 143, 137, 137, 131, 129, 125, 125, 124, 124, and 120. Then, R, I want you to create a matrix called matrixWeightData that contains the exact same information as weightData, but I want you to give the matrix 5 rows and 2 columns. Got it R? Ok, then I want you to show me what matrixWeightData looks like. Now, go!”

Notice that both the variables “weight” and “matrixWeightData” contain the exact same information. The only difference is that weight is a vector, and matrixWeightData is a matrix. To see them right next to each other, type the following. Notice that I’m typing it into the left window because I feel no need to save this information, but you can if you want to.

weight
##  [1] 143 137 137 131 129 125 125 124 124 120
matrixWeightData
##      [,1] [,2]
## [1,]  143  125
## [2,]  137  125
## [3,]  137  124
## [4,]  131  124
## [5,]  129  120

Again, the variables have exactly the same information, but one is contained as a vector and the other is contained as a matrix.
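To make the vector/matrix difference a bit more concrete, here’s a short sketch of how you would pull individual values out of each. Square brackets are R’s way of asking for a position; the variable byRowData is a name I made up for the example.

```r
weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
matrixWeightData = matrix(weight, nrow=5, ncol=2)

length(weight)            # a vector only has a length
dim(matrixWeightData)     # a matrix has rows and columns

weight[6]                 # the sixth value of the vector
matrixWeightData[1, 2]    # row 1, column 2 of the matrix (the same value)
matrixWeightData[, 2]     # everything in column 2

# matrix() fills column by column unless you say otherwise
byRowData = matrix(weight, nrow=5, ncol=2, byrow=TRUE)
byRowData[1, ]            # the first row now holds the first two weights
```

The column-by-column filling is why the sixth weight ended up at the top of the second column in the output above.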

Reading in Data

You should have received a dataset called “workout.csv.” Store that file in the same folder where you saved your “weightloss” script. Now, make sure R is open, then click on the menu called “Misc” and click on “Change Working Directory,” just like the picture below.

[[insert image]]

Next, navigate to the location where you have been using your R files. For me, that was located in the documents folder. If you’re on a PC, you’ll have to first click on the left window, then click on File->Change Dir…

Here’s what you just did. By default, if you ask R to find a particular file, it will search wherever its default directory is. That’s a problem when it doesn’t default to the folder where your file is. All we did was change its default directory (R calls this the working directory). Now when you tell R to find the file “workout.csv," it will know where to find it.
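The same thing can be done with code instead of the menu, using the functions getwd and setwd. Here’s a minimal sketch. I’m using R’s temporary folder (tempdir()) as a stand-in so the example runs on any machine; you would put the path to your own folder there instead. The names oldDir and exampleDir are just placeholders.

```r
oldDir = getwd()          # where R is currently looking for files
exampleDir = tempdir()    # stand-in folder; replace with your own path

setwd(exampleDir)         # same effect as the Change Working Directory menu
getwd()                   # confirm that the change took effect
list.files()              # files R can now find by name alone

setwd(oldDir)             # go back to where we started
```

Putting a setwd line at the top of a script means you don’t have to visit the menu every time you reopen R.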

Now that we’ve changed the default directory, let’s tell R to import the file. I’ll be using my right window so I can edit it later. You can continue using the same script that you used before, or you can create a new one.

weightLoss = read.csv("workout.csv")
head(weightLoss)

Here’s what I’m telling R for the first line of code: “Hey R, in the default directory you should find a file called `workout.csv.’ Open that file and put all of its contents in a variable called weightLoss.”

The second line of code (i.e., head(weightLoss)) simply tells R to return the first 6 rows of the data. Now if I run the code, the left window will show

weightLoss = read.csv("workout.csv")
head(weightLoss)
##   ExerciseHours WeightLoss
## 1           6.1        2.7
## 2           4.8        2.7
## 3           4.6        2.7
## 4           4.1        1.2
## 5           4.3        3.5
## 6           4.9        2.5

So now we’ve got an object called “weightLoss" that has two columns: the first records how many hours a week a person exercised, and the second column records their weight loss for that week. Strictly speaking, read.csv returns a data frame rather than a matrix, but like a matrix it has both rows and columns, so you can picture it the same way.

R can read in data directly if it comes in csv form. If your data come in another form, such as an Excel workbook, the simplest route is to use Excel to save a copy as csv first. Base R can’t read Excel files on its own, although add-on packages exist for that.
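If you’d like to see the whole csv round trip without relying on workout.csv, here’s a small sketch that writes a tiny csv file and reads it back in. The numbers are made up purely for illustration (they are not the real workout data), and toyData, toyPath, and toyRead are names I invented.

```r
# made-up values, only for illustration (not the real workout.csv)
toyData = data.frame(ExerciseHours = c(2.0, 3.5, 5.0),
                     WeightLoss    = c(1.0, 2.0, 3.5))

# write the data to a temporary csv file, then read it back in
toyPath = tempfile(fileext = ".csv")
write.csv(toyData, toyPath, row.names=FALSE)
toyRead = read.csv(toyPath)

class(toyRead)       # what kind of object read.csv gave us
str(toyRead)         # column names and types at a glance
head(toyRead, n=2)   # head() takes an optional number of rows
nrow(toyRead)        # number of rows
```

The function str is a handy first check after importing any file, since it shows whether each column was read as numbers or as text.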

Regression

Let’s continue to work with the dataset you imported (i.e., weightLoss). First let’s compute the mean of the two variables

mean(weightLoss$ExerciseHours)
## [1] 4.953
mean(weightLoss$WeightLoss)
## [1] 3.09

Be careful to watch capitalization–R is case sensitive. Also, you’ll notice that I’m only showing you the output (i.e., the left-side window). I’m actually writing it in my right window, but am only showing what happens in the left window to save space.

So, the mean amount of time spent exercising is around 4.95 hours. Also, the average amount of weight loss was around 3.1 pounds. Let’s see how large the sample is. To do that, we’ll use the function “nrow,” which is short for number of rows (which is the sample size).

nrow(weightLoss)
## [1] 30

So, there are 30 individuals who participated in this study. Let’s compute the correlation. To do so, we’ll use the function “cor.”

cor(weightLoss)

That function returns a correlation matrix. The correlation is moderately high, with a value around 0.58.
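Here’s a quick sketch of what a correlation matrix is and how to pull a single number out of it. The data are made up for illustration only, and hours, loss, toyData, and corMatrix are my own names.

```r
# made-up numbers, for illustration only
hours = c(2, 3, 4, 5, 6, 7)
loss  = c(1.1, 2.3, 2.0, 3.6, 3.9, 5.2)
toyData = data.frame(hours, loss)

cor(hours, loss)              # two vectors in, one correlation out

corMatrix = cor(toyData)      # a whole dataset in, a 2 x 2 matrix out
corMatrix

corMatrix["hours", "loss"]    # pick out the one correlation we care about
```

The diagonal of the matrix is always 1 (each variable correlates perfectly with itself), and the two off-diagonal entries are identical. With the tutorial data, the equivalent would be cor(weightLoss)["ExerciseHours", "WeightLoss"].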

Now let’s go ahead and run a regression analysis. To do that we write

lm(WeightLoss~ExerciseHours, data=weightLoss)
## 
## Call:
## lm(formula = WeightLoss ~ ExerciseHours, data = weightLoss)
## 
## Coefficients:
##   (Intercept)  ExerciseHours  
##        -0.913          0.808

By default, printing the result of lm shows only the call and the coefficients. We can ask for more by assigning the regression model to an object. For example

model = lm(WeightLoss~ExerciseHours, data=weightLoss)

Now R stores all the information about the regression into an object called “model." We can now ask R to report several things such as

##### output a summary of the model
summary(model)
## 
## Call:
## lm(formula = WeightLoss ~ ExerciseHours, data = weightLoss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6893 -0.4752 -0.0505  0.6501  1.2723 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.913      1.066   -0.86  0.39901    
## ExerciseHours    0.808      0.213    3.79  0.00073 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.781 on 28 degrees of freedom
## Multiple R-squared:  0.339,  Adjusted R-squared:  0.316 
## F-statistic: 14.4 on 1 and 28 DF,  p-value: 0.000735
##### give me just the intercept and slope of the model
model$coefficients
##   (Intercept) ExerciseHours 
##       -0.9126        0.8081
##### give me the residual standard error (the square root of the conditional variance)
summary(model)$sigma
## [1] 0.7814

Notice the use of the pound signs (#####). They tell R to ignore the rest of that line. In other words, they are simply comments to myself.
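Two of the numbers in that summary are connected to things we’ve already seen, and a small sketch may help show how. The data below are simulated (made up by the computer), so the numbers won’t match the workout data; every name starting with toy is mine.

```r
set.seed(42)                                  # make the fake data reproducible
toyHours = runif(30, min=3, max=7)
toyLoss  = -1 + 0.8*toyHours + rnorm(30, sd=0.8)

toyModel   = lm(toyLoss ~ toyHours)
toySummary = summary(toyModel)

coef(toyModel)               # same values as toyModel$coefficients
toySummary$sigma             # residual standard error
toySummary$sigma^2           # squaring it gives the conditional variance
toySummary$r.squared         # the "Multiple R-squared" line
cor(toyHours, toyLoss)^2     # equals r.squared when there is one predictor
```

That last line is why the correlation of about 0.58 reported earlier goes with the R-squared of 0.339 in the summary: the square root of 0.339 is roughly 0.58. Likewise, squaring the 0.7814 above gives a conditional variance of roughly 0.61.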

For reporting, we can round the coefficients to four digits and store them:

cf = round(model$coefficients, digits=4)

Using the results from \(model\$coefficients\), we see that the best-fitting regression equation is \(\widehat{\text{Weight Loss}} = -0.9126 + 0.8081\times\text{Exercise Hours}\). In other words, with no exercise, we’re expected to lose approximately -0.9 pounds (i.e., we’re expected to gain a little bit). For every additional hour we exercise, we’re expected to lose about 0.8 more pounds.
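To see the equation in action, here’s a short worked example that plugs a few exercise amounts into it. I’ve typed the intercept and slope in by hand from the output above; the hours I chose (0, 2, and 5) are arbitrary.

```r
intercept = -0.9126     # reported intercept
slope     =  0.8081     # reported slope (pounds per hour of exercise)

hours = c(0, 2, 5)      # arbitrary example values
predictedLoss = intercept + slope * hours
predictedLoss           # -0.9126  0.7036  3.1279
```

If the model object from above is still in your workspace, predict(model, newdata=data.frame(ExerciseHours=c(0, 2, 5))) should give essentially the same answers without retyping the coefficients (tiny differences can come from rounding).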

Let’s go ahead and look at a scatterplot of the data with a regression line in red.

plot(weightLoss)
abline(model, col="red")

The first line tells R to plot the pairs of data points. The second line (abline(...)) tells R to add a straight line with intercept a and slope b (hence, abline). When we hand it the object called “model," which contains the results of the regression, abline pulls the intercept and slope out of it and draws the fitted line. Finally, col="red" tells it to draw the line in red.
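Here’s a small sketch showing that handing abline a model and handing it an intercept and slope directly amount to the same thing. The data are simulated for illustration, and toyData, toyModel, a, and b are names I picked.

```r
set.seed(1)                                   # reproducible made-up data
toyData = data.frame(hours = 1:10)
toyData$loss = 2 + 0.5*toyData$hours + rnorm(10, sd=0.3)
toyModel = lm(loss ~ hours, data=toyData)

plot(toyData)                                 # scatterplot of the two columns
abline(toyModel, col="red")                   # line taken from the model

a = coef(toyModel)[1]                         # intercept
b = coef(toyModel)[2]                         # slope
abline(a=a, b=b, lty=2)                       # same line, drawn dashed on top

abline(h=mean(toyData$loss), col="gray")      # h= draws a horizontal line
```

The dashed line lands exactly on top of the red one. Keep in mind that abline only adds to an existing plot, so it has to come after a call to plot.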

Functions

A function is a set of instructions to the computer. It receives input, then spits out output. For example, we used the function “mean" to compute the mean of the weight dataset. It received an input (the vector called weight) and spit out an output (the mean). Also the plot function received an input (a vector or a matrix) and returned an output (a graph).

Sometimes a function returns multiple outputs, such as the \(lm\) function. (Recall that it spit out the slope and intercept parameters, the residual standard error, a summary, etc.) Also, sometimes functions require multiple inputs. Again, the \(lm\) function was one such example (we had to input a regression equation and the dataset).

The Table below lists some of the functions we have learned so far. In one column we show what the inputs are and in the other we show what the outputs are.

Sometimes, however, you may forget what the inputs and/or outputs are. There’s a simple way to access that information. Let’s see what the inputs/outputs are for the cor function.

?cor

Notice when you did that, either a window popped up or a new webpage in your browser appeared. It should look like this:

[[insert image]]

Whenever you put a question mark in front of a function name and then run the command, R will automatically bring up the documentation for it. The Description section is self-explanatory: it tells you what the function does and may give some other relevant information. The Usage section tells what arguments the function takes. You’ll notice it says under the cor function “x, y = NULL,” etc. If you’re ever confused about what an argument means, you can read about it in the Arguments section. For example, if we were unsure of what \(x\) was supposed to be, we would read, “a numeric vector, matrix, or data frame." Notice that, although we only supplied one argument before (i.e., the weightLoss dataset), there are several more arguments we could have passed it. If you’re interested in what those arguments are, keep reading the Arguments section.

If you scroll down, you will notice another section called \(Examples\). Not surprisingly, this section will give you examples about how to use the \(cor\) function.

The important point to take away from this is that any function you use has arguments (inputs) and returns a result (outputs). If you ever have questions about what inputs/outputs are attached to a function or how to use it, type a question mark before the function and run the command.
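If you just want a quick reminder of a function’s inputs without opening the full documentation, the sketch below shows one way to do it, followed by an example of passing cor one of its extra arguments. The x and y values are made up for illustration.

```r
args(cor)                 # prints the argument list shown under Usage
names(formals(cor))       # just the argument names

# made-up numbers, for illustration only
x = c(1, 2, 3, 4, 5)
y = c(1, 4, 9, 16, 25)    # y always rises with x, but not in a straight line

cor(x, y)                         # default method (Pearson)
cor(x, y, method="spearman")      # same inputs, rank-based correlation
```

Arguments that show a default in the Usage section (like method) can be left out, which is why we got away with supplying only one input earlier.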

Packages

Perhaps the coolest thing about R is the packages. So what are packages? To understand, let’s suppose your name is Greg and you start using R because it’s the cool thing to do. As you begin to test its functionality, you’re quite surprised that, in order to do a simple \(Z\) test, you have to do a lot of programming:

  #### here's my sample data--Greg
x = c(7,11,9,12,13,8,12,12,10,9)
  #### these are the values of mu and stdev.
  #### Is my sample different from 7?--Greg
mu = 7
stdev = 2
xbar = mean(x)
  #### compute the z score--Greg
z.score = (xbar-mu)/(stdev/sqrt(length(x)))
  #### look up the two-sided p value ("different from" means either direction)
p.value = 2*pnorm(abs(z.score), lower.tail=FALSE)
  #### make a conclusion
if (p.value < .05) {
    cat("Reject the Null")
} else {
    cat("Fail to Reject the Null")
}

Notice how Greg has been wise with his generous use of comments (recall that a comment is anything that follows one or more pound signs). If Greg runs the code, he gets the following:

  #### here's my sample data--Greg
x = c(7,11,9,12,13,8,12,12,10,9)
  #### these are the values of mu and stdev.
  #### Is my sample different from 7?--Greg
mu = 7
stdev = 2
xbar = mean(x)
  #### compute the z score--Greg
z.score = (xbar-mu)/(stdev/sqrt(length(x)))
  #### look up the two-sided p value ("different from" means either direction)
p.value = 2*pnorm(abs(z.score), lower.tail=FALSE)
  #### make a conclusion
if (p.value < .05) {
    cat("Reject the Null")
} else {
    cat("Fail to Reject the Null")
}
## Reject the Null

But that seems like a whole lotta work just to make a simple conclusion. He decides to write his own function (which we’ll get to later) so that he doesn’t have to write so many lines of code every time he wants to do a z-test. Now, all Greg has to do is two lines of code:

## Loading required package: TeachingDemos
x = c(7,11,9,12,13,8,12,12,10,9)
z.test(x, mu=9, stdev=2)
## 
##  One Sample z-test
## 
## data:  x
## z = 2.055, n = 10.000, Std. Dev. = 2.000, Std. Dev. of the sample
## mean = 0.632, p-value = 0.03983
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
##   9.06 11.54
## sample estimates:
## mean of x 
##      10.3

(Note: many other features have been added beyond a simple “reject the null.”) After writing the function, he’s quite proud of it and thinks, “I can’t be the first person who wanted to do a z-test in R. There’s got to be others out there. I bet they’d appreciate it if I made my function available to them!”

So Greg decides to make his z-test function available online. When someone bundles their function (or set of functions) so that others can install it, for example through R’s package repository (CRAN), the bundle is called a package. The cool thing about packages is that there are thousands upon thousands of R users, many of whom write packages and publish them online. That means that R has a lot of added functionality that it otherwise would not because somebody somewhere decided to publish a package.
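For a sense of what a homemade function like Greg’s might look like, here’s a bare-bones sketch. To be clear, simpleZTest is a hypothetical name I made up; it is not the z.test function from any package, and it leaves out the confidence interval and other extras shown in the output above.

```r
# Hypothetical helper for illustration, not the z.test from any package
simpleZTest = function(x, mu, stdev) {
    xbar = mean(x)
    z.score = (xbar - mu) / (stdev / sqrt(length(x)))
    # two-sided p value: the area in both tails beyond |z|
    p.value = 2 * pnorm(abs(z.score), lower.tail=FALSE)
    # hand back several outputs at once, bundled in a list
    list(z=z.score, p.value=p.value, reject=(p.value < .05))
}

x = c(7,11,9,12,13,8,12,12,10,9)
result = simpleZTest(x, mu=9, stdev=2)
result$z          # about 2.055
result$p.value    # about 0.0398
result$reject
```

The inputs go in the parentheses after the word function, and whatever the last line produces becomes the output. Here the z score and p value agree with the z.test output shown earlier.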

Installing a Package

As I said, there are lots of R packages available online. But the good news is that you don’t have to go searching for them to download them. Instead, you can download them right from R’s interface. Let’s go ahead and download a z-test function from a package called TeachingDemos. The author of this package is Greg Snow (no relation to our fictitious Greg of earlier). To download the package, type the following

install.packages("TeachingDemos")

A new window may pop up asking you to pick a repository. A “repository” is just a fancy word for “a place to download the package from.” Then it will download the package and make it available for you to load. To load the package (i.e., to make it available for use), you simply type

require(TeachingDemos)

Now you’re ready to use the z.test function by Greg Snow. Thanks Greg!
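Since a package only needs to be installed once but has to be loaded every time you open R, it can be handy to check whether it’s already on your machine. Here’s one possible sketch; pkg and isInstalled are just names I chose, and the example deliberately doesn’t install anything by itself.

```r
pkg = "TeachingDemos"

# TRUE if the package is already installed on this machine, FALSE otherwise
isInstalled = requireNamespace(pkg, quietly=TRUE)

if (isInstalled) {
    library(pkg, character.only=TRUE)   # load it, like require(TeachingDemos)
} else {
    message("Not installed yet; run install.packages(\"", pkg, "\") first")
}
```

You’ll see both library and require used for loading packages. The main difference is that library stops with an error when the package is missing, while require gives a warning and returns FALSE.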

Practice

For each of the questions, remember that you can always look at the documentation to find out more information about the functions. Also, I recognize that it would be nearly impossible to determine how to do this practice with only the information I’ve provided. Consequently, I recommend becoming quite familiar with Google.

1. Using the dataset titled zdata.csv, do a z-test to determine whether the sample data differs from the population mean. Assume \(\mu=100\) and \(\sigma=15\). Be sure to report the \(z_{obt}\) and \(p_{obt}\) as well as to state the conclusion.

2. Using the dataset called tdata.csv, conduct a two independent sample t-test.

3. Using that same dataset, conduct a two dependent sample t-test.

4. Using the dataset avdata.csv, conduct an anova to determine whether the three groups differ. Be sure to report the F statistic and a p value.
