# Introduction

### What is R?

Very few researchers compute statistics by hand. Most make computers do their statistics for them. Also, some statistical routines are impossible to do without computers.

Several programs are available for data analysis, including SPSS, Excel, SAS, Matlab, Minitab, and R. Several of these programs (such as SPSS, Excel, and Minitab) have a familiar and easy-to-use interface: you simply point and click in order to analyze your data.

However, ease of use often comes at the expense of flexibility. Some programs (such as Excel or Minitab) only allow the user to do basic statistical analysis and data manipulation. The “big dawgs,” on the other hand, like to use programs that are much more flexible, such as SAS and R.

### Programming Statistics

While Excel and SPSS have point-and-click interfaces, R does not. Instead, the user must program their statistical analysis. For example, performing a regression analysis in Excel is a simple matter of pointing and clicking. R, on the other hand, must be told through a computer language how to perform the statistical analysis:

lm(y~x, data=testData, subset=(group==1))

Don’t worry about interpreting the code just yet, but you can see that things become a little more complicated when the user must code their statistical analysis rather than pointing and clicking. However, with that added complication comes added flexibility.
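If you’d like to run that line right away, here is a minimal sketch. The tutorial doesn’t supply `testData`, so the data frame below is made up purely for illustration, with the column names `x`, `y`, and `group` assumed from the call above. There’s no need to understand every piece yet.

```r
# Hypothetical data: testData is not provided by the tutorial, so we invent it.
set.seed(1)                          # make the made-up numbers reproducible
testData = data.frame(
  x     = 1:20,
  group = rep(c(1, 2), each = 10)    # first 10 rows are group 1, last 10 are group 2
)
testData$y = 2 + 0.5 * testData$x + rnorm(20)

# Regress y on x, using only the rows where group equals 1
fit = lm(y ~ x, data = testData, subset = (group == 1))
fit
nobs(fit)   # number of rows actually used in the fit
```

The `subset` argument is what restricts the analysis to one group; the other rows stay in `testData` untouched.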

Rather than explaining what R is, it’s probably better for you to dive in. In the next chapter, we will show you how to perform some rudimentary R operations to give you a feel for how to “speak” the R language. After that, we’ll dive into some more advanced R programming.

### What This Tutorial Is and Is Not

This tutorial does not provide comprehensive documentation of the R language. Instead, it’s a brief introduction to the language to help you get your “feet wet.” I have learned several programs in my lifetime, and I’ve found the best way to learn them is to have a short introduction, then spend the rest of the time messing around with it. So, with that, let’s wet those feet of yours!

### Installation

[[insert image]]

Next you will be taken to a screen where you can actually download R. Click on the top-most link (make sure you have an operating system that meets the minimum requirements). If your OS doesn’t meet the minimum requirements, then choose a file that does.

[[insert image]]

Then you wait! The download shouldn’t take too long. After it downloads, open the file and allow R to install on your machine. Congrats! You just installed R!

### The R Interface

When you first open R, it will look like the figure below. You could start using R at this point by typing in commands at the point where the > is located. Let’s go ahead and try that.

[[insert image]]

The first thing that we’ll do is create a variable called hello and give it the value of 107. To do that, we would do the following:

hello = 107
hello
## [1] 107

Notice how it returned the value “107” back to you. All you did was create something (we’ll call it a variable) called “hello” and assign it a value of 107. Now let’s try something else. Let’s create a variable called weight:

 weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
weight
##  [1] 143 137 137 131 129 125 125 124 124 120

Suppose that this represents your weekly weight over a 10-week period. (Kudos to you! You’re learning R and losing weight!) So here’s what we did: we created a variable called weight, but rather than holding a single value, it actually contains 10 values. This is called a vector. That’s why we put “c(…)” there: the c stands for “concatenate,” which is an overly technical way to say, “I put a bunch of things into a single object.”
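To make the idea of a vector a little more concrete, here is a small sketch of a few things you can ask about one. The names `firstHalf` and `secondHalf` are made up for this example, and the square brackets are a way to pick out values by position.

```r
# The same vector from the text
weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)

length(weight)    # how many values the vector holds
weight[1]         # the first value (R counts from 1)
weight[c(1, 10)]  # the first and last values together

# c() can also glue existing vectors together
firstHalf  = c(143, 137, 137, 131, 129)
secondHalf = c(125, 125, 124, 124, 120)
identical(c(firstHalf, secondHalf), weight)
```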

Since weight now contains several numbers, we can start doing statistics on it. For example, we may compute the mean of it.

mean(weight)
## [1] 129.5

(Don’t forget to hit enter.) Or we could compute the standard deviation

sd(weight)
## [1] 7.367

or the median

median(weight)
## [1] 127

Or we could plot it

plot(weight)

At this point, hopefully your results matched mine. If so, good job! You’ve just done your first bit of R programming.
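If you’re curious what mean and sd are doing behind the scenes, here is a rough sketch that computes both by hand. The variable names `n`, `byHandMean`, and `byHandSD` are my own, chosen just for this example.

```r
weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)

n = length(weight)

# Mean: add everything up, divide by how many values there are
byHandMean = sum(weight) / n

# Standard deviation: sd() divides the squared deviations by n - 1
byHandSD = sqrt(sum((weight - byHandMean)^2) / (n - 1))

byHandMean
byHandSD

# Both should agree with the built-in functions
all.equal(byHandMean, mean(weight))
all.equal(byHandSD, sd(weight))
```

Notice that `weight - byHandMean` subtracts the mean from every value at once. R works on whole vectors like this without you having to loop over them.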

Now go ahead and close the plot window (after spending several minutes admiring it!). Now let’s suppose that you should have put the value 135 in the third position, not 137. Oops. To correct it, there are several things you could do. Here’s one way: you push the up arrow button six times until R shows the following:

weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)

Then you could click on the third 137, change it to 135, then hit enter. Go ahead and do it. In this example, that was easy because you noticed the error only 6 steps into your code. But suppose you had done 100 things between creating the variable and noticing the error. You’d have to push the up-arrow button 100 times! So, let me show you a much better way of keeping track of your code and keeping a better record of what you’re doing.

The first thing we’ll do is hit command-N (if you’re on a PC, you’ll have to go to the file menu, then click on “New script”). Notice that a new window pops up.

[[insert image]]

The good thing about this second window is that it allows you to write code, save it, edit it, then selectively run it. Let me give you an example. On the right window (the one that you just opened), write the following code just as you did before:

 weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
mean(weight)
sd(weight)
median(weight)
plot(weight)

Then go ahead and hit command-S (or control-S) to save the document. Save it somewhere you’ll remember. I saved mine in my documents folder and called it “weightloss.” When you save a script (which is what this window is called), then you can access and run it later. Now that it’s saved, let’s go ahead and have R run the script. To do so, click the mouse anywhere on the first line and hit command-enter (on a Mac) or control-R (on a PC). Notice that on the left window it has the following:

 weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)

So, what just happened? What you did was create your code in the right window, then send that line to the left window. To run all the code, click on the right window, hit command-A (or control-A) to highlight all your code, then hit command-enter (or control-R). Notice how all of your code is run in one go, one line after another.

The good thing about this method (i.e., the method of having two windows) is that you have one window where you can edit/write your code and the other window is dedicated to running the code. That way, if you need to make an edit, you don’t have to keep pushing the up arrow every time you need to change something. Instead, you edit it in the right window, then execute the code (by pushing command-enter or control-R).
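The correction from earlier (135 instead of 137 in the third position) is another of the “several things you could do,” and it doesn’t require retyping the whole line. Here is a minimal sketch that replaces a single value by its position using square brackets:

```r
weight = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)

weight[3]          # the value currently in the third position
weight[3] = 135    # overwrite just that one value
weight

mean(weight)       # statistics now reflect the corrected data
```

In a script, you could just as easily edit the number in the first line and run everything again. Either way, the fix is saved alongside the rest of your code.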

# Basic Commands

### Scalars, Vectors, and Matrices

Let me start by informing you how much you know. You already know what a scalar and a vector are. Remember the variable “hello”? That was a scalar. A scalar is a variable that contains only one value (in that case it was 107). The variable “weight,” on the other hand, was a vector: it contained multiple values. There is still one other type of variable: a matrix. Aside from being a deceptive reality that people like Neo and Morpheus battle to overthrow, a matrix also has special meaning in R. When you think of a matrix, think of a spreadsheet. If you’re unfamiliar with what a spreadsheet is, think of a table with columns and rows. A vector doesn’t have columns and rows; it’s just a list of numbers. Matrices, on the other hand, do.

Let’s try to get a visual of what a matrix looks like. In your file that you created (the one that I called “weightloss”), write the following after the line that plots the data:

weightData = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
matrixWeightData = matrix(weightData, nrow=5, ncol=2)
matrixWeightData

After that, highlight the lines that you just wrote, then hit command-enter. On the left window, you should see the following

 weightData = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
matrixWeightData = matrix(weightData, nrow=5, ncol=2)
matrixWeightData
##      [,1] [,2]
## [1,]  143  125
## [2,]  137  125
## [3,]  137  124
## [4,]  131  124
## [5,]  129  120

Before you get overwhelmed by what I just did, let me translate it from R into plain English. Here’s what R heard: “Alright R, I want you to create a new vector called weightData. Give it the values 143, 137, 137, 131, 129, 125, 125, 124, 124, and 120. Then, R, I want you to create a matrix called matrixWeightData that contains the exact same information as weightData, but I want you to give the matrix 5 rows and 2 columns. Got it R? Ok, then I want you to show me what matrixWeightData looks like. Now, go!”

Notice that both the variables “weight” and “matrixWeightData” contain the exact same information. The only difference is that weight is a vector, and matrixWeightData is a matrix. So that you can see them right next to each other, type the following. Notice how I’m typing it into the left window because I feel no need to save this information, but you can if you want to.

weight
##  [1] 143 137 137 131 129 125 125 124 124 120
matrixWeightData
##      [,1] [,2]
## [1,]  143  125
## [2,]  137  125
## [3,]  137  124
## [4,]  131  124
## [5,]  129  120

Again, the variables have exactly the same information, but one is contained as a vector and the other is contained as a matrix.
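Here is a short sketch of a few things that rows and columns let you do. It also shows that matrix fills the numbers down the columns unless told otherwise. The name `byRowData` is made up for this example.

```r
weightData = c(143, 137, 137, 131, 129, 125, 125, 124, 124, 120)
matrixWeightData = matrix(weightData, nrow = 5, ncol = 2)

dim(matrixWeightData)     # rows, then columns
matrixWeightData[3, 2]    # row 3, column 2
matrixWeightData[, 1]     # all of column 1 (comes back as a vector)

# By default matrix() fills down the columns; byrow = TRUE fills across rows
byRowData = matrix(weightData, nrow = 5, ncol = 2, byrow = TRUE)
byRowData
```

Leaving the row position empty, as in `matrixWeightData[, 1]`, means “every row.”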

You should have received a dataset called “workout.csv.” Store that file in the same folder that you saved your “weightloss” script in. Now, make sure R is open, then click on the menu called “Misc,” then click on “Change Working Directory,” just like the picture below.

[[insert image]]

Next, navigate to the location where you have been using your R files. For me, that was located in the documents folder. If you’re on a PC, you’ll have to first click on the left window, then click on File->Change Dir…

Here’s what you just did. By default, if you ask R to find a particular file, it will search wherever its default directory is. That’s a problem when it doesn’t default to the folder where your file is. All we did was change its default directory. Now when you tell R to find the file “workout.csv,” it will know where to find it.
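The same thing can be done from code instead of the menus. Below is a minimal sketch using getwd and setwd. I’m using R’s temporary folder as a stand-in so the example runs anywhere; you would substitute the path to your own folder.

```r
# Where is R currently looking for files?
oldDir = getwd()
oldDir

# Point R at a different folder. tempdir() is a stand-in here; in practice
# you would use the path to your own folder, e.g. setwd("~/Documents")
# (that path is only an example).
setwd(tempdir())
getwd()

# List the csv files R can see in the current working directory
list.files(pattern = "\\.csv$")

# Go back to where we started
setwd(oldDir)
```

Putting a setwd line at the top of a script means you don’t have to click through the menus each time you open R.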

Now that we’ve changed the default directory, let’s tell R to import the file. I’ll be using my right window so I can edit it later. You can continue using the same script that you used before, or you can create a new one.

weightLoss = read.csv("workout.csv")
head(weightLoss)

Here’s what I’m telling R for the first line of code: “Hey R, in the default directory you should find a file called ‘workout.csv.’ Open that file and put all of its contents in a variable called weightLoss.”

The second line of code (i.e., head(weightLoss)) simply tells R to return the first 6 rows of the data. Now if I run the code, the left window will show

weightLoss = read.csv("workout.csv")
head(weightLoss)
##   ExerciseHours WeightLoss
## 1           6.1        2.7
## 2           4.8        2.7
## 3           4.6        2.7
## 4           4.1        1.2
## 5           4.3        3.5
## 6           4.9        2.5

So now we’ve got a table called “weightLoss” that has two columns: the first records how many hours a week a person exercised, and the second records their weight loss for that week. Like a matrix, it has both rows and columns. (Strictly speaking, read.csv returns what R calls a data frame, but for now you can think of it as a matrix.)
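Here is a quick sketch of how you might inspect what was just imported. Since you may not have the file handy, the example builds a stand-in from only the six rows printed above and writes it to a temporary file. The names `sixRows` and `csvPath` are made up for illustration.

```r
# Stand-in for workout.csv: only the six rows printed above, written to a
# temporary file so this example runs on its own.
sixRows = data.frame(
  ExerciseHours = c(6.1, 4.8, 4.6, 4.1, 4.3, 4.9),
  WeightLoss    = c(2.7, 2.7, 2.7, 1.2, 3.5, 2.5)
)
csvPath = tempfile(fileext = ".csv")
write.csv(sixRows, csvPath, row.names = FALSE)

weightLoss = read.csv(csvPath)

class(weightLoss)        # what kind of object R created
dim(weightLoss)          # rows, then columns
names(weightLoss)        # the column names
head(weightLoss, n = 3)  # ask for a different number of rows

# Pull out one column by name with the dollar sign
weightLoss$ExerciseHours
```

With your real file, you would skip the stand-in and call read.csv("workout.csv") exactly as in the text.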

You can always use R to read in data if it comes in csv form. If your data do not come in csv form, the simplest route is to use Excel to convert it to csv. R doesn’t read Excel files on its own without add-on packages.

### Regression

Let’s continue to work with the dataset you imported (i.e., weightLoss). First let’s compute the mean of the two variables

mean(weightLoss$ExerciseHours)
## [1] 4.953
mean(weightLoss$WeightLoss)
## [1] 3.09

Be careful to watch capitalization–R is case sensitive. Also, you’ll notice that I’m only showing you the output (i.e., the left-side window). I’m actually writing it in my right window, but am only showing what happens in the left window to save space.

So, the mean amount of time spent exercising is around 4.95 hours. Also, the average amount of weight loss was around 3.09 pounds. Let’s see how large the sample is. To do that, we’ll use the function “nrow,” which is short for number of rows (which is the sample size).

nrow(weightLoss)

So, there are 30 individuals who participated in this study. Let’s compute the correlation. To do so, we’ll use the function cor.

cor(weightLoss)

That function returns a correlation matrix. The correlation is moderately high, with a value around 0.58.
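To show what a correlation matrix looks like and how to pull a single number out of it, here is a small sketch. It uses only the six rows printed by head earlier, so the correlation it produces will not match the 0.58 from the full file. The names `sixRows`, `corMatrix`, and `r` are made up for this example.

```r
# Only the six rows shown by head(); the tutorial's 0.58 comes from the full file
sixRows = data.frame(
  ExerciseHours = c(6.1, 4.8, 4.6, 4.1, 4.3, 4.9),
  WeightLoss    = c(2.7, 2.7, 2.7, 1.2, 3.5, 2.5)
)

corMatrix = cor(sixRows)
corMatrix                 # 1s on the diagonal, the correlation off the diagonal

# The same number from two vectors, and pulled out of the matrix by name
r = cor(sixRows$ExerciseHours, sixRows$WeightLoss)
all.equal(r, corMatrix["ExerciseHours", "WeightLoss"])
```

The matrix is symmetric: the correlation of exercise with weight loss is the same as the correlation of weight loss with exercise.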

Now let’s go ahead and run a regression analysis. To do that we write

lm(WeightLoss~ExerciseHours, data=weightLoss)
##
## Call:
## lm(formula = WeightLoss ~ ExerciseHours, data = weightLoss)
##
## Coefficients:
##   (Intercept)  ExerciseHours
##        -0.913          0.808

When you run a function on its own, R prints only its default output, which here is just the call and the coefficients. We can be a little more specific about what we want by assigning the regression model to an object. For example

model = lm(WeightLoss~ExerciseHours, data=weightLoss)

Now R stores all the information about the regression into an object called “model.” We can now ask R to report several things such as

##### output a summary of the model
summary(model)
##
## Call:
## lm(formula = WeightLoss ~ ExerciseHours, data = weightLoss)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.6893 -0.4752 -0.0505  0.6501  1.2723
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -0.913      1.066   -0.86  0.39901
## ExerciseHours    0.808      0.213    3.79  0.00073 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.781 on 28 degrees of freedom
## Multiple R-squared:  0.339,  Adjusted R-squared:  0.316
## F-statistic: 14.4 on 1 and 28 DF,  p-value: 0.000735
##### give me just the intercept and slope of the model
model$coefficients
##   (Intercept) ExerciseHours
##       -0.9126        0.8081
##### give me the residual standard error (the conditional standard deviation)
summary(model)$sigma
## [1] 0.7814

Notice the use of the pound signs (#####). That tells R to ignore everything after them on that line. In other words, they are simply comments to myself.
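Once the model is stored in an object, you can also use it rather than just print it. The sketch below fits the same kind of model to only the six rows printed by head earlier, so its estimates will differ from the ones above. The names `sixRows` and `newPerson`, and the value of 5 hours, are made up for illustration.

```r
# Only the six rows shown by head(), so the estimates will differ from the
# tutorial's (-0.913 and 0.808), which come from the full file.
sixRows = data.frame(
  ExerciseHours = c(6.1, 4.8, 4.6, 4.1, 4.3, 4.9),
  WeightLoss    = c(2.7, 2.7, 2.7, 1.2, 3.5, 2.5)
)
model = lm(WeightLoss ~ ExerciseHours, data = sixRows)

coef(model)                 # same values as model$coefficients
names(model)                # everything stored inside the model object

# Predicted weight loss for someone who exercises 5 hours a week
# (5 is an arbitrary value chosen for illustration)
newPerson = data.frame(ExerciseHours = 5)
predict(model, newdata = newPerson)

# The prediction is just intercept + slope * hours
coef(model)[1] + coef(model)[2] * 5
```

The last two results should match, which is a handy check that you’re reading the intercept and slope the right way around.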