Introduction

The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the textbook and also to analyze real data and come to informed conclusions. To straighten out which is which:

If you are using RStudio Cloud, you should be all set (although please let me know if your account does not work), but if you want to work locally on your computer, you need to install both R and RStudio. - Download and install RStudio Desktop - Download and install R

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

The panel in the upper right contains your environment as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.

The panel on the left is where the action happens. It’s called the console. Everytime you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

To get you started, enter all commands at the R prompt (i.e. right after > on the console); you can either type them in manually or copy and paste them from this document.

R has a number of additional packages that help you perform specific tasks. For example,dplyr is an R package designed to simplify the process of data wrangling, and ggplot2 is for data visualization based on the Grammer of Graphics (a famous book).

In order to use a packages, they must be installed (you only have to do this once, and I’ve done it already for you with many packages) and loaded (you have to do this every time you start an R session).

If you have not installed the packages already (particularly if you are running R and RStudio on your local machine), run the commands below. It may take a few minutes; you’ll know when the packages have finished installing when you see the R prompt (>) return in your console.

install.packages("tidyverse")
install.packages("skimr")
install.packages("Stat2Data")
install.packages("car")

Now that we have some packages, we can start doing some Explotatory Data Analysis

Exploratory Data Analysis

We are going to be looking at some plots and summary statistics as part of the technique of Exploratory Data Analysis (EDA). EDA was named by John Tukey in the 1960s, and continues to be exceedingly useful today. Essentially, Tukey was advocating getting familiar with data before beginning modeling, so you don’t run into errors that are easy to catch visually but hard to catch numerically.

To begin, we need to load packages to use in our session. We can do this either with library() or require(). I try to be consistent, but sometimes change it up.

library(tidyverse)
library(skimr)
library(car)

Then, we need some data. For this lab I chose the Salaries dataset, because it comes with R and had the right number of variables for the lab. Plus, it has to do with college professors.

Since the data is already in R, we can access it with the data() command,

data(Salaries)

Notice what happened in your environment. R uses lazy evaluation, so it’s not going to load the data in until we actually do something with it.

So, let’s start by looking at it.

str(Salaries)
## 'data.frame':    397 obs. of  6 variables:
##  $ rank         : Factor w/ 3 levels "AsstProf","AssocProf",..: 3 3 1 3 3 2 3 3 3 3 ...
##  $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
##  $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
##  $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
##  $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...

Looking at the structure of our data can help, but skimming is even more complete.

skim(Salaries)
## Skim summary statistics
##  n obs: 397 
##  n variables: 6 
## 
## ── Variable type:factor ───────────────────────────────────────────────────
##    variable missing complete   n n_unique
##  discipline       0      397 397        2
##        rank       0      397 397        3
##         sex       0      397 397        2
##                         top_counts ordered
##              B: 216, A: 181, NA: 0   FALSE
##  Pro: 266, Ass: 67, Ass: 64, NA: 0   FALSE
##           Mal: 358, Fem: 39, NA: 0   FALSE
## 
## ── Variable type:integer ──────────────────────────────────────────────────
##       variable missing complete   n      mean       sd    p0   p25    p50
##         salary       0      397 397 113706.46 30289.04 57800 91000 107300
##    yrs.service       0      397 397     17.61    13.01     0     7     16
##  yrs.since.phd       0      397 397     22.31    12.89     1    12     21
##     p75   p100     hist
##  134185 231545 ▃▇▇▅▃▂▁▁
##      27     60 ▇▆▅▅▂▂▁▁
##      32     56 ▆▇▇▇▆▅▂▁

Once we have an idea of what the data look like, we can make some plots.

Graphics libraries in R

There are three prominent graphics libraries in R:

We will likely use graphics from all three libraries, but I’ll try to focus on ggplot2 as much as possible.

Single-variable plots in ggplot2

For one quantitative variable, you might want to produce a histogram to show the distribution.

ggplot(data=Salaries) + geom_histogram(aes(x=salary))

You could also view a smoothed version of the distribution with a density plot

ggplot(data=Salaries) + geom_density(aes(x=salary))