```{r, message=FALSE, warning=FALSE, echo=FALSE}
require(knitr)
opts_chunk$set(eval=FALSE)
```
```{r, message=FALSE}
library(tidyverse)
library(Stat2Data)
library(skimr)
```
To this point in the course we have learned how to build models for quantitative response variables as functions of quantiative explanatory variables, and binary categorical variables. Today we are going to learn how to build models for quantitative response variables as a function of a single categorical variable. These techinques are known (somewhat confusingly) as ANOVA.
Let's follow the example from the book about fruit flies. Here, researchers have conducted an experiment to measure the `Longevity` of males fruit flies (in days). 25 male fruit flies were assigned to live in five different environments (`Treatment`). Our goal is to understand how the average lifespan is related to the particular treatment. To do this we will use an ANOVA model.
```{r}
data(FruitFlies)
head(FruitFlies)
```
Before we do anything else, we want to pay attention to the treatment levels. The `Treatment` variable has the type **factor** in R. There are five different kinds of treatments:
```{r}
FruitFlies %>%
count(Treatment)
```
In this case, the factor level *none* corresponds to the control group, so we want to tell R to use that as the reference level, since by default, it uses the first label in alphabetical order.
```{r}
# Set the reference level
FruitFlies <- FruitFlies %>%
mutate(Treatment = relevel(Treatment, ref="none"))
```
You can check the `count()` again to see what has happened to the data.
Next, let's do some basic exploratory data analysis. We can examine the means and standard deviations among the different treatments graphically
```{r, fig.width=10 }
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity))
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity)) + facet_wrap(~Treatment)
ggplot(data=FruitFlies) + geom_boxplot(aes(y=Longevity, x=1))
ggplot(data=FruitFlies) + geom_boxplot(aes(y=Longevity, x=Treatment))
```
and in a table:
```{r}
FruitFlies %>%
group_by(Treatment) %>%
skim(Longevity)
```
##### ANOVA Models
The simplest model we could build for `Longevity` would have no variable at all -- just the average value of the response variable. This model is sometimes called the **grand mean** model, since the only term in the model is the mean of the response variable.
```{r}
fm_null <- lm(Longevity ~ 1, data=FruitFlies)
summary(fm_null)
FruitFlies %>%
skim(Longevity)
fitted <- fitted.values(fm_null)
```
1. Why is there no $R^2$ reported for this model?
2. What does this model look like?
One way to visualize it is with a dot plot, and add the "regression line."
```{r, fig.width=10 }
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity)) + geom_vline(xintercept = coef(fm_null), color="red")
```
Of course we'd like to construct a better model by having separate estimates for each of the different groups. But we want to understand the differences between groups in the context of the grand mean. To do this we build a model of the form
$$
Y = \mu + \alpha_1 + \cdots + \alpha_K + \epsilon,
$$
for $k=1,...,K$. As usual, we require that $\epsilon \sim N(0, \sigma_\epsilon)$.
This is straightfoward in R:
```{r}
fm_aov <- aov(Longevity ~ Treatment, data=FruitFlies)
summary(fm_aov)
model.tables(fm_aov)
```
Note that this model is equivalent, up to some algebra, of the one produced by `lm`, which has the form:
$$
Y = \beta_0 + \beta_1 + \cdots + \beta_{K-1} + \epsilon',
$$
for $k=1,...,K-1$. As usual, we require that $\epsilon' \sim N(0, \sigma_\epsilon')$. Note that $\beta_0 = \mu + \alpha_1$, and $\beta_{k} = \alpha_{k+1} - \alpha_1$, for $k = 1,\ldots,K-1$.
```{r}
fm_ref <- lm(Longevity ~ Treatment, data=FruitFlies)
summary(fm_ref)
anova(fm_ref)
FruitFlies <- FruitFlies %>%
mutate(grandmean = mean(Longevity)) %>%
group_by(Treatment) %>%
mutate(groupmean = mean(Longevity), effect = groupmean - grandmean)
FruitFlies
effects <- FruitFlies %>%
group_by(Treatment) %>%
summarize(effect = mean(effect))
```
1. What are the algebraic steps that have to take place?
2. What is the sum of the group effects? Will this always be true?
3. How many "regression lines" will our model produce?
We can plot our "model" by adding lines to each panel.
```{r}
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity)) + geom_vline(xintercept = effects$effect, color="red") + facet_wrap(~Treatment)
```
##### Checking conditions
We still need to check whether the conditions for ANOVA were met.
```{r, fig.width=10 }
plot(fm_aov)
FruitFlies %>%
group_by(Treatment) %>%
skim(Longevity)
```
1. Repeat this procedure for the **Hawks** data set. Build an ANOVA model for `Weight` as a function of `Species`.
2. Find the group effects, and produce a plot.
3. Calculate the sum of the groups effects. Calculate the *weighted* sum of the group effects.
```{r}
data(Hawks)
```