\documentclass[10pt]{article}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{fancyhdr,url,hyperref}
\usepackage{graphicx,xspace}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows,decorations.pathmorphing,backgrounds,positioning,fit,through}
\oddsidemargin 0in %0.5in
\topmargin 0in
\leftmargin 0in
\rightmargin 0in
\textheight 9in
\textwidth 6in %6in
%\headheight 0in
%\headsep 0in
%\footskip 0.5in
\newtheorem{thm}{Theorem}
\newtheorem{cor}[thm]{Corollary}
\newtheorem{obs}{Observation}
\newtheorem{lemma}{Lemma}
\newtheorem{claim}{Claim}
\newtheorem{definition}{Definition}
\newtheorem{question}{Question}
\newtheorem{answer}{Answer}
\newtheorem{problem}{Problem}
\newtheorem{solution}{Solution}
\newtheorem{conjecture}{Conjecture}
\pagestyle{fancy}
\lhead{\textsc{Prof. McNamara}}
\chead{\textsc{SDS/MTH 291: Lecture notes}}
\lfoot{}
\cfoot{}
%\cfoot{\thepage}
\rfoot{}
\renewcommand{\headrulewidth}{0.2pt}
\renewcommand{\footrulewidth}{0.0pt}
\newcommand{\ans}{\vspace{0.25in}}
\newcommand{\R}{{\sf R}\xspace}
\newcommand{\cmd}[1]{\texttt{#1}}
\DeclareMathOperator{\Ex}{\mathbb{E}}
\DeclareMathOperator{\Var}{\text{Var}}
\DeclareMathOperator{\X}{\mathbf{X}}
\DeclareMathOperator{\Hatmat}{\mathbf{H}}
\rhead{\textsc{November 15, 2016}}
\begin{document}
\paragraph{Agenda}
\begin{enumerate}
\itemsep0em
\item Randomization warmup
\item ANOVA
\item Fisher's LSD
\end{enumerate}
<<>>=
library(knitr)
opts_chunk$set(size='footnotesize')
@
\paragraph{Randomization} With your neighbor, discuss how you could use randomization to determine if the $R^2$ value of a simple linear model was greater than 0. Write out at least three steps that would be required (perhaps these steps would need to be repeated, as well).
\vspace{1.5in}
\paragraph{ANOVA}
Recall that in the regression models we have considered thus far, we have always had at least one quantitative explanatory variable. How do we handle a model with \emph{only} a categorical explanatory variable? This technique is called ANOVA, for analysis of variance, but it is equivalent to regression with just a categorical variable.
An ANOVA model is \emph{phrased} differently than a regression model. It may also be helpful to think of ANOVA as a generalization of two-sample $t$-test.
Consider the usual ANOVA model
$$
y_{ij} = \mu_i + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma)
$$
for groups $i = 1,\ldots, I$ and individuals $j=1,\ldots,n_i$, with common standard deviation $\sigma$.
If we let $\mu = $ grand mean and $\alpha_i =$ the $i^{th}$ group effect, then we can write this for each of the $i=1,\ldots,I$ groups as
$$
y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma)
$$
Note that all we have done is to decompose the mean of the $i^{th}$ group ($\mu_i)$ into two parts: the grand mean ($\mu$), and the difference bewteen the mean of the $i^{th}$ group and the grand mean ($\alpha_i$).
If we move $\mu$ to the left side of the equation, we get
$$
y_{ij} - \mu = \alpha_i + \epsilon_{ij}\,,
$$
and now summing over $i$ and $j$ gives
$$
SST = SSG + SSE
$$
ANOVA gives us a way to test for the statistical significance of group means. The null hypothesis is $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_I = 0$ -- that is, all of the groups effects are in fact zero, and thus the group means are the same. The alternative hypothesis is that $H_A: \exists \,i$ s.t. $\alpha_i \neq 0$ -- that is, at least one of the group effects is not zero. We can't tell which one.
<>=
require(mosaic)
require(Stat2Data)
data(ThreeCars)
a1 = aov(Price ~ CarType, data=ThreeCars)
model.tables(a1)
@
These are the $\alpha_i$'s. The grand mean $\mu$ is:
<<>>=
mean(~Price, data=ThreeCars)
@
and the group means $\mu_i$ are:
<<>>=
mean(Price ~ CarType, data=ThreeCars)
@
\paragraph{Conditions for ANOVA}
The conditions are the same as for regression (minus Linearity), including \emph{equal variance among groups}.
% Note that factors are evaluated as a group via $F$-test
To actually assess the Equal Variance condition among groups:
\begin{itemize}
\item Check residuals vs. fitted plot for similar spread across groups
\item Check standard deviations among groups
\item Check if $sd_{max} / sd_{min} <= 2$
\end{itemize}
% \item Importance on Randomization
% \item Randomized Experiments allow causal inference
% \item Observation studies permit limited inferences about populations
\paragraph{Equivalence of ANOVA and Regression}
Recall that one-way ANOVA is just a rephrasing of regression with a quantitative response variable and a single categorical explanatory variable.
Let $X_i$ be the indicator (binary) variable corresponding to the $i^{th}$ group. Let $\mu_I$ be the overall mean of the $I^{th}$ group, and note that since every observation $y_{ij}$ has to be in some group, if it isn't in the any of the first $I-1$ groups, it has to be in group $I$. Call this the \emph{reference group}, and set $\mu_i = \mu_I + \beta_i \cdot X_i$.
Then the ANOVA model above is equivalent to:
$$
y_{ij} = \mu_I + \beta_i \cdot X_i + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma)
$$
for $i=1,\ldots,I-1$.
This is exactly what happens when you compute {\tt lm(y $\sim$ x)} in R, with {\tt x} being a categorical variable! The main difference is that now $\beta_i$ represents the size of the effect of being in group $i$, \emph{relative to group $I$}, whereas $\alpha_i$ represents the size of the effect of being in group $i$, \emph{relative to the grand mean}.
<<>>=
m1 = lm(Price ~ CarType, data=ThreeCars)
coef(m1)
@
The pooled standard deviation $s_p$, a weighted average of the standard deviations of the groups, is an estimate of $\sigma$, the unknown common standard deviation. This equal to the residual standard error.
Note that the values predicted by both models are exactly the same!
<<>>=
sum(predict(a1) - predict(m1))
@
% Note also that it is clear that
% \begin{eqnarray*}
% y_{ij} - \epsilon_{ij} &=& \mu_i \\
% &=& \mu + \alpha_i = \\
% \beta_0 + \beta_i &\Rightarrow& \mu + \alpha_i = \\
% \beta_0 + \beta_i &\nRightarrow& \mu =\\
% \beta_0, \alpha_i &=& \beta_i
% \end{eqnarray*}
So, there are three ways to express (and interpret!) the same mode:
\begin{eqnarray*}
y_{ij} &=& \mu_i + \epsilon_{ij} \\
y_{ij} &=& \mu + \alpha_i + \epsilon_{ij} \\
y_{ij} &=& \mu_I + \beta_i \cdot X_i + \epsilon_{ij}
\end{eqnarray*}
\paragraph{Write out the three variations on the model} For this example using car types to predict price, write out the three models.
\vspace{3in}
\paragraph{Multiple Comparisons}
Once we have performed ANOVA, we often know that \emph{at least one} of our groups has a significantly different mean. Then, we often want to know \emph{which one}. This can lead to the problem of multiple comparisons.
\begin{itemize}
\item Individual Error Rate (Type I error) vs. Family-wise error rate
\begin{itemize}
\item Individual Type I error: one specific false rejection of null hypothesis
\item Family-wise Type I error: at least one false rejection of null hypothesis
\end{itemize}
\item Even when the probability of Type I error is low, if you are making many comparisons, then the probability of a family-wise Type I error is \emph{much} higher
\item Recall the \href{https://xkcd.com/882/}{jelly beans comic} from xkcd
\item Corrections for \emph{multiple comparisons} include:
\begin{itemize}
\item Bonferroni: divide the $\alpha$-level by the number of comparisons
\item Fisher's LSD
\item Tukey's HSD
\end{itemize}
\item These differ only in the choice of the critical value
\end{itemize}
\paragraph{Corrections for Multiple Comparisons}
\begin{itemize}
\item The algorithm
\begin{enumerate}
\item Perform ANOVA
\item If not significant, stop
\item Compute the pairwise comparisons using the confidence interval
$$
\bar{y}_i - \bar{y}_j \pm cv \sqrt{MSE \left( \frac{1}{n_i} + \frac{1}{n_j} \right) }
$$
\end{enumerate}
\item Fisher's Least Significant Difference: $cv$ is $t^*$ chosen according to $\alpha$ and $n-K$ d.f. (Most liberal of methods)
\item Bonferroni: $cv$ is $t^*$ chosen according to $\alpha/m$ and $n-K$ d.f., where $m = \binom{K}{2}$ (Most conservative of methods)
\item Tukey's Honest Significant Difference: $cv$ is $=\frac{q}{\sqrt{2}}$, where $q$ depends on \emph{studentized range distribution} (A moderate method)
\end{itemize}
\end{document}