\documentclass{article}
\setlength{\paperwidth}{8.5in}
\setlength{\textwidth}{7in}
\setlength{\hoffset}{0in}
\setlength{\oddsidemargin}{0in}
\begin{document}
\title{Statistics 101A \\ Fall 2013}
\maketitle
<<>>=
options(width = 60)
@
\begin{enumerate}
\item Download the following data files to your hard drive, and then load them into R. The two data files and the description of the data set are posted on the CCLE course web site (see week one). One file is tab-delimited, and the other is comma-delimited. Use LACER.csv.
\begin{enumerate}
\item Using R, find out how many variables are in the data file and how many observations.
<<>>=
d1 = read.csv("Lacer.csv",header=T)
dim(d1)
@
\item Make a histogram of the standardized math scores in 2003 and describe the distribution. Find appropriate summary statistics.
<<>>=
hist(d1$stmath02)
summary(d1$stmath02)
@
\item Find the summary statistics for gain in math scores by school. Then draw side-by-side boxplots of this variable by school. Based on these data, do you think the average gain in math is significantly different across the four schools? Explain. Would it be a good idea to conduct a two-sample t-test multiple times to find out whether there is any statistically significant difference between the average gains in math scores in the four schools? Yes or no? Explain why.
<<>>=
summary(d1$gainread)
tapply(d1$gainread,d1$school, summary)
boxplot(d1$gainread~d1$school,main="Reading Gains by School Number")
@
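To see why repeating two-sample t-tests is risky, a rough calculation helps. The sketch below assumes each test is run at $\alpha = 0.05$ and, for simplicity, that the tests are independent. They are not exactly independent here, because the pairs share schools, so the figure is only an approximation.
\begin{eqnarray*}
\mbox{number of pairwise tests} &=& {4 \choose 2} = 6 \\
P(\mbox{at least one false rejection}) &=& 1-(1-0.05)^{6} \approx 0.26
\end{eqnarray*}
A single overall test, such as a one-way analysis of variance, is one way to keep the overall error rate near the nominal 5\%, provided its assumptions hold.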
\item Draw the appropriate plot to compare the standardized 2002 math scores in the LACER and control groups. Given that we want to find out whether LACER was effective in raising standardized math scores from 2002 to 2003, what is the use of this plot?
<<>>=
# Compare the 2002 (baseline) standardized math scores across the two groups
boxplot(d1$stmath02~d1$group)
@
\end{enumerate}
\item Given the following information about abortion: Kaiser Family Foundation Health Tracking Poll, Aug. 10--15, 2011. $N=1{,}201$ adults nationwide. Margin of error $\pm 3$ percentage points.

``In general, do you support or oppose the new federal requirement that private health insurance plans cover the full cost of birth control and other preventive services for their female patients?''

Support: 66\%; Oppose: 24\%; Unsure: 10\%.
\begin{enumerate}
\item Given the above information, run 500 simulations, construct a histogram of the sample proportions, calculate a confidence interval based on your findings, and interpret it.
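<<>>=
# 0/1 population to sample from
pop = c(rep(1,12),rep(0,11)) # FITB in class
p = sum(pop)/length(pop)     # population proportion of 1s
samples = numeric(500)       # one sample proportion per simulation
for (i in 1:500) {
  samp = sample(pop, 800, replace=TRUE) # draw 800 values with replacement
  samples[i] = sum(samp)/800            # proportion of 1s in this sample
}
hist(samples)
mean(samples)
sd(samples)
@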
The 95\% confidence interval is (\Sexpr{round(mean(samples) - 1.96*sd(samples), 3)}, \Sexpr{round(mean(samples) + 1.96*sd(samples), 3)}).
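
As a cross-check, the simulation can also be sketched with the poll's own figures. The chunk below is only an illustration: it assumes the true proportion equals the reported 66\% and that each simulated poll has 1,201 respondents. The names \verb#n_poll#, \verb#p_poll#, and \verb#sim_props# are introduced here and are not part of the original solution.
<<>>=
set.seed(101)   # arbitrary seed so the run is reproducible
n_poll = 1201   # respondents per simulated poll (taken from the poll)
p_poll = 0.66   # assumed true proportion (the poll's estimate)
# rbinom() gives the number of supporters in each of 500 simulated polls
sim_props = rbinom(500, size = n_poll, prob = p_poll) / n_poll
hist(sim_props)
# Middle 95% of the simulated proportions (a percentile interval)
quantile(sim_props, c(0.025, 0.975))
@
The percentile interval uses the simulated distribution directly, so it does not rely on the 1.96 multiplier.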
\item Calculate the 95\% confidence interval using the formula for a 95\% confidence interval for a proportion, and compare the result with part (a).
\begin{eqnarray*}
SE(P)&=&\sqrt{\frac{P\times(1-P)}{N}} \\
\mbox{95\% CI} &=&\hat{p}\pm 1.96\times SE =
\end{eqnarray*}
<<>>=
c(mean(samples)-1.96*sd(samples), mean(samples)+1.96*sd(samples))
se = sqrt(p*(1-p)/800)
c(p-1.96*se, p+1.96*se)
prop.test(66, 100)
@
The mathematical results are very close to the simulation results found in part a.
\end{enumerate}
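
For reference, plugging the poll's own figures into the formula gives the following worked calculation. This is a sketch that takes $\hat{p}=0.66$ and $N=1201$ from the poll description, rather than the values used in the R chunks for this problem.
\begin{eqnarray*}
SE &=& \sqrt{\frac{0.66\times 0.34}{1201}} \approx 0.0137 \\
\mbox{95\% CI} &=& 0.66 \pm 1.96\times 0.0137 \approx (0.633,\ 0.687)
\end{eqnarray*}
The half-width of about 0.027 agrees with the reported margin of error of roughly 3 percentage points.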
\item Use the following URL and download the pima data set into R.
\verb#http://archive.ics.uci.edu/ml/datasets#
\\ \verb#http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes#
<<>>=
library(faraway)
data(pima)
@
\begin{enumerate}
\item Use the \verb#summary# command to compute the five-number summary for the variable insulin. Then use the \verb#mean# command to calculate the mean of the variable insulin. What happens when you try to find the average of the variable insulin? What causes this? You can get it to work by typing \verb#mean(pima$insulin, na.rm=T)#. Why do you think R forces you to do something extra if there are missing values?
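<<>>=
str(pima)
summary(pima$insulin)
mean(pima$insulin)
@
Zeros in this variable are implausible and most likely stand for missing measurements, so set them to \texttt{NA}:
<<>>=
pima$insulin[pima$insulin == 0] = NA
mean(pima$insulin)             # returns NA because of the missing values
mean(pima$insulin, na.rm=TRUE) # drops the NAs before averaging
@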
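The behaviour of missing values is easiest to see on a tiny vector. The chunk below is a stand-alone illustration; the vector \verb#x# is made up and is not part of the pima data.
<<>>=
x = c(10, 20, NA, 40)   # hypothetical values with one missing entry
mean(x)                 # NA: the mean of an unknown value is unknown
sum(is.na(x))           # count how many values are missing
mean(x, na.rm = TRUE)   # average of the three observed values
@
Returning \texttt{NA} by default makes the missing data visible instead of silently averaging over whatever happens to be observed.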
\item Assume the pima data are a random sample of all female patients at least 21 years old of Pima Indian heritage. Use R to find a 95\% confidence interval for the mean insulin level. Interpret this interval within context. Also find a 90\% confidence interval.
<<>>=
t.test(pima$insulin, conf.level=0.95)
@
We are 95\% confident that the average insulin level of female patients at least 21 years old of Pima Indian heritage is between \Sexpr{round(t.test(pima$insulin, conf.level=0.95)$conf.int[1], 1)} and \Sexpr{round(t.test(pima$insulin, conf.level=0.95)$conf.int[2], 1)}.
<<>>=
t.test(pima$insulin, conf.level=0.90)
@
\end{enumerate}
\item In a random sample of 30, the mean income was \$34,500 and the standard deviation was \$30,000.
\begin{enumerate}
\item Find the 95\% confidence interval for the mean income in the population and interpret the results.
\item What assumptions must you make when computing the confidence interval?
\item Would it be reasonable to make the following claim: ``The probability that the mean income in the population ($\mu$) lies between the lower and the upper bound of the confidence interval calculated above is 95\%''? Explain.
\item Test the null hypothesis that the mean income of the population in this country is equal to \$35,000 (two-tailed test).
\item What is the long run probability of incorrectly rejecting the null hypothesis?
\end{enumerate}
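
The calculations in this problem can be carried out in R directly from the summary statistics. The chunk below is a sketch that assumes a one-sample $t$ procedure is appropriate; the object names are chosen here for illustration.
<<>>=
n = 30; xbar = 34500; s = 30000     # summary statistics given in the problem
mu0 = 35000                         # hypothesized mean from part (d)
se = s / sqrt(n)                    # standard error of the sample mean
t_crit = qt(0.975, df = n - 1)      # critical value for a 95% interval
ci = xbar + c(-1, 1) * t_crit * se  # 95% confidence interval for mu
t_stat = (xbar - mu0) / se          # test statistic for H0: mu = 35000
p_value = 2 * pt(-abs(t_stat), df = n - 1)  # two-tailed p-value
ci
c(t = t_stat, p = p_value)
@
Because the hypothesized value of \$35,000 lies inside the interval, the two-tailed test at the 5\% level and the interval lead to the same decision.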
\item Given the following printout, answer questions a through e.
<<>>=
d3 = read.csv("hsb2.csv", header=TRUE)
t.test(d3$math ~ d3$gender)
@
\begin{enumerate}
\item State the null and the alternative hypotheses in symbols and words.
\begin{eqnarray*}
H_0 &:& \mu_1 = \mu_2 \mbox{ or } \mu_1-\mu_2=0 \\
H_A&:& \mu_1-\mu_2\neq 0
\end{eqnarray*}
\item What is your conclusion regarding the null?
\item Write your conclusion within context.
\item Interpret the 95\% confidence interval.
\item Find the relevant descriptives in R and show how the t-value and confidence interval were calculated.
\end{enumerate}
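
For the last part, the quantities reported by \verb#t.test# can be rebuilt from the group means, variances, and sizes. The helper below is a sketch, assuming the default Welch (unequal-variance) test that \verb#t.test# uses when \verb#var.equal# is not set. The name \verb#welch_by_hand# is introduced here, and the small data set is made up so that the chunk runs on its own.
<<>>=
welch_by_hand = function(y, g, conf.level = 0.95) {
  m = tapply(y, g, mean)     # group means
  v = tapply(y, g, var)      # group variances
  n = tapply(y, g, length)   # group sizes
  se = sqrt(sum(v / n))      # standard error of the difference in means
  t_stat = unname((m[1] - m[2]) / se)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df = se^4 / sum((v / n)^2 / (n - 1))
  half = qt(1 - (1 - conf.level) / 2, df) * se
  list(t = t_stat, df = df, ci = unname(m[1] - m[2]) + c(-1, 1) * half)
}
# Hypothetical scores for two groups, only to check against t.test()
y = c(52, 61, 47, 58, 66, 49, 55, 63, 71, 44)
g = rep(c("a", "b"), each = 5)
welch_by_hand(y, g)
t.test(y ~ g)
@
With the data from this problem, \verb#welch_by_hand(d3$math, d3$gender)# should reproduce the $t$ value, degrees of freedom, and interval in the printout, provided the two variables have no missing values.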
\item Read problem three under section 2.8 on page 39. What is the linear model for predicting processing time from the number of bills? The data set is called invoices.txt and you can find it on the book website. Do not do the exercise. Use R to run the relevant analysis and write the linear model in terms of slope and intercept reported by R.
<<>>=
invoices = read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/invoices.txt",header=T,sep='\t')
m1 = lm(Time~Invoices, data=invoices)
summary(m1)
@
\begin{eqnarray*}
y=\Sexpr{round(m1$coefficients[2], 4)}x+\Sexpr{round(m1$coefficients[1], 4)}
\end{eqnarray*}
where $y$ is the processing time and $x$ is the number of invoices.
\item Download the newspaper data set discussed in Chapter 1 (circulation.txt) and load the data into R. Using R, answer these questions:
\begin{enumerate}
\item How many observations? How many variables?
<<>>=
circ = read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/circulation.txt",header=T,sep="\t")
str(circ)
@
\item Plot weekday circulation vs. Sunday circulation, marking each point by the dummy variable (tabloid = 1, no tabloid = 0).
<<>>=
plot(circ$Weekday,circ$Sunday, pch=as.character(circ$Tabloid.with.a.Serious.Competitor))
@
\item Plot log(weekday circulation) vs. log(Sunday circulation), marking each point by the dummy variable (tabloid vs. no tabloid).
<<>>=
plot(log(circ$Weekday),log(circ$Sunday), pch=as.character(circ$Tabloid.with.a.Serious.Competitor))
@
\item Comment on the association between the two variables and the difference between the two scatterplots.
\end{enumerate}
\end{enumerate}
\end{document}