Homework 4.1
All materials can be found at alexcardazzi.github.io.
Question 1
For this question, you will have to put together two sets of data. First, you will use college/university statistics from 1995’s issue of US News and World Report (data, documentation). Second, you will use earnings data of students from a variety of colleges/universities from College Scorecard (data, documentation).
- Read in the US News and World Report data. Save it as a data frame called
usnwr
. (2 Points)
- Read in the College Scorecard data into a data frame called
scorecard
. (2 Points)
- Remove all observations from
scorecard
that do not come from 2007. Next, mergeearnings_med
fromscorecard
intousnwr
by institution. (4 Points)
- Generate a summary statistics table for median student earnings, whether the school is private, the acceptance rate, size of the student body, student/faculty ratio, and instructional expenditure per student. Note: you will need to modify/create some of these variables. (4 Points)
- Estimate a regression where median student earnings (log) are explained by whether the school is private, the acceptance rate, size of the student body (log), student/faculty ratio, and instructional expenditure per student (log). Display the coefficient estimates (using
summary()
is fine). (3 Points)
- Interpret the coefficients. (5 Points)
Question 2
Suppose I record data on how many hours my students spend studying and sleeping the week before an exam. I am curious about how sleep and study quantities impact grade outcomes, so I use the data to estimate the parameters in the following model:
\[\text{Grade} = \beta_0 + \beta_1\text{Sleep} + \beta_2\text{Study}\]
- Interpret each coefficient in words. (3 Points)
- Suppose I am interested in how time spent doing other things will impact exam grades. How will the regression be effected if I add another variable named “Other” defined as \(\text{Other} = 24 - \text{Sleep} - \text{Study}\)? (2 Points)
- What would you expect to happen to \(\beta_2\) if I added a new variable called \(\text{IQ}\) to the model above? (2 Points)
Question 3
The quality of wine is highly subjective. In any case, having a wine that is highly rated by experts will surely increase sales. Examine these data on red and white wines for this question.
Read both datasets into R. (2 Points)
Create summary statistics tables for
quality
,pH
, andresidual.sugar
for red and white wine separately. (4 Points)Create a new column in both datasets called
red
. For red wines, this should be equal to one and for white wine this should be equal to zero. (1 Point)Merge the two datasets and call the result
wine
. (2 Points)Plot the average quality by alcohol content1 for both wine types on the same set of axes. Include a legend, axis labels, etc. What does this plot suggest to you? (4 Points)
The
pH
variable measures the acidity of the wine. The pH scale runs from 0 to 14 with lower numbers being more acidic and 7 being neutral (e.g. water). ConvertpH
into a variable calledacidity
where higher numbers mean higher acidity. (2 Points)Estimate the following model for red and white wines separately, and present the parameters in a table. (2 Points)
\[Q_i = \beta_0 + \beta_1 \text{Alcohol}_i + \beta_2\text{Acidity}_i + \beta_3 \text{Sugar}_i + \epsilon_i\]
- Discuss the similarities and differences in the model outputs. How would you interpret the coefficients? Hint: what type of variable is quality? (6 Points)
Footnotes
For this question, round alcohol content to the nearest tenths place.↩︎