In the last part of the module, we estimated a correlation between GPAs and SAT scores so we would have a tool for guidance counselors to predict SAT scores given a student’s GPA. Or maybe we were hired by real estate agents to create a tool to predict a home’s sale price given some features of the property. Or maybe we want to estimate the impact a government program like SNAP has on employment outcomes for citizens. How can we use correlations to inform us?
Conditional Mean
Let’s suppose today is your first day as a guidance counselor and your first student comes into your office and says, “what do you think I’ll get on my SAT?” Unfortunately, you do not know anything about them, so the best guess you can come up with is… the average.
However, if the student told you that they have a below-average GPA, you would likely adjust your guess downwards. By adjusting your guess downwards, you are implicitly calculating a conditional mean. When we know nothing, the best guess we can come up with is the mean of the entire distribution. We talked about this in Module 2. If you gain some additional information (e.g., GPA) before making your guess, your guess can only improve. We can demonstrate this with a simple simulation.
Suppose you were playing a (not so fun) game with a friend where they picked a player at random from a dataset of basketball All-Stars’ heights (loaded in the code below) and asked you to guess the player’s height. What height should you guess? Spoiler: the average height. Let me try to convince you.
Let’s define the quality of a guess as the squared difference between the height of the randomly selected player (\(h_i\)) and the guessed height (\(G\)). Therefore, if the guess is spot on, the guess’s quality would be \((h_i - G)^2 = (0)^2 = 0\). If the guess is too high or too low, the resulting measure of quality would increase. In other words, this score is like golf: lower scores are better scores. We can then calculate the average guess quality by averaging \((h_i - G)^2\) across all heights \(h_1, h_2, ..., h_n\). Then, if we have two guesses \(G_1\) and \(G_2\), we could say one guess is better than the other if its average guess quality measure is smaller than the other’s.
In the following code, I am going to calculate the quality of a sequence of different guesses ranging from six inches below the sample average to six inches above the sample average. Then, I will plot the guess on \(x\)-axis and its corresponding quality on the \(y\)-axis. Remember: lower means a better guess!
Code
library("scales")bball <-read.csv("https://raw.githubusercontent.com/alexCardazzi/alexcardazzi.github.io/main/econ311/data/bball_allstars.csv")# Guesses range from 6 inches below the mean# and increase by a tenth of an inch until# 6 inches above the mean is reached.the_guesses <-seq(mean(bball$HEIGHT) -6,mean(bball$HEIGHT) +6,by =0.1)avg_sqr_diffs <-c()for(guess in the_guesses){ sqr_diff <- (bball$HEIGHT - guess)^2 ans <-mean(sqr_diff) avg_sqr_diffs[length(avg_sqr_diffs) +1] <-sqrt(ans)}plot(the_guesses, avg_sqr_diffs, las =1,xlab ="Guess for Height", pch =19,ylab ="Average Squared Error",col =alpha("dodgerblue", 0.33))abline(v =mean(bball$HEIGHT), col ="tomato")legend("bottomright", "Average Height", lty =1, col ="tomato", bty ="n")
Plot
What does this picture tell us? Perhaps unsurprisingly, it seems that the guess with the smallest error is simply the sample’s average. Now, what if your friend tweaked the game a bit and decided to only pull names from the WNBA portion of the data? Let’s see how the optimal guess might change.
Code
the_guesses <- seq(mean(bball$HEIGHT) - 6,
                   mean(bball$HEIGHT) + 6,
                   by = 0.1)

# Same as before, but only WNBA players' heights are used.
avg_sqr_diffs <- c()
for(guess in the_guesses){

  sqr_diff <- (bball$HEIGHT[bball$LEAGUE == "WNBA"] - guess)^2
  ans <- mean(sqr_diff)
  avg_sqr_diffs[length(avg_sqr_diffs) + 1] <- ans
}

plot(the_guesses, avg_sqr_diffs, las = 1,
     xlab = "Guess for Height", pch = 19,
     ylab = "Average Squared Error",
     col = alpha("dodgerblue", 0.33))
abline(v = mean(bball$HEIGHT), col = "tomato")
abline(v = mean(bball$HEIGHT[bball$LEAGUE == "WNBA"]),
       col = "mediumseagreen", lty = 2)
legend("bottomright", c("Average Height", "Average WNBA Height"),
       lty = 1:2, col = c("tomato", "mediumseagreen"), bty = "n")
Plot
Now, the best guess is the average of the WNBA players instead of the average of the entire sample. This is probably obvious to you, and you might even be wondering what the point of this is. The idea is that without any information, the most accurate we can be is a simple average. However, once we gain additional relevant information, we can revise our guess and increase accuracy.
Keep in mind that our goal here was to minimize the sum (mean) of squared errors. We will revisit this soon, but first, we are going to jump back to guessing SAT scores given GPAs.
Our goal as guidance counselors is to give students our best guess for what they’ll score on their SAT. In essence, we want to create a function that converts GPA (as an argument) into an SAT score. We want this function to have a few properties:
SAT score must be dependent on GPA.
Given an average GPA, the model should output an average SAT.
SAT score should increase as GPA increases or decrease as GPA decreases (since they’re positively correlated).
The model should be correct on average.
Like we did with the law of demand, we are going to guess a functional form for the relationship between GPA and SAT:

\[\text{SAT}_i = \beta_0 + \beta_1\text{GPA}_i\]
This functional form satisfies our first two conditions right off the bat. However, we still need to select values for \(\beta_0\) and \(\beta_1\) to satisfy the third and fourth conditions.
This equation of a line gives us the conditional expectation of SAT given GPA. In other words, a predicted SAT score for any given GPA. This is most often written as \(E[ \ \text{SAT} \ | \ \text{GPA} \ ]\). You might say, “that’s great, but what values should we use for \(\beta_0\) and \(\beta_1\)?” In the chunk below, experiment with a few different values.
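As a sketch of what such an experiment could look like, suppose the data sit in a data frame called gpa_sat with columns GPA and SAT (the name and structure here are stand-ins, not the course data itself). You can plot the points and overlay candidate lines with abline():

Code

# `gpa_sat` is an assumed data frame with columns GPA and SAT (hypothetical name)
plot(gpa_sat$GPA, gpa_sat$SAT, las = 1, pch = 19,
     xlab = "GPA", ylab = "SAT",
     col = alpha("dodgerblue", 0.33))

# Try different intercept (a) and slope (b) values:
abline(a = 60, b = 10, col = "tomato")
abline(a = 80, b = 5, col = "mediumseagreen", lty = 2)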
Unfortunately, no matter what values we choose for \(\beta_0\) and \(\beta_1\), there’s never going to be a single line that runs through every one of the points. To fix this, let’s write the equation as follows:
\[\text{SAT}_i = \beta_0 + \beta_1\text{GPA}_i + \epsilon_i\] where \(\epsilon_i\) (epsilon) is our error term. This makes up the difference between each point and the line we draw through the points.
Line of Best Fit
Ideally, we wouldn’t need \(\epsilon\) at all and our model would perfectly fit the data. However, since we do need \(\epsilon\), we want to minimize it. Specifically, like we did when guessing heights of basketball players, we are going to minimize the sum of squared errors, or \(\sum \epsilon_i^2\). Note that we can calculate the errors associated with choosing \(\beta_0\) and \(\beta_1\) as follows: \(\epsilon_i = \text{SAT}_i - \beta_0 - \beta_1 \text{GPA}_i\).
Since the errors will be minimized, the line given by \(\beta_0\) and \(\beta_1\) will be colloquially known as the “line of best fit”.
Ordinary Least Squares
The algorithm we are going to use to solve for \(\beta_0\) and \(\beta_1\) is called Ordinary Least Squares, or OLS. To begin, let’s start with writing an expression for the sum of squared errors.
\[S = \sum \epsilon_i^2 = \sum (Y_i - \beta_0 - \beta_1X_i)^2 \] When we want to minimize an expression, we take the derivative and set it equal to zero. Since \(\beta_0\) and \(\beta_1\) are unknown, these are what we need to take the derivative with respect to. First, we will focus on \(\beta_0\).
Setting each derivative equal to zero gives two first-order conditions. For \(\beta_0\):

\[\frac{\partial S}{\partial \beta_0} = -2\sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \implies \beta_0 = \bar{Y} - \beta_1\bar{X}\]

In words, the intercept shifts the line so that it passes through the point of means \((\bar{X}, \bar{Y})\). For \(\beta_1\):

\[\frac{\partial S}{\partial \beta_1} = -2\sum X_i(Y_i - \beta_0 - \beta_1 X_i) = 0 \implies \beta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}\]

In words, the slope is the covariance between \(X\) and \(Y\) divided by the variance of \(X\).
Now, let’s use some of this math to figure out \(\beta_0\) and \(\beta_1\) for our GPA and SAT example:
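As a sketch, these formulas translate directly into R for the hypothetical gpa_sat data frame from before, and R’s built-in lm() function produces identical estimates:

Code

# Slope: covariance over variance; intercept: line passes through the means
b1 <- cov(gpa_sat$GPA, gpa_sat$SAT) / var(gpa_sat$GPA)
b0 <- mean(gpa_sat$SAT) - b1 * mean(gpa_sat$GPA)
c(b0, b1)

# R's built-in OLS routine returns the same values
model <- lm(SAT ~ GPA, data = gpa_sat)
coef(model)

Plugging in the coefficient values reported below, our estimated line is:

\[E[ \ \text{SAT} \ | \ \text{GPA} \ ] = 66.99 + 11.36 \times \text{GPA}\]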
This expression gives us the expectation of a student’s SAT score given the student’s GPA. We know for sure that this minimizes the sum of squared errors (because of all of that math), but is the model correct on average? To test it, we can calculate the average error. Being correct on average would yield an average error of zero.
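In the gpa_sat sketch, this check is one line:

Code

# The residuals of an OLS fit always average to (numerically) zero
mean(resid(model))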
What does the model look like relative to the data? Using the chunk below, try entering in some of your favorite lines from before.
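Continuing the sketch, plot the data, the OLS line, and any line you liked from before:

Code

plot(gpa_sat$GPA, gpa_sat$SAT, las = 1, pch = 19,
     xlab = "GPA", ylab = "SAT",
     col = alpha("dodgerblue", 0.33))
abline(model, col = "tomato")                            # the OLS line
abline(a = 60, b = 10, col = "mediumseagreen", lty = 2)  # your favorite line here
legend("bottomright", c("OLS", "Your guess"),
       lty = 1:2, col = c("tomato", "mediumseagreen"), bty = "n")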
What do you think? How does OLS compare to your preferred line?
In terms of interpretation, what does this equation mean in words?
\(\beta_1\) (11.36): If we increase GPA by one unit (e.g., from 2.2 to 3.2), we expect the sum of our SAT percentiles to increase by 11.36. In more abstract terms, “a one unit increase in \(X\) will change \(Y\) by \(\beta_1\).”
\(\beta_0\) (66.99): If someone has a GPA of 0, we expect their sum of SAT percentiles to be equal to 66.99. In more abstract terms, “the value of \(Y\) when \(X=0\).”
As long as “a one unit change” makes sense for \(X\), \(\beta_1\) will be interpretable. To bring this back to an earlier module, the variables must be either interval or ratio measures.
For \(\beta_0\) to be interpretable, \(X\) must be able to take on a reasonable value of zero. Since a GPA of zero is nearly impossible, \(\beta_0\) does not have much of an interpretation in this case.
Statistical Inference
Just like how we tested if our sample mean was statistically different from some hypothesized value in the last module, we can test if our regression coefficients are different from hypothesized values as well.
In the notes, we use zero, but the hypothesized value could have been anything.
Focusing on \(\beta_1\), and skipping the math, we can write the standard error of \(\beta_1\) as follows:

\[se_{\beta_1} = \sqrt{\frac{\sum \hat{\epsilon}_i^2 \ / \ (n-2)}{\sum (X_i - \bar{X})^2}}\]
As a reminder, we need the standard error to calculate a confidence interval and/or perform a hypothesis test. Also, you can think of a standard error as a measure of how precisely we estimated \(\beta\).
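As a sketch, here is that formula computed by hand for the hypothetical gpa_sat example, compared against R’s own output:

Code

# Estimated error variance: sum of squared residuals over (n - 2)
n <- nobs(model)
sigma2_hat <- sum(resid(model)^2) / (n - 2)

# Standard error of the slope
se_b1 <- sqrt(sigma2_hat / sum((gpa_sat$GPA - mean(gpa_sat$GPA))^2))
se_b1

# Matches the "Std. Error" column for GPA
summary(model)$coefficients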
To test if your estimate of \(\beta_i\) is different from some hypothesized number, we first need to generate a test statistic:
\[t = \frac{\beta_i - \#}{se_{\beta_i}}\]
Generally, people choose 0 for \(\#\), since a coefficient of 0 would imply that the regressor (e.g., GPA) has no relationship with the outcome variable (e.g., SAT score). This way, the hypothesis test is set up such that it tests whether \(X\) is related to \(Y\).
Once we calculate this test statistic, we can calculate a \(p\)-value and decide whether to reject the null hypothesis that our regressor is not related to the outcome variable.
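Putting the pieces together in the running sketch, testing the slope against a hypothesized value of zero:

Code

# Uses `model`, `n`, and `se_b1` from the earlier sketches
b1 <- coef(model)["GPA"]
t_stat <- (b1 - 0) / se_b1                    # test statistic with # = 0
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value
c(t_stat, p_value)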
As a final note about inference (i.e., hypothesis tests), we need to make two important assumptions about our errors (\(\hat{\epsilon}_i\)) in order for the above t-statistic to be valid.
Normality: we assume that our errors are independent, random draws from a normal distribution.
Homoskedasticity: this big word means that the normal distribution we draw our errors from has a constant variance across all points. See below for examples of homoskedastic errors and heteroskedastic errors.
Plot
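If you want to generate pictures like these yourself, here is a minimal simulation; the seed, sample size, and coefficients are arbitrary choices for illustration:

Code

set.seed(311)  # arbitrary seed
x <- runif(200, 0, 10)
y_homo <- 2 + 3 * x + rnorm(200, sd = 2)          # constant error variance
y_hetero <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)  # error variance grows with x

par(mfrow = c(1, 2))
plot(x, y_homo, las = 1, pch = 19, main = "Homoskedastic",
     col = alpha("dodgerblue", 0.33))
plot(x, y_hetero, las = 1, pch = 19, main = "Heteroskedastic",
     col = alpha("dodgerblue", 0.33))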
Goodness of Fit
Once we estimate \(\beta_0\) and \(\beta_1\), we can also begin to talk about how well the model “fits” the data. In other words, what fraction of \(Y\)’s variance is explained by \(X\).
Before getting too much further with this, I want to be very clear that goodness-of-fit measures do not determine whether your model is good. You may be able to explain a lot of \(Y\)’s variation, but still have a bad model. In addition, your model might have poor overall explanatory power, but still do a good job measuring the effect of \(X\) on \(Y\), which is what we care about. We will discuss this more as the course progresses.
Below are the three equations for the Total Sum of Squares (SST), the Explained Sum of Squares (SSE), and the Residual Sum of Squares (SSR), where \(\hat{Y}_i\) is the model’s prediction for observation \(i\):

\[SST = \sum (Y_i - \bar{Y})^2\]

\[SSE = \sum (\hat{Y}_i - \bar{Y})^2\]

\[SSR = \sum (Y_i - \hat{Y}_i)^2 = \sum \hat{\epsilon}_i^2\]

SST measures the total variation in \(Y\) around its mean, SSE measures the variation captured by the model’s predictions, and SSR measures the variation left over in the residuals.
Once we have defined these three measures, we can connect all three by the following equation:
\[SST = SSE + SSR\]
If our model is very good at explaining the variation in \(Y\), we would expect SSE to be very large relative to SSR. For a perfect model, we would see \(SST = SSE\). Therefore, we can create a ratio between these two numbers, which will give us a percentage of the variation that is explained by the model. We call this ratio \(R^2\), and define it as follows:
\[R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]
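In the running gpa_sat sketch, we can verify both the identity and the \(R^2\) formula directly:

Code

y_bar <- mean(gpa_sat$SAT)
SST <- sum((gpa_sat$SAT - y_bar)^2)
SSE <- sum((fitted(model) - y_bar)^2)
SSR <- sum(resid(model)^2)

all.equal(SST, SSE + SSR)  # TRUE
SSE / SST                  # equals summary(model)$r.squared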
Below, I demonstrate what I mean when I say \(R^2\) should not be a be-all and end-all measure of how good a model is. I simulated a dataset of 100 \(X\) values. Then, I created two \(Y\) variables: \(Y_1 = 5X + \epsilon\) and \(Y_2 = 5X + 2.5\epsilon\). Both \(Y\) variables have the same relationship with \(X\), but the random errors are magnified in the second. The estimated coefficients are very similar, but the \(R^2\) measures are quite different. Again, we care more about the coefficients being estimated accurately than about the model itself having explanatory power.
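A sketch of that simulation (the seed here is an arbitrary choice, so the exact numbers will differ from run to run):

Code

set.seed(311)  # arbitrary seed
x <- rnorm(100)
e <- rnorm(100)
y1 <- 5 * x + e        # small errors
y2 <- 5 * x + 2.5 * e  # same relationship, magnified errors

coef(lm(y1 ~ x)); summary(lm(y1 ~ x))$r.squared  # slope near 5, high R^2
coef(lm(y2 ~ x)); summary(lm(y2 ~ x))$r.squared  # similar slope, much lower R^2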