Module 3.3: OLS in R
Old Dominion University
In the last set of notes, we derived the OLS estimator, calculated statistics for hypothesis tests, and constructed measures of goodness of fit. In this set of notes, we are going to focus on application.
First, to estimate a linear regression model, we will use the lm() function. lm(), which stands for "linear model", accepts the following arguments:

- formula: Formulas in R take the following form: y ~ x1 + x2. In the case of GPAs and SAT scores, we would write: sat_sum ~ hs_gpa.
- data: This is the data you plan to use. It tells R where to look for the variables in your formula.
- subset: An optional vector that specifies a subset of observations used to fit the line. For example, maybe we want to fit two lines: one for men and one for women. We would use subset = df$sex == 1 to estimate the model using only data from female students. Another reason to use this argument would be to eliminate outliers.

We are going to practice on two sets of data. First, we are going to keep examining the GPA data (data; documentation). Second, we are going to explore some data on real estate transactions in Ames, Iowa (data; documentation). Let’s read both sets of data into R:
There are a lot of columns in the ames data. We are only interested in a few (for now), so I am going to subset the data to keep only the columns we’ll use.
Let’s estimate the SAT and GPA model from the previous section of notes.
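A minimal sketch of the estimation call (assuming the GPA data are stored in df, and using the object name reg_sat_gpa that appears later in these notes):

```r
# Estimate the simple regression of SAT scores on high school GPA
reg_sat_gpa <- lm(sat_sum ~ hs_gpa, data = df)

# Printing the object shows the call and the estimated coefficients
reg_sat_gpa
```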
Call:
lm(formula = sat_sum ~ hs_gpa, data = df)
Coefficients:
(Intercept) hs_gpa
66.99 11.36
The output of lm() includes the call you used to generate it in addition to the estimated coefficients. These coefficients are exactly what we previously estimated with cov(), var(), cor(), etc. Not only does lm() output the coefficients, but it also returns both the fitted values (reg_sat_gpa$fitted.values) and the residuals/errors (reg_sat_gpa$residuals).
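As a sketch, one way to line up the observed values, fitted values, and residuals (column names chosen to match the output below) is:

```r
# Compare observed SAT scores with fitted values and residuals
head(data.frame(observed_y = df$sat_sum,
                fitted_y   = reg_sat_gpa$fitted.values,
                residual_y = reg_sat_gpa$residuals))
```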
observed_y fitted_y residual_y
1 127 105.6232 21.376776
2 122 112.4411 9.558875
3 116 109.6003 6.399667
4 95 109.6003 -14.600333
5 107 112.4411 -5.441125
6 111 112.4411 -1.441125
We can plot our regression line using the abline() function.
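For example, a base-R sketch (the axis labels are my own choice):

```r
# Scatterplot of the raw data with the fitted regression line overlaid
plot(df$hs_gpa, df$sat_sum, xlab = "High School GPA", ylab = "SAT Score")
abline(reg_sat_gpa, col = "red", lwd = 2)
```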
Using the summary() function generates some important information about our regression and the individual coefficients.
Call:
lm(formula = sat_sum ~ hs_gpa, data = df)
Residuals:
Min 1Q Median 3Q Max
-42.055 -8.669 -0.351 8.803 34.240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.9885 2.4441 27.41 <2e-16 ***
hs_gpa 11.3632 0.7535 15.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.9 on 998 degrees of freedom
Multiple R-squared: 0.1856, Adjusted R-squared: 0.1848
F-statistic: 227.4 on 1 and 998 DF, p-value: < 2.2e-16
In the table portion, each coefficient has an Estimate, Std. Error, t value, and Pr(>|t|). In addition, you can find the Multiple R-squared value, which is the measure of goodness of fit.
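Here is a sketch of how separate male and female regressions can be estimated with the subset argument (the output below confirms that sex == 2 selects male students and sex == 1 selects female students):

```r
# Fit separate regression lines for male and female students
reg_sat_gpa_m <- lm(sat_sum ~ hs_gpa, data = df, subset = df$sex == 2)
reg_sat_gpa_f <- lm(sat_sum ~ hs_gpa, data = df, subset = df$sex == 1)

summary(reg_sat_gpa_m)
summary(reg_sat_gpa_f)
```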
Regression for Male Students:
Call:
lm(formula = sat_sum ~ hs_gpa, data = df, subset = df$sex ==
2)
Residuals:
Min 1Q Median 3Q Max
-31.997 -8.305 -0.455 8.574 33.287
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.851 3.579 15.61 <2e-16 ***
hs_gpa 13.715 1.079 12.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.33 on 482 degrees of freedom
Multiple R-squared: 0.2512, Adjusted R-squared: 0.2496
F-statistic: 161.7 on 1 and 482 DF, p-value: < 2.2e-16
Regression for Female Students:
Call:
lm(formula = sat_sum ~ hs_gpa, data = df, subset = df$sex ==
1)
Residuals:
Min 1Q Median 3Q Max
-45.497 -8.323 0.153 8.877 31.153
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.280 3.183 22.39 <2e-16 ***
hs_gpa 11.019 1.003 10.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.55 on 514 degrees of freedom
Multiple R-squared: 0.1901, Adjusted R-squared: 0.1885
F-statistic: 120.6 on 1 and 514 DF, p-value: < 2.2e-16
Take a look at the Pr(>|t|) column. This contains the p-value for the two-sided test of the null hypothesis that \(\beta_i = 0\) against the alternative that \(\beta_i \neq 0\). The value <2e-16 means that the p-value is smaller than 0.0000000000000002; for brevity, R does not report p-values smaller than this. In addition, R includes asterisks next to p-values that are statistically significant:
- . next to the p-value indicates \(0.05 < p \leq 0.1\)
- * next to the p-value indicates \(0.01 < p \leq 0.05\)
- ** next to the p-value indicates \(0.001 < p \leq 0.01\)
- *** next to the p-value indicates \(0 < p \leq 0.001\)

This is helpful when looking to see if your coefficients are significant at different levels.
How can we interpret these results? Let’s start with \(\beta_0\). For a male student with a GPA of 0, the expected SAT score is 55.85. For a female student with a GPA of 0, the expected SAT score is 71.28. In addition, for any GPA, we can compute the expected SAT score. We can write a function to do this for us.
Thankfully, R has a pre-defined function called predict(). For our purposes, it accepts two arguments: an lm object and newdata. newdata must have the same column names as the variables in the lm object.
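A sketch of how a table of predictions like the one below could be built (the data frame name new_gpas is hypothetical; its column must be called hs_gpa to match the models):

```r
# GPAs at which we want predicted SAT scores
new_gpas <- data.frame(hs_gpa = seq(2, 4, by = 0.25))

# Predicted SAT scores from the male and female regressions
data.frame(hs_gpa = new_gpas$hs_gpa,
           sat_m  = predict(reg_sat_gpa_m, newdata = new_gpas),
           sat_f  = predict(reg_sat_gpa_f, newdata = new_gpas))
```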
hs_gpa sat_m sat_f
1 2.00 83.28204 93.31803
2 2.25 86.71088 96.07282
3 2.50 90.13971 98.82762
4 2.75 93.56855 101.58242
5 3.00 96.99738 104.33721
6 3.25 100.42622 107.09201
7 3.50 103.85505 109.84680
8 3.75 107.28389 112.60160
9 4.00 110.71272 115.35639
It seems that female students consistently outperform male students on the SAT, according to these data.
Another helpful way to look at this would be to plot both lines together with the underlying data.
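A base-R sketch of that plot (colors and labels are my own choices):

```r
# Plot the raw data, then overlay the male and female fitted lines
plot(df$hs_gpa, df$sat_sum, col = "gray", xlab = "High School GPA", ylab = "SAT Score")
abline(reg_sat_gpa_m, col = "blue", lwd = 2)
abline(reg_sat_gpa_f, col = "red", lwd = 2)
legend("topleft", legend = c("Male", "Female"), col = c("blue", "red"), lwd = 2)
```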
In this plot, it’s very clear that the biggest difference between male and female students is when GPAs are relatively low. The gap then decreases as GPAs get larger.
While summary() provides helpful information about our regressions, the output is not very aesthetically pleasing. We are going to revisit the modelsummary package and use its titular function, modelsummary(). This function takes many arguments, but for simplicity, we will just discuss a few:

- models: A named list of models we want to include in the table.
- title: The title of our table.
- estimate: This is a bit complicated, but we are going to set it equal to "{estimate}{stars}". This way, the table will display coefficients, standard errors, and stars to denote statistical significance. This is the norm. If you do not like the stars, you can remove them by not including estimate in the function call.
- coef_map: A named vector of variable names. This renames the variables from how they are labeled in the data to something more human readable.
- gof_map: A vector of desired goodness-of-fit measures for each model. We are going to stick with "nobs" and "r.squared" for the time being.

library("modelsummary")
options("modelsummary_factory_default" = "kableExtra")
regz <- list(`All Students` = reg_sat_gpa,
`Male` = reg_sat_gpa_m,
`Female` = reg_sat_gpa_f)
coefz <- c("hs_gpa" = "High School GPA",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Effect of GPA on SAT Scores",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | All Students | Male | Female |
|---|---|---|---|
| High School GPA | 11.363*** | 13.715*** | 11.019*** |
|  | (0.754) | (1.079) | (1.003) |
| Constant | 66.988*** | 55.851*** | 71.280*** |
|  | (2.444) | (3.579) | (3.183) |
| Num.Obs. | 1000 | 484 | 516 |
| R2 | 0.186 | 0.251 | 0.190 |
Location, Location, Location
Real estate agents will tell you that the most important factor in determining the price of a property is location. Whether or not this is true, there are obviously other things that contribute to a property’s value. For example, square footage, number of bedrooms, age, condition, etc. all play a role in determining a home’s value.
We are going to estimate a bunch of regressions using the ames data to demonstrate economic interpretations of OLS coefficients.

Let’s rename some columns and take a peek at our data.
Next, let’s use modelsummary to create a summary statistics table.
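One way to do this (a sketch; datasummary_skim() is one of several table-making functions in the modelsummary package):

```r
# Quick summary statistics for every numeric column in the ames data
datasummary_skim(ames)
```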
Interpreting the summary statistics table:

- price: This appears to be measured in dollars with a lot of variation (e.g. about 1,000 unique values out of about 3,000 observations). The average sale price is $180,000 with a standard deviation of $80,000. The standard deviation is quite high relative to the mean, meaning the distribution is very wide. It’s likely there are a lot of outliers in the right tail, which is typical of housing price data.
- sqft: This exhibits very similar characteristics to the price variable. The mean is 1,500 with a long right tail, which means there are a few very large homes.
- bedrooms: This variable only takes on 8 unique values, meaning there is not much variation. The average home has about three bedrooms, which matches my prior expectations of average houses. Once again, there is likely a long right tail, evidenced by the maximum of eight bedrooms.

Interpreting the summary statistics table (continued):
- yr_built: This variable has a maximum of 2010, which would be brand new construction, but a mean of about 1970. The minimum value is 1870, which would be quite an old home.
- condition: This variable has a mean and median of about 5. If 5 means "average condition", then this would make sense. However, any quantitative interpretation of this variable is effectively meaningless. What does it mean for a home to improve from a 6 to a 7? Is this the same as moving from a 3 to a 4? This is an ordinal variable, and should not be considered in a linear regression. We will explore this variable nonetheless.

It’s always helpful to create distribution plots for your main variables. Sometimes, this step can help inform you when making modeling decisions. For example, when variables have long right tails, you’ll often see people use a log transformation.
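A quick sketch of such distribution plots using base-R histograms (the original figures may have been produced differently):

```r
# Distributions of sale price and square footage, in levels and in logs
hist(ames$price,      main = "Sale Price",           xlab = "Price ($)")
hist(log(ames$price), main = "log(Sale Price)",      xlab = "log(Price)")
hist(ames$sqft,       main = "Square Footage",       xlab = "Sq. Ft.")
hist(log(ames$sqft),  main = "log(Square Footage)",  xlab = "log(Sq. Ft.)")
```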
What would the log transformation do to our models? Let’s write down four models, visualize the data, and examine estimated parameters:
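Written out with the ames variable names, the four specifications are:

\[
\begin{aligned}
\text{Level - Level:} &\quad \text{price} = \beta_0 + \beta_1\,\text{sqft} + \epsilon \\
\text{Log - Level:} &\quad \text{log}(\text{price}) = \beta_0 + \beta_1\,\text{sqft} + \epsilon \\
\text{Level - Log:} &\quad \text{price} = \beta_0 + \beta_1\,\text{log}(\text{sqft}) + \epsilon \\
\text{Log - Log:} &\quad \text{log}(\text{price}) = \beta_0 + \beta_1\,\text{log}(\text{sqft}) + \epsilon
\end{aligned}
\]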
Of course, log() is a non-linear transformation. However, that does not go against our definition of a linear model. For a model to be linear, it must be linear in its parameters; the variables themselves can be transformed however we like. Therefore, we can have anything of the form \(Y = \alpha + \beta f(X) + \epsilon\), even if \(f(X)\) is a non-linear function of \(X\) such as \(\text{log}(X)\). In fact, we could even estimate an exponential model like \(Y = \alpha \beta^X \epsilon\) after a (relatively) simple log transformation, since taking logs of both sides gives a model that is linear in \(\text{log}(\alpha)\) and \(\text{log}(\beta)\).
Next, to estimate each equation:
r_price_sqft <- lm(price ~ sqft, ames)
r_lprice_sqft <- lm(log(price) ~ sqft, ames)
r_price_lsqft <- lm(price ~ log(sqft), ames)
r_lprice_lsqft <- lm(log(price) ~ log(sqft), ames)
regz <- list(`Level - Level` = r_price_sqft,
`Log - Level` = r_lprice_sqft,
`Level - Log` = r_price_lsqft,
`Log - Log` = r_lprice_lsqft)
coefz <- c("sqft" = "Sq. Ft.",
"log(sqft)" = "log(Sq. Ft.)",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Effect of Sq. Ft. on Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | Level - Level | Log - Level | Level - Log | Log - Log |
|---|---|---|---|---|
| Sq. Ft. | 111.694*** | 0.001*** |  |  |
|  | (2.066) | (0.000) |  |  |
| log(Sq. Ft.) |  |  | 171010.918*** | 0.908*** |
|  |  |  | (3269.113) | (0.016) |
| Constant | 13289.634*** | 11.180*** | −1060765.031*** | 5.430*** |
|  | (3269.703) | (0.017) | (23757.890) | (0.116) |
| Num.Obs. | 2930 | 2930 | 2930 | 2930 |
| R2 | 0.500 | 0.484 | 0.483 | 0.523 |
Interpreting the coefficients of the first model (Level - Level) is similar to how we interpreted the first model of SAT scores and GPAs. If we increase the size of a property by one square foot, we would expect the price to increase by $111.69. Ultimately, this is a very small amount relative to the average and standard deviation of sale price. However, an increase of one square foot is also small relative to the mean and standard deviation of square footage.
When people contemplate additions to their homes, they usually consider adding whole rooms. For the sake of argument, suppose rooms are about 200 square feet. Then, an additional room’s worth of square footage would increase sale price by $22,338, which is a much more intuitive number.
To interpret the constant, we have to ask ourselves whether it makes sense for square footage to be equal to zero. In reality, yes, and that could be interpreted as the value of the land the property is sitting on. However, since the minimum square footage in the data is 334, we should avoid interpreting the constant in this case.
Interpreting models with logarithms switches the units from dollars or square feet to percentages. For example, we would interpret the coefficient in the second column as follows: if the property’s size increases by one square foot, we would expect price to increase by about 0.056% (the underlying coefficient is roughly 0.00056, which the table rounds to 0.001).
As another way to think about this, we know how a one unit increase in square footage would impact price – it would increase it by $111.69. Relative to the average house price ($180,796.1), this is 0.062%. These two models generate very similar output, but put it differently.
To interpret the level-log model, we would say: if square footage increases by one percent, the sale price would increase by $1,710.11. Again, we are going to avoid interpreting the constant in this case.
Finally, to interpret the last model, both variables’ units are changed to percentages. This model says that if the size of a property increases by one percent, its sale price is expected to increase by 0.9%.
Importantly, this coefficient is an elasticity! In other words, it measures the responsiveness of one variable to another. In this case, since the elasticity is less than 1, sale price is inelastic with respect to property size.
Think back to hypothesis testing, where we tested whether a coefficient was different from zero. To do this, we divided the coefficient (minus 0) by its standard error. This gave us a t-statistic that we could then convert into a p-value. In fact, R does all of this for us in summary(), and modelsummary produces stars to represent p-values.
Now, instead of comparing our coefficient to 0 (which would tell us whether the independent variable is related to the outcome variable), we could compare it to 1. This sets up unit elasticity as the null hypothesis. Why would we do this? If we can reject that the coefficient is equal to 1, we have statistical evidence that price is indeed inelastic with respect to square footage. It’s important to note that this is much stronger than simply observing that the estimated coefficient is less than 1.
What would a hypothesis test look like then?
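A sketch of computing this test by hand from the log-log model’s coefficient table (the original notes may compute it differently):

```r
# Test H0: beta_1 = 1 against H1: beta_1 != 1 for the log-log model
coefs  <- summary(r_lprice_lsqft)$coefficients
t_stat <- (coefs["log(sqft)", "Estimate"] - 1) / coefs["log(sqft)", "Std. Error"]
p_val  <- 2 * pt(-abs(t_stat), df = r_lprice_lsqft$df.residual)
p_val
```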
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4301860 0.11644484 46.63312 0
log(sqft) 0.9078053 0.01602294 56.65659 0
p-value: 0.00000000872
Given this p-value, we can reject the null hypothesis that \(\beta_1 = 1\) (in addition to previously rejecting that \(\beta_1 = 0\)). Therefore, the data seem to support the idea that price is inelastic, or less than proportionally responsive, to square footage.
So what? Who cares?
Suppose a contractor tells you that it will cost \(x\) dollars to increase the size of your house by 25%. Assuming you want to sell this home, and the current expected sale price is $200,000, for what values of \(x\) should you expand your home? According to the model, a 25% increase in the size of a home is worth an increase in sale price of 0.9 \(\times\) 25% = 22.5%. On the open market, this addition would increase your sale value by $200,000 \(\times\) 0.25 \(\times\) 0.9 = $45,000. Therefore, investing in the addition would only be worthwhile if \(x\) < $45,000.
As a final note, examine the R\(^2\) for each of these estimations. The fourth model fits the data the best followed by the first model. This might not be too surprising after taking a look at the initial scatterplots. The visual relationships in the level-level and log-log plots appear to be the most “linear”. Compare these with the other two plots which appear to be convex or concave.
Remember, R\(^2\) is not the be-all end-all for determining which model is best. However, this does provide us with some idea of which functional form (log-log) we should be partial to.
Below is a table to help you remember how to interpret each type of log-model.
| Model | Equation | Interpretation |
|---|---|---|
| Level-Level | \(Y = \beta_0 + \beta_1 X\) | A one unit change in \(X\) leads to a \(\beta_1\) unit change in \(Y\). |
| Log-Level | \(\text{log}(Y) = \beta_0 + \beta_1 X\) | A one unit change in \(X\) leads to a \(\beta_1 \times 100\) percent change in \(Y\). |
| Level-Log | \(Y = \beta_0 + \beta_1 \text{log}(X)\) | A one percent change in \(X\) leads to a \(\beta_1 \div 100\) unit change in \(Y\). |
| Log-Log | \(\text{log}(Y) = \beta_0 + \beta_1 \text{log}(X)\) | A one percent change in \(X\) leads to a \(\beta_1\) percent change in \(Y\). |
Next, let’s estimate models with different explanatory variables. First, we’ll estimate the relationship between sale price and number of bedrooms.
r_price_bedrooms <- lm(price ~ bedrooms, ames)
r_lprice_bedrooms <- lm(log(price) ~ bedrooms, ames)
regz <- list(`Price` = r_price_bedrooms,
`log(Price)` = r_lprice_bedrooms)
coefz <- c("bedrooms" = "# of Bedrooms",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Effect of Bedrooms on Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | Price | log(Price) |
|---|---|---|
| # of Bedrooms | 13889.495*** | 0.089*** |
|  | (1765.042) | (0.009) |
| Constant | 141151.743*** | 11.767*** |
|  | (5245.395) | (0.027) |
| Num.Obs. | 2930 | 2930 |
| R2 | 0.021 | 0.033 |
Breaking down the output: an additional bedroom is associated with roughly $13,889 more in sale price, or about 8.9 percent more in the log model, and both coefficients are highly statistically significant (***). The R-squared values, however, are very small, so the number of bedrooms on its own explains little of the variation in sale price.

Next, we can estimate models where a home’s condition is the explanatory variable.
r_price_condition <- lm(price ~ condition, ames)
r_lprice_condition <- lm(log(price) ~ condition, ames)
regz <- list(`Price` = r_price_condition,
`log(Price)` = r_lprice_condition)
coefz <- c("condition" = "Condition",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Effect of Condition on Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | Price | log(Price) |
|---|---|---|
| Condition | −7309.010*** | −0.018** |
|  | (1321.319) | (0.007) |
| Constant | 221457.104*** | 12.119*** |
|  | (7495.923) | (0.038) |
| Num.Obs. | 2930 | 2930 |
| R2 | 0.010 | 0.002 |
First of all, using the condition variable in the regression does not lead to interpretable coefficients. Condition is an ordinal (and subjective) variable, which makes a one unit change meaningless.
That said, the results of these models are suspicious. Since higher numbers mean better condition, the coefficient estimates should still be positive even if it’s an ordinal variable. Remember the four reasons you might find a correlation between two variables. In this case, there is likely a third variable that is driving both condition and sale price.
Let’s try to investigate and figure out why this might be.
First, let’s check out the distribution of the condition variable.
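A sketch of one way to tabulate it:

```r
# Counts and shares of homes at each condition rating
table(ames$condition)
round(prop.table(table(ames$condition)), 2)
```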
Clearly, there are many homes with a condition of 5. In fact, 56% of homes are rated to be a 5.
Next, let’s examine the average sale price for each possible value for condition.
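A sketch using aggregate() (group means could also be computed with tapply() or dplyr):

```r
# Average sale price at each condition rating
aggregate(price ~ condition, data = ames, FUN = mean)
```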
There appears to be an upward trend in price as condition increases. However, there is an outlier value when condition equals 5. There must be a bunch of high price homes that have been given a condition of 5.
Theoretically, the condition of a home should be related to its age (or the year it was built). Let’s examine the relationship between year built and condition by finding the average condition for each possible year built.
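A sketch of computing and plotting average condition by year built:

```r
# Average condition rating for each year of construction
cond_by_year <- aggregate(condition ~ yr_built, data = ames, FUN = mean)
plot(cond_by_year$yr_built, cond_by_year$condition,
     xlab = "Year Built", ylab = "Average Condition")
```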
There seems to be a negative relationship between year built and condition. However, the correlation between year built and price is positive. It’s likely that year built is driving both price and condition simultaneously. The negative coefficient between condition and price is due to the negative relationship between year built and condition.
What do models look like where year built is the main explanatory variable? Using the year a home was built does not make as much intuitive sense as thinking about the age of a home. Since all of these homes were sold between 2006 and 2010, according to the data description, we can calculate the age of a property by subtracting the year built from 2011. Let’s run some models with age as the explanatory variable.
ames$age <- 2011 - ames$yr_built
r_price_age <- lm(price ~ age, ames)
r_lprice_age <- lm(log(price) ~ age, ames)
r_price_lage <- lm(price ~ log(age), ames)
r_lprice_lage <- lm(log(price) ~ log(age), ames)
regz <- list(`Level - Level` = r_price_age,
`Log - Level` = r_lprice_age,
`Level - Log` = r_price_lage,
`Log - Log` = r_lprice_lage)
coefz <- c("age" = "Age",
"log(age)" = "log(Age)",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Effect of Age on Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | Level - Level | Log - Level | Level - Log | Log - Log |
|---|---|---|---|---|
| Age | −1474.964*** | −0.008*** |  |  |
|  | (40.493) | (0.000) |  |  |
| log(Age) |  |  | −47030.296*** | −0.251*** |
|  |  |  | (1098.021) | (0.005) |
| Constant | 239269.065*** | 12.350*** | 333731.719*** | 12.838*** |
|  | (2018.987) | (0.010) | (3753.501) | (0.019) |
| Num.Obs. | 2930 | 2930 | 2930 | 2930 |
| R2 | 0.312 | 0.379 | 0.385 | 0.423 |
Interpreting these models:

- Level - Level: an additional year of age is associated with a sale price that is about $1,475 lower.
- Log - Level: an additional year of age is associated with a sale price that is about 0.8 percent lower.
- Level - Log: a one percent increase in age is associated with a sale price that is about $470 lower (−47,030 ÷ 100).
- Log - Log: a one percent increase in age is associated with a sale price that is about 0.25 percent lower, so price is also inelastic with respect to age.
A final point for this module is what happens to coefficient estimates when we apply different transformations to the variables before estimation. So far, we’ve shown the effect of a log() transformation. Now we’ll look at the effect of addition/subtraction and multiplication/division.
r_price_sqft <- lm(price ~ sqft, ames)
r_price_sqftdiv <- lm(price ~ I(sqft/1000), ames)
r_pricediv_sqft <- lm((price/1000) ~ sqft, ames)
r_price_sqftsub <- lm(price ~ I(sqft - 1500), ames)
r_pricesub_sqft <- lm(I(price - 180000) ~ sqft, ames)
regz <- list(`Original` = r_price_sqft,
`Sq Ft. / 1000` = r_price_sqftdiv,
`Price / 1000` = r_pricediv_sqft,
`Sq Ft. - 1500` = r_price_sqftsub,
`Price - 180K` = r_pricesub_sqft)
coefz <- c("sqft" = "Sq Ft.",
"I(sqft/1000)" = "Sq Ft.",
"I(sqft - 1500)" = "Sq Ft.")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Variable Manipulation",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | Original | Sq Ft. / 1000 | Price / 1000 | Sq Ft. - 1500 | Price - 180K |
|---|---|---|---|---|---|
| Sq Ft. | 111.694*** | 111694.001*** | 0.112*** | 111.694*** | 111.694*** |
|  | (2.066) | (2066.073) | (0.002) | (2.066) | (2.066) |
| Num.Obs. | 2930 | 2930 | 2930 | 2930 | 2930 |
| R2 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 |
ECON 311: Economics, Causality, and Analytics