One X Variable

Module 3.3: OLS in R

Alex Cardazzi

Old Dominion University

OLS in R

In the last set of notes, we derived the OLS estimator and calculated statistics for hypothesis tests and measures of goodness of fit. In this set of notes, we are going to talk more about application.

First, to estimate a linear regression model, we will use the lm() function.

lm(), which stands for “linear model”, accepts the following arguments:

  • formula: Formulas in R take the following form: y ~ x1 + x2, where the outcome variable goes on the left of the ~ and the explanatory variable(s) on the right. In the case of GPAs and SAT scores, we would write: sat_sum ~ hs_gpa.
  • data: This is the data you plan to use. This tells R where to look for the variables in your formula.
  • subset: An optional vector that specifies a subset of observations used to fit the line. For example, maybe we want to fit two lines: one for men and one for women. We would use subset = df$sex == 1 to estimate the model using only data from female students. Another reason to use this would be to eliminate outliers, etc.
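Putting these arguments together, a generic call looks like the sketch below. The data frame and variable names here are made up purely for illustration:

Code
# Hypothetical data, just to illustrate the three arguments
set.seed(1)
my_data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                      group = rep(1:2, 50))
lm(y ~ x1 + x2, data = my_data)                               # full sample
lm(y ~ x1 + x2, data = my_data, subset = my_data$group == 1)  # only group 1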

OLS in R

We are going to practice on two sets of data. First, we are going to keep examining the GPA data (data; documentation). Second, we are going to explore some data on real estate transactions in Ames, Iowa (data; documentation). Let’s read both sets of data into R:

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
ames <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/ames.csv")

There are a lot of columns in the ames data. We are only interested in a few (for now), so I am going to subset the data to keep only the columns we’ll use.

Code
keep_columns <- c("price", "area", "Bedroom.AbvGr", "Year.Built", "Overall.Cond")
ames <- ames[,keep_columns]

OLS in R

Let’s estimate the SAT and GPA model from the previous section of notes.

Code
reg_sat_gpa <- lm(sat_sum ~ hs_gpa, data = df)
reg_sat_gpa
Output

Call:
lm(formula = sat_sum ~ hs_gpa, data = df)

Coefficients:
(Intercept)       hs_gpa  
      66.99        11.36  

Printing the output of lm() shows the call you used to generate it in addition to the estimated coefficients. These coefficients are exactly what we previously estimated with cov(), var(), cor(), etc. Not only does lm() return the coefficients, but it also stores both the fitted values (reg_sat_gpa$fitted.values) and the residuals (reg_sat_gpa$residuals).
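As a sanity check, we can recompute these two numbers by hand from the formulas in the previous notes, reusing the df object loaded above:

Code
# Slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
b1 <- cov(df$hs_gpa, df$sat_sum) / var(df$hs_gpa)
b0 <- mean(df$sat_sum) - b1 * mean(df$hs_gpa)
c(intercept = b0, slope = b1) # should match coef(reg_sat_gpa)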

Code
head(data.frame(observed_y = df$sat_sum,
                fitted_y = reg_sat_gpa$fitted.values,
                residual_y = reg_sat_gpa$residuals))
Output
  observed_y fitted_y residual_y
1        127 105.6232  21.376776
2        122 112.4411   9.558875
3        116 109.6003   6.399667
4         95 109.6003 -14.600333
5        107 112.4411  -5.441125
6        111 112.4411  -1.441125

OLS in R

We can plot our regression line using the abline() function.

Code
library("scales")
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("black", 0.2),
     xlab = "GPA", ylab = "SAT")
abline(reg_sat_gpa, lty = 1, col = "gold", lwd = 4)
Plot

OLS in R

Using the summary() function generates some important information about our regression and the individual coefficients.

Code
summary(reg_sat_gpa)
Output

Call:
lm(formula = sat_sum ~ hs_gpa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.055  -8.669  -0.351   8.803  34.240 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  66.9885     2.4441   27.41   <2e-16 ***
hs_gpa       11.3632     0.7535   15.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.9 on 998 degrees of freedom
Multiple R-squared:  0.1856,    Adjusted R-squared:  0.1848 
F-statistic: 227.4 on 1 and 998 DF,  p-value: < 2.2e-16

In the table portion, each coefficient has an Estimate, Std. Error, t value, and Pr(>|t|). In addition, you can find the Multiple R-squared value, which is our measure of goodness of fit.

OLS in R

Regression for Male Students:

Code
reg_sat_gpa_m <- lm(sat_sum ~ hs_gpa,
                    data = df, subset = df$sex == 2)
summary(reg_sat_gpa_m)
Output

Call:
lm(formula = sat_sum ~ hs_gpa, data = df, subset = df$sex == 
    2)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.997  -8.305  -0.455   8.574  33.287 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   55.851      3.579   15.61   <2e-16 ***
hs_gpa        13.715      1.079   12.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.33 on 482 degrees of freedom
Multiple R-squared:  0.2512,    Adjusted R-squared:  0.2496 
F-statistic: 161.7 on 1 and 482 DF,  p-value: < 2.2e-16

Regression for Female Students:

Code
reg_sat_gpa_f <- lm(sat_sum ~ hs_gpa,
                    data = df, subset = df$sex == 1)
summary(reg_sat_gpa_f)
Output

Call:
lm(formula = sat_sum ~ hs_gpa, data = df, subset = df$sex == 
    1)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.497  -8.323   0.153   8.877  31.153 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   71.280      3.183   22.39   <2e-16 ***
hs_gpa        11.019      1.003   10.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.55 on 514 degrees of freedom
Multiple R-squared:  0.1901,    Adjusted R-squared:  0.1885 
F-statistic: 120.6 on 1 and 514 DF,  p-value: < 2.2e-16

OLS in R

Take a look at the Pr(>|t|) column. This contains the p-value for the test of the null hypothesis \(H_0: \beta_i = 0\) against the alternative \(H_A: \beta_i \neq 0\). The value <2e-16 means that the p-value is smaller than 0.0000000000000002; R does not report p-values smaller than this because they fall below the machine's numerical precision. In addition, R prints asterisks next to coefficients whose p-values fall below common significance thresholds.

  • Nothing next to the p-value indicates \(0.1 < p \leq 1\).
  • . next to the p-value indicates \(0.05 < p \leq 0.1\)
  • * next to the p-value indicates \(0.01 < p \leq 0.05\)
  • ** next to the p-value indicates \(0.001 < p \leq 0.01\)
  • *** next to the p-value indicates \(0 < p \leq 0.001\)

This is helpful when looking to see if your coefficients are significant at different levels.

OLS in R

How can we interpret these results? Let’s start with \(\beta_0\). For a male student with a GPA of 0, the expected SAT score is 55.85. For a female student with a GPA of 0, the expected SAT score is 71.28. In addition, for any GPA, we can compute the expected SAT score. We can write a function to do this for us.
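Before turning to R's built-in tool on the next slide, here is a hand-rolled sketch; the function name predict_sat is my own, not part of base R:

Code
# Expected SAT score: intercept + slope * GPA, pulled from a fitted lm object
predict_sat <- function(model, gpa){
  unname(coef(model)[1] + coef(model)[2] * gpa)
}
predict_sat(reg_sat_gpa_m, 3.0) # expected SAT sum for a male student with a 3.0 GPA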

OLS in R

Thankfully, R has a pre-defined function called predict(). For our purposes, this function accepts two arguments: an lm object and newdata. newdata must be a data.frame whose column names match the explanatory variables in the lm object.

Code
# Create a new data.frame of GPAs from 2 to 4 by 0.25 increments.
pred_df <- data.frame(hs_gpa = seq(2, 4, by = 0.25))
pred_df$sat_m <- predict(reg_sat_gpa_m, newdata = pred_df)
pred_df$sat_f <- predict(reg_sat_gpa_f, newdata = pred_df)
pred_df
Output
  hs_gpa     sat_m     sat_f
1   2.00  83.28204  93.31803
2   2.25  86.71088  96.07282
3   2.50  90.13971  98.82762
4   2.75  93.56855 101.58242
5   3.00  96.99738 104.33721
6   3.25 100.42622 107.09201
7   3.50 103.85505 109.84680
8   3.75 107.28389 112.60160
9   4.00 110.71272 115.35639

It seems that female students consistently outperform male students on the SAT, according to these data.

OLS in R

Another helpful way to look at this would be to plot both lines together with the underlying data.
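One way to build such a plot, reusing the male and female models fit above (the colors and legend placement are just my choices):

Code
library("scales")
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("black", 0.2),
     xlab = "GPA", ylab = "SAT")
abline(reg_sat_gpa_m, col = "dodgerblue", lwd = 3) # male students
abline(reg_sat_gpa_f, col = "tomato", lwd = 3)     # female students
legend("topleft", legend = c("Male", "Female"),
       col = c("dodgerblue", "tomato"), lwd = 3, bty = "n")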

In this plot, it’s very clear that the biggest difference between male and female students is when GPAs are relatively low. The gap then decreases as GPAs get larger.

OLS in R

While summary() provides helpful information about our regressions, the output is not very aesthetically pleasing. We are going to revisit the modelsummary package, and use the titular function modelsummary(). This function takes many arguments, but for simplicity, we will just discuss a few:

  • models: A named list of models we want to include in the table.
  • title: The title of our table.
  • estimate: This is a bit complicated, but we are going to set this equal to "{estimate}{stars}". This way, the table will display coefficients, standard errors, and stars to denote statistical significance. This is the norm. If you do not like the stars, you can remove them by not including estimate in the function call.
  • coef_map: A named vector of variable names. This renames the variables from how they are labeled in the data to something more human readable.
  • gof_map: A vector of desired goodness of fit measures for each model. We are going to stick with "nobs" and "r.squared" for the time being.

OLS in R

Code
library("modelsummary")
options("modelsummary_factory_default" = "kableExtra")

regz <- list(`All Students` = reg_sat_gpa,
             `Male` = reg_sat_gpa_m,
             `Female` = reg_sat_gpa_f)
coefz <- c("hs_gpa" = "High School GPA",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

                  All Students   Male        Female
High School GPA   11.363***      13.715***   11.019***
                  (0.754)        (1.079)     (1.003)
Constant          66.988***      55.851***   71.280***
                  (2.444)        (3.579)     (3.183)
Num.Obs.          1000           484         516
R2                0.186          0.251       0.190

Ames, Iowa

Location, Location, Location

Real estate agents will tell you that the most important factor in determining the price of a property is location. Whether or not this is true, there are obviously other things that contribute to a property’s value. For example, square footage, number of bedrooms, age, condition, etc. all play a role in determining a home’s value.

A note about endogeneity. If you’ve taken Transportation Economics, you’ll be familiar with the Monocentric City Model. In this model, housing characteristics, specifically square footage, are chosen endogenously. In other words, distance to the city center will change the size and price of dwellings. So, if you find a correlation between size and price, there might be a third variable lurking (i.e., distance to a city’s center) that determines both. Therefore, the correlation or regression between square footage (or most other housing characteristics) and price is not causal.

We are going to estimate a bunch of regressions using the ames data to demonstrate economic interpretations of OLS coefficients.

Summary Statistics

Let’s rename some columns and take a peek at our data.

Code
colnames(ames) <- c("price", "sqft", "bedrooms", "yr_built", "condition")
View(head(ames, 100)) # View() in RStudio will allow you to see the data in a tab.

Summary Statistics

Next, let’s use modelsummary to create a summary statistics table.

Code
datasummary_skim(ames)
            Unique   Missing Pct.   Mean       SD        Min       Median     Max
price       1032     0              180796.1   79886.7   12789.0   160000.0   755000.0
sqft        1292     0              1499.7     505.5     334.0     1442.0     5642.0
bedrooms    8        0              2.9        0.8       0.0       3.0        8.0
yr_built    118      0              1971.4     30.2      1872.0    1973.0     2010.0
condition   9        0              5.6        1.1       1.0       5.0        9.0

Summary Statistics

Interpreting the summary statistics table:

  • price: This appears to be measured in dollars with a lot of variation (e.g. about 1000 unique values out of about 3000 observations). The average sale price is $180,000 with a standard deviation of $80,000. The standard deviation is quite high relative to the mean, meaning the distribution is very wide. It’s likely there are a lot of outliers in the right tail, which is typical of housing price data.
  • sqft: This exhibits very similar characteristics as the price variable. The mean is 1,500 with a long right tail, which means there are a few very large homes.
  • bedrooms: This variable only takes on 8 unique values, meaning there is not much variation. The average home has about three bedrooms, which matches my prior expectations of average houses. Once again, there is likely a long right tail, evidenced by the maximum of eight bedrooms.

Summary Statistics

Interpreting the summary statistics table (continued):

  • yr_built: This variable has a maximum of 2010, which would be brand new construction, but a mean of about 1970. The minimum value is 1872, which would be quite an old home.
  • condition: This variable has a mean and median of about 5. If 5 means “average condition”, then this would make sense. However, any quantitative interpretation of this variable is effectively meaningless. What does it mean for a home to improve from a 6 to a 7? Is this the same as moving from a 3 to a 4? This is an ordinal variable and, strictly speaking, should not enter a linear regression as if it were a continuous measure. We will explore this variable nonetheless.

Summary Statistics

It’s always helpful to create distribution plots for your main variables. Sometimes, this step can help inform you when making modeling decisions. For example, when variables have long right tails, you’ll often see people use a log transformation.

Code
par(mfrow = c(1, 2))
plot(table(round(ames$price/10000)) / nrow(ames),
     xlab = "Price", ylab = "Rel. Freq.")
# plot(table(round(ames$sqft/100)))
plot(table(round(log(ames$price), 1)) / nrow(ames),
     xlab = "log(Price)", ylab = "Rel. Freq.")
par(mfrow = c(1, 1))
Plot

Log Models

What would the log transformation do to our models? Let’s write down four models, visualize the data, and examine estimated parameters:

  • Level - Level: \(\text{Price}_i = \alpha_0 + \alpha_1 \times \text{Sq. Ft.}_i + \epsilon_i\)
  • Log - Level: \(log(\text{Price}_i) = \beta_0 + \beta_1 \times \text{Sq. Ft.}_i + \epsilon_i\)
  • Level - Log: \(\text{Price}_i = \gamma_0 + \gamma_1 \times log(\text{Sq. Ft.}_i) + \epsilon_i\)
  • Log - Log: \(log(\text{Price}_i) = \delta_0 + \delta_1 \times log(\text{Sq. Ft.}_i) + \epsilon_i\)

Of course, log() is a non-linear transformation. However, that does not go against our definition of a linear model. For a model to be linear, it must be linear in its parameters; the variables themselves can be transformed however we like. Therefore, we can have anything of the form \(Y = \alpha + \beta \, f(X) + \epsilon\), where \(f(\cdot)\) is some (possibly non-linear) transformation of \(X\). In fact, we could even estimate an exponential model like \(Y = \alpha \beta^X \epsilon\) following a (relatively) simple log transformation, since taking logs of both sides gives \(\log(Y) = \log(\alpha) + \log(\beta) X + \log(\epsilon)\).
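As a small illustration, that exponential model can be estimated with the tools we already have. This assumes a multiplicative error term, and it is the same regression as the Log - Level model estimated below, just read through the exponential form:

Code
# Log both sides: log(price) = log(alpha) + log(beta) * sqft
r_exp <- lm(log(price) ~ sqft, data = ames)
exp(coef(r_exp)) # exp() of the estimates recovers alpha and beta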

Log Models

Plot

Log Models

Next, to estimate each equation:

Code
r_price_sqft <- lm(price ~ sqft, ames)
r_lprice_sqft <- lm(log(price) ~ sqft, ames)
r_price_lsqft <- lm(price ~ log(sqft), ames)
r_lprice_lsqft <- lm(log(price) ~ log(sqft), ames)

regz <- list(`Level - Level` = r_price_sqft,
             `Log - Level` = r_lprice_sqft,
             `Level - Log` = r_price_lsqft,
             `Log - Log` = r_lprice_lsqft)
coefz <- c("sqft" = "Sq. Ft.",
           "log(sqft)" = "log(Sq. Ft.)",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of Sq. Ft. on Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of Sq. Ft. on Sale Price

               Level - Level   Log - Level   Level - Log       Log - Log
Sq. Ft.        111.694***      0.001***
               (2.066)         (0.000)
log(Sq. Ft.)                                 171010.918***     0.908***
                                             (3269.113)        (0.016)
Constant       13289.634***    11.180***     −1060765.031***   5.430***
               (3269.703)      (0.017)       (23757.890)       (0.116)
Num.Obs.       2930            2930          2930               2930
R2             0.500           0.484         0.483              0.523

Interpreting \(\beta\)

Interpreting the coefficients of the first model (Level - Level) is similar to how we interpreted the first model of SAT scores and GPAs. If we increase the size of a property by one square foot, we would expect the price to increase by $111.69. Ultimately, this is a very small amount relative to the average and standard deviation of sale price. However, an increase of one square foot is also small relative to the mean and standard deviation of square footage.

When people contemplate additions to their homes, they usually consider adding whole rooms. For the sake of argument, suppose rooms are about 200 square feet. Then, an additional room’s worth of square footage would increase sale price by $22,338, which is a much more intuitive number.
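A quick version of that arithmetic in R (the 200 square foot room is, again, just an assumption for the sake of argument):

Code
# Expected change in sale price from adding roughly one room's worth of space
coef(r_price_sqft)["sqft"] * 200 # about $22,338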

To interpret the constant, we have to ask ourselves whether it makes sense for square footage to be equal to zero. In reality, yes, and that could be interpreted as the value of the land the property is sitting on. However, since the minimum square footage in the data is 334, we should avoid interpreting the constant in this case.

Interpreting \(\beta\)

Interpreting models with logarithms switches the units from dollars or square feet to percentages. For example, we would interpret the coefficient in the second column as follows: if the property’s size increases by one square foot, we would expect price to increase by about 0.056% (the table rounds this coefficient to 0.001; the unrounded estimate is roughly 0.00056, or 0.056% once multiplied by 100).

As another way to think about this, we know from the level-level model how a one square foot increase would impact price: it would increase it by $111.69. Relative to the average house price ($180,796.10), this is about 0.062%. The two models convey very similar information, just expressed in different units.
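That comparison is easy to reproduce (a rough calculation that ignores rounding):

Code
# Level-level slope expressed as a percentage of the average sale price
coef(r_price_sqft)["sqft"] / mean(ames$price) * 100 # about 0.062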

Interpreting \(\beta\)

To interpret the level-log model, we would say: if square footage increases by one percent, the sale price would increase by $1,710.11. Again, we are going to avoid interpreting the constant in this case.

Interpreting \(\beta\)

Finally, to interpret the last model, both variables’ units are changed to percentages. This model says that if the size of a property increases by one percent, its sale price is expected to increase by 0.9%.

Importantly, this coefficient is an elasticity! In other words, it measures the responsiveness of one variable to another. In this case, since the elasticity is less than 1, sale price is inelastic with respect to property size.

Hypothesis Testing

Think back to hypothesis testing where we tested if a coefficient was different from zero. To do this, we divided the coefficient (minus 0) by its standard error. This gave us a t-statistic that we could then convert into a p-value. In fact, R does all of this for us in summary(), and modelsummary produces stars to represent p-values.

Now, instead of comparing our coefficient to 0 (which would tell us whether the independent variable is related to the outcome variable), we could compare it to 1. This would set up unit elasticity as the null hypothesis. Why would we do this? If we can reject that the coefficient is equal to 1, we would have statistical evidence that price is indeed inelastic with respect to square footage. It’s important to note that this is much stronger than saying the relationship is inelastic simply because the point estimate is less than 1.

Hypothesis Testing

What would a hypothesis test look like then?

Code
coefz <- coef(summary(r_lprice_lsqft)) # matrix of estimates, std. errors, t values, and p-values
coefz; cat("\n")
test_stat <- (coefz[2,1] - 1) / coefz[2,2] # t-statistic for H0: beta_1 = 1
cat("p-value:", format(2 * pnorm(abs(test_stat), lower.tail = F), scientific = F, digits = 3))
Output
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 5.4301860 0.11644484 46.63312        0
log(sqft)   0.9078053 0.01602294 56.65659        0

p-value: 0.00000000872

Given this p-value, we can reject the null hypothesis that \(\beta_1 = 1\) (in addition to previously rejecting that \(\beta_1 = 0\)). Therefore, the data seem to support the idea that price is inelastic, or less than proportionally responsive, with respect to square footage.

Hypothesis Testing

So what? Who cares?

Suppose a contractor tells you that it will cost \(x\) dollars to increase the size of your house by 25%. Assuming you want to sell this home, and the current expected sale price is $200,000, for what values of \(x\) should you expand your home? According to the model, a 25% increase in the size of a home is worth an increase in sale price of 25% \(\times\) 0.9. On the open market, this addition would increase your sale value by: $200,000 \(\times\) 0.25 \(\times\) 0.9 = $45,000. Therefore, investing in the addition would only be worthwhile if \(x\) is less than $45,000.
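The same calculation in R, using the unrounded elasticity estimate rather than 0.9:

Code
# Implied value of a 25% expansion on a $200,000 home
200000 * 0.25 * coef(r_lprice_lsqft)["log(sqft)"] # close to the $45,000 computed above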

Goodness of Fit

As a final note, examine the R\(^2\) for each of these estimations. The fourth model fits the data the best followed by the first model. This might not be too surprising after taking a look at the initial scatterplots. The visual relationships in the level-level and log-log plots appear to be the most “linear”. Compare these with the other two plots which appear to be convex or concave.

Remember, R\(^2\) is not the be-all and end-all for determining which model is best. However, it does give us some idea of which functional form (here, log-log) we should be partial to.

Log Interpretation Table

Below is a table to help you remember how to interpret each type of log-model.

Coefficient Interpretation
Model Equation Interpretation
Level-Level \(Y = \beta_0 + \beta_1 X\) A one unit change in \(X\) leads to a \(\beta_1\) unit change in \(Y\).
Log-Level \(\text{log}(Y) = \beta_0 + \beta_1 X\) A one unit change in \(X\) leads to a \(\beta_1 \times 100\) percent change in \(Y\).
Level-Log \(Y = \beta_0 + \beta_1 \text{log}(X)\) A one percent change in \(X\) leads to a \(\beta_1 \div 100\) unit change in \(Y\).
Log-Log \(\text{log}(Y) = \beta_0 + \beta_1 \text{log}(X)\) A one percent change in \(X\) leads to a \(\beta_1\) percent change in \(Y\).

Bedrooms

Next, let’s estimate models with different explanatory variables. First, we’ll estimate the relationship between sale price and number of bedrooms.

Code
r_price_bedrooms <- lm(price ~ bedrooms, ames)
r_lprice_bedrooms <- lm(log(price) ~ bedrooms, ames)

regz <- list(`Price` = r_price_bedrooms,
             `log(Price)` = r_lprice_bedrooms)
coefz <- c("bedrooms" = "# of Bedrooms",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of Bedrooms on Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of Bedrooms on Sale Price

                 Price           log(Price)
# of Bedrooms    13889.495***    0.089***
                 (1765.042)      (0.009)
Constant         141151.743***   11.767***
                 (5245.395)      (0.027)
Num.Obs.         2930            2930
R2               0.021           0.033

Bedrooms

Breaking down the output:

  1. I did not estimate models where log(bedrooms) was the independent variable. This is because thinking about marginal changes in bedrooms as a percentage doesn’t make much sense. For example, what does it mean for a home to have a 1% increase in the number of bedrooms?
  2. All coefficients are statistically significant (or, different from zero) at the 0.001 level (***).
  3. The constant/intercept is not interpretable because a property with zero bedrooms cannot exist.
  4. The slope coefficient in the first model suggests that an additional bedroom will increase price by about $14,000.
  5. The slope coefficient in the second model suggests that an additional bedroom will increase price by about 9%.

Home Condition

Next, we can estimate models where a home’s condition is the explanatory variable.

Code
r_price_condition <- lm(price ~ condition, ames)
r_lprice_condition <- lm(log(price) ~ condition, ames)

regz <- list(`Price` = r_price_condition,
             `log(Price)` = r_lprice_condition)
coefz <- c("condition" = "# of Condition",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of Condition on Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of Condition on Sale Price

              Price           log(Price)
Condition     −7309.010***    −0.018**
              (1321.319)      (0.007)
Constant      221457.104***   12.119***
              (7495.923)      (0.038)
Num.Obs.      2930            2930
R2            0.010           0.002

Home Condition

First of all, using the condition variable in the regression does not lead to interpretable coefficients. Condition is an ordinal (and subjective) variable, which makes a one unit change meaningless.

That said, the results of these models are suspicious. Since higher numbers mean better condition, we would still expect the coefficient estimates to be positive, even though the variable is ordinal. Remember the four reasons you might find a correlation between two variables. In this case, there is likely a third variable that is driving both condition and sale price.

Let’s try to investigate and figure out why this might be.

Home Condition

First, let’s check out the distribution of the condition variable.

Code
plot(table(ames$condition), las = 1,
     main = "Distribution of Condition",
     xlab = "Condition", ylab = "Frequency")
Plot

Clearly, there are many homes with a condition of 5. In fact, 56% of homes are rated to be a 5.
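That share can be verified directly:

Code
mean(ames$condition == 5) # proportion of homes rated exactly 5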

Home Condition

Next, let’s examine the average sale price for each possible value for condition.

Code
agg <- aggregate(list(price = ames$price), list(condition = ames$condition), mean)
plot(agg$condition, agg$price/1000, las = 1, pch = 19,
     xlab = "Condition", ylab = "Price (in Thousands)")
Plot

There appears to be an upward trend in price as condition increases. However, there is an outlier value when condition equals 5. There must be a bunch of high price homes that have been given a condition of 5.

Home Condition

Theoretically, the condition of a home should be related to its age (or the year it was built). Let’s examine the relationship between year built and condition by finding the average condition for each possible year built.

Code
agg <- aggregate(list(condition = ames$condition), list(yr = ames$yr_built), mean)
plot(agg$yr, agg$condition, las = 1, pch = 19,
     col = alpha("mediumseagreen", 0.6),
     xlab = "Year Built", ylab = "Average Condition")
Plot

There seems to be a negative relationship between year built and condition. However, the correlation between year built and price is positive. It’s likely that year built is driving both price and condition simultaneously. The negative coefficient between condition and price is due to the negative relationship between year built and condition.
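We can check the signs of these pairwise relationships directly (simple correlations, not causal claims):

Code
cor(ames$yr_built, ames$price)     # positive: newer homes tend to sell for more
cor(ames$yr_built, ames$condition) # negative: newer homes tend to receive lower condition ratings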

Home Age

What do models look like when the year a home was built is the main explanatory variable? Using the year built does not make as much intuitive sense as thinking about the age of a home. Since all of these homes were sold between 2006 and 2010, according to the data description, we can calculate the age of a property by subtracting the year built from 2011. Let’s run some models with age as the explanatory variable.

Code
ames$age <- 2011 - ames$yr_built
r_price_age <- lm(price ~ age, ames)
r_lprice_age <- lm(log(price) ~ age, ames)
r_price_lage <- lm(price ~ log(age), ames)
r_lprice_lage <- lm(log(price) ~ log(age), ames)

regz <- list(`Level - Level` = r_price_age,
             `Log - Level` = r_lprice_age,
             `Level - Log` = r_price_lage,
             `Log - Log` = r_lprice_lage)
coefz <- c("age" = "Age",
           "log(age)" = "log(Age)",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of Age on Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of Age on Sale Price

             Level - Level   Log - Level   Level - Log      Log - Log
Age          −1474.964***    −0.008***
             (40.493)        (0.000)
log(Age)                                   −47030.296***    −0.251***
                                           (1098.021)       (0.005)
Constant     239269.065***   12.350***     333731.719***    12.838***
             (2018.987)      (0.010)       (3753.501)       (0.019)
Num.Obs.     2930            2930          2930              2930
R2           0.312           0.379         0.385             0.423

Home Age

Interpreting these models:

  1. Increasing the age of a home by one year reduces the sale price by $1,475.
  2. Increasing the age of a home by one year reduces the sale price by 0.8%.
  3. Increasing the age of a home by one percent reduces the sale price by $470.30.
  4. Increasing the age of a home by one percent reduces the sale price by 0.25%.

Other Manipulations

A final point for this module is what happens to coefficient estimates when we apply different transformations to our variables before estimation. So far, we’ve shown the effect of using a log() transformation. Now we’ll look at the effect of adding/subtracting a constant and multiplying/dividing by a constant.

Code
r_price_sqft <- lm(price ~ sqft, ames)                 # original specification
r_price_sqftdiv <- lm(price ~ I(sqft/1000), ames)      # rescale X (divide by 1,000)
r_pricediv_sqft <- lm((price/1000) ~ sqft, ames)       # rescale Y (divide by 1,000)
r_price_sqftsub <- lm(price ~ I(sqft - 1500), ames)    # shift X (subtract roughly the mean sq. ft.)
r_pricesub_sqft <- lm(I(price - 180000) ~ sqft, ames)  # shift Y (subtract roughly the mean price)

regz <- list(`Original` = r_price_sqft,
             `Sq Ft. / 1000` = r_price_sqftdiv,
             `Price / 1000` = r_pricediv_sqft,
             `Sq Ft. - 1500` = r_price_sqftsub,
             `Price - 180K` = r_pricesub_sqft)
coefz <- c("sqft" = "Sq Ft.",
           "I(sqft/1000)" = "Sq Ft.",
           "I(sqft - 1500)" = "Sq Ft.")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Variable Manipulation",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Variable Manipulation

           Original     Sq Ft. / 1000   Price / 1000   Sq Ft. - 1500   Price - 180K
Sq Ft.     111.694***   111694.001***   0.112***       111.694***      111.694***
           (2.066)      (2066.073)      (0.002)        (2.066)         (2.066)
Num.Obs.   2930         2930            2930           2930            2930
R2         0.500        0.500           0.500          0.500           0.500

Other Manipulations

  1. If you divide (multiply) your explanatory variable by \(x\), the coefficient will be multiplied (divided) by \(x\).
  2. If you divide (multiply) your outcome variable by \(x\), the coefficient will be divided (multiplied) by \(x\).
  3. Adding or subtracting from either variable does not change the slope parameter.