One X Variable

Module 3.1: Correlation

Alex Cardazzi

Old Dominion University

All materials can be found at alexcardazzi.github.io.

Economic Relationships

Thus far, we have mostly described single variables. We have examined a variable’s distribution (central tendency, dispersion), made claims about the population it originates from (confidence intervals), and tested hypotheses we’ve developed. However, we are not usually interested in single variables by themselves. More often, economists are interested in relationships between two (or more) variables.

When two variables are in fact related, learning about one of them reveals something about the other one. For example, consider SAT and ACT scores. If you found out what your friend scored on the SAT, you could probably guess their ACT score within a few points.

The point of this module is to learn how to quantify and exploit these types of relationships.

Economic Models

Economists often create simplified theoretical models of the world to study how it works. A famous example of an economic model is the law of demand.

The quantity demanded of a product or service is a function of, or determined by, its price. As price increases, quantity demanded will decrease.1

Of course, there are a lot of factors that impact the quantity demanded of a product or service. However, creating models that account for every little factor would be paralyzing. So, economists focus on what they believe to be the most important factors.

What are some general characteristics of economic models?

Economic models identify the causal forces most relevant to a decision-maker’s behavior, which makes these models predictive. In the case of demand, price causes quantity demanded.

Economic models are intended to characterize important determinants of average behavior, not capture every little nuance.

Economic models generally describe direction of change, but not necessarily magnitude of change. In the case of demand, increases in price lead to decreases in quantity demanded.

One of the main points of this course is to turn qualitative economic models into quantitative statistical models.

When creating an economic model, one needs to consider two kinds of variables:

An exogenous variable is determined outside of the system. In other words, these variables are as good as random. In the case of demand, this would be the price.

An endogenous variable is determined inside of the system. These variables can be explained by, or are functions of, the exogenous variables. In the case of demand, this would be quantity demanded.

As a shortcut, you can think of “endogenous” variables as \(Y\) variables and “exogenous” variables as \(X\) variables, insofar as \(X\) causes \(Y\).

Once a theory is developed, it becomes important to test the theory. This is where econometrics comes in.

Before we get too far into the weeds:

  • There are different names for exogenous variables. If you see any of the following terms, they are all referring to exogenous variables: “control variables”, “independent variables”, \(x\)-variables, regressors, covariates, explanatory variables.
  • Generally, endogenous variables will be called \(y\)-variables, outcome variables, or dependent variables.

Before jumping to data, economists come up with mathematical expressions to explore their theories. As a simple example, consider the law of demand. In words, we have two conditions:

  1. Quantity demanded is determined by price.
  2. Quantity demanded decreases when price increases.

To “mathematize” the first condition, we can write \(Q_d\) as a function of \(P\): \(Q_d = f(P)\). The second condition implies something about \(f(P)\). Specifically, it suggests that the first derivative should be negative. Mathematically: \(\frac{\partial Q_d}{\partial P} < 0\).

There are infinitely many functions one could choose from that would satisfy the two important conditions. However, to reduce a model’s complexity, economists will often “guess” particular functional forms for \(f()\). Perhaps the simplest function that relates \(Q_d\) and \(P\) in a way that satisfies the conditions is a linear relationship. Below is an example of a linear relationship between quantity demanded and price:

Plot

This line satisfies all of the theory of our law of demand. Here, price and quantity demanded are linked via this function, and the function has a negative slope.
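For concreteness, one such linear form can be written with arbitrary illustrative constants \(a\) and \(b\) (these are not values taken from the plot):

\[Q_d = a - bP, \qquad a, b > 0 \implies \frac{\partial Q_d}{\partial P} = -b < 0\]

Any choice of positive \(a\) and \(b\) satisfies both conditions of the law of demand.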

As a note, 99% of the time, the variable that determines another is placed along the \(x\)-axis, and the dependent variable is placed along the \(y\)-axis. However, Alfred Marshall popularized the convention of putting price on the \(y\)-axis even though price is usually thought of as the variable that determines quantity demanded. See stackexchange for a more detailed explanation if interested. This is one of the few times you will see the \(x\) variable on the \(y\)-axis.

It is important to note:

“All models are wrong, but some are useful.”

All models are going to be wrong because they cannot account for every possible scenario. For example: 2020’s pandemic, 2008’s financial crisis, and 2000’s dot-com bubble are all examples of generally unforeseeable circumstances that escaped economic models. However, some models provide valuable insight into how the world operates. We will use math and statistics to help us decide which models are at least useful.

Let’s suppose my theory is that students with higher GPAs in high school will score higher on their SAT. Now, we need a way to test this theory.

Let’s pick out our exogenous and endogenous variables. I contend that SAT scores are a function of, or are determined by, high school GPA. In reality, both of these outcomes are likely to be determined by things like natural ability, study habits, family income (i.e., tutoring, etc.), educational environment, etc. However, for now, let’s think about GPA as a proxy for all of these things, and SAT scores follow as a direct result of GPAs.

Another (and more realistic) way to think about this model is as a tool for guidance counselors to help their students understand what they can expect for their SAT scores given their GPA.

In this dataset, we observe SAT and GPA data for 1000 students (data, documentation) to examine the potential relationship.

As a first step, we should read in the data and take a look at it.
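As a minimal sketch, assuming the linked data have been saved locally as satgpa.csv (the file name and path are an assumption; point read.csv() at wherever you saved the file):

Code
# Read the data into a data.frame called "df" (file name is an assumption)
df <- read.csv("satgpa.csv")
# Peek at the first few rows and the structure of the data
head(df)
str(df)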

Note that the sat_sum variable is the sum of the student’s Math and Verbal percentiles. Therefore, the best student in the world would be in the 100th percentile for both, meaning sat_sum would be 200. A student at the median in each would have values of 50 and 50, making sat_sum equal to 100.

Next, let’s visualize the relationship between sat_sum and hs_gpa using R’s plot() function.
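A sketch of the basic scatterplot call, assuming the data are stored in df as above:

Code
# Exogenous variable (hs_gpa) on the x-axis, endogenous variable (sat_sum) on the y-axis
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     xlab = "HS GPA", ylab = "SAT Percentile Sum")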

An Aside

Indeed, this plot is pretty messy… The points are covering each other, everything is one solid color, and there’s hardly a discernible pattern. A first step you can take is to make the points more transparent to identify density.

To do this, we are going to install the scales package with the following: install.packages("scales"). Then, we are going to use library("scales") to load in the package. scales contains a function alpha() that accepts vectors of colors and opacities as arguments. Try playing around with colors and opacities below.

You do not need to use install.packages("scales") in this WebR chunk, but you do need to use it on your personal machine. Remember: you only need to install the package once!
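For reference, a sketch of the same scatterplot with semi-transparent points, assuming scales is already installed:

Code
library("scales")
# alpha() takes a color (or vector of colors) and an opacity between 0 and 1
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = alpha("dodgerblue", 0.2),
     xlab = "HS GPA", ylab = "SAT Percentile Sum")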

This is a good first step, as our plots now show a bit of that density I mentioned. However, there are still so many points (1,000 to be exact) that the overall relationship might be a bit obscured.

Notice how there are many points in these vertical lines. This is because GPA is a relatively discrete measure. In other words, there are 35 unique values2 for 1000 observations. We can take advantage of this by collapsing the data down into these 35 points. To do this, we are going to calculate an average SAT for each unique GPA value.

One way we can do this is with a loop!

First, we’ll extract the unique GPA values. Next, we’ll create an empty vector where we will put our SAT averages for each GPA value. Finally, we’ll calculate the mean and store it.

Solution
Code
# gather unique GPAs
gpaz <- unique(df$hs_gpa)
# Create empty vector
satz <- rep(NA, length(gpaz))
for(gpa in gpaz){
  
  # Take the average of the subsetted SATs.
  temp <- mean(df$sat_sum[df$hs_gpa == gpa])
  # Store "temp" into satz where gpaz == gpa
  satz[gpaz == gpa] <- temp
}
plot(gpaz, satz, las = 1, pch = 19,
     xlab = "HS GPA", ylab = "SAT Percentile Sum")
Plot

With this plot, one can start to pick out the relationship between GPA and SAT even if only visually. However, having to write a loop to generate these data is unnecessarily difficult. Luckily, R has a function called aggregate() that will do exactly what we wanted, but with much less typing.

aggregate() accepts a few main arguments:

  • x: The data you want to aggregate/collapse. In our case, sat_sum.
  • by: The level you want to aggregate/collapse to. In our case, hs_gpa.
  • FUN: The function to be applied to x for each unique by. In our case, mean().

Below is an example of how to use aggregate for the exact reason we used the previous loop.
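A sketch of one way to write that call:

Code
# Average sat_sum within each unique value of hs_gpa
agg <- aggregate(x = df$sat_sum,
                 by = list(hs_gpa = df$hs_gpa),
                 FUN = mean)
head(agg)
# Same picture as the loop-based version (the aggregated column is named "x" by default)
plot(agg$hs_gpa, agg$x, las = 1, pch = 19,
     xlab = "HS GPA", ylab = "SAT Percentile Sum")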

What makes aggregate() special is the ability to aggregate multiple variables and/or by multiple variables. As an example, let’s find the average sat_sum, sat_m, and sat_v by both hs_gpa and sex.

I am going to make an assumption about the sex variable. If sex is 1, it denotes a female.
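One way to write this, as a sketch:

Code
# Average the three SAT variables within each hs_gpa-by-sex combination
agg2 <- aggregate(x = df[, c("sat_sum", "sat_m", "sat_v")],
                  by = list(hs_gpa = df$hs_gpa, sex = df$sex),
                  FUN = mean)
head(agg2)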

With aggregate(), you can also use other functions like length() to calculate the number of observations that fall within each bucket. Or, you can use sd() to calculate standard deviations. If we calculate mean(), length(), and sd(), we can create confidence intervals. The following code will create confidence intervals for each GPA.
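A sketch of that calculation, combining the three aggregations into a 95% confidence interval for the mean SAT score at each GPA (groups with only one observation will return NA for the standard deviation):

Code
# Mean, standard deviation, and number of observations of sat_sum for each GPA
m <- aggregate(df$sat_sum, by = list(hs_gpa = df$hs_gpa), FUN = mean)
s <- aggregate(df$sat_sum, by = list(hs_gpa = df$hs_gpa), FUN = sd)
n <- aggregate(df$sat_sum, by = list(hs_gpa = df$hs_gpa), FUN = length)
ci <- data.frame(hs_gpa = m$hs_gpa, avg = m$x, sd = s$x, n = n$x)
# 95% confidence interval: mean +/- 1.96 standard errors
ci$lower <- ci$avg - 1.96 * ci$sd / sqrt(ci$n)
ci$upper <- ci$avg + 1.96 * ci$sd / sqrt(ci$n)
head(ci)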

Back to Economic Models

Now that we can visually see a relationship, we need to figure out how to quantify it. We are going to be introduced to something called covariance. This is a measure of how two variables move with one another. Let’s work up to the formula with some intuition.

We are going to compare each data point to its respective mean. In the following chunk, subtract off the mean from each variable. Then, plot the new data, and include vertical and horizontal lines at 0.

Solution
Code
df$gpa_0 <- df$hs_gpa - mean(df$hs_gpa)
df$sat_0 <- df$sat_sum - mean(df$sat_sum)
plot(df$gpa_0, df$sat_0, las = 1, pch = 19,
     col = alpha("dodgerblue", 0.2),
     xlab = "GPA - mean GPA", ylab = "SAT - mean SAT")
abline(h = 0, v = 0)
Plot

If we find that \(x_i - \bar{x} > 0\) and \(y_i - \bar{y} > 0\) (or \(x_i - \bar{x} < 0\) and \(y_i - \bar{y} < 0\)) at the same time more often than not, this is good evidence that the two variables are moving together. In words, if GPA is above its mean at the same time that SAT is above its mean (or below/below), these variables are moving together. See the figure below for an illustration.

For each point, we can tell if both values are positive (meaning both above their mean) or both values are negative (meaning both below their mean) by multiplying them together. Two positive numbers (or two negative numbers), when multiplied together, will produce a positive number. If one number is negative and the other is positive, then the product will be negative. To simplify this, we can use the sign() function to output +1, 0, or -1 depending on the sign of the input.

If the points are exactly evenly distributed, we would expect the average of sign(df$gpa_0) * sign(df$sat_0) to be equal to 0 (since the sum of -1s and +1s cancel out to zero). However, if the average is positive, this would suggest a positive relationship.

The mean of this outcome is indeed positive (0.304). Using the table() function, we can see that most of the values fall in the (-1, -1) or (+1, +1) groups:
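A sketch of that calculation, using the demeaned variables created above:

Code
# +1 when both deviations share a sign, -1 when they disagree, 0 if either sits exactly at its mean
same_side <- sign(df$gpa_0) * sign(df$sat_0)
mean(same_side)
# Cross-tabulate the signs of the two deviations
table(sign(df$gpa_0), sign(df$sat_0))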

We can also use the actual differences instead of their signs. Why might this be better than using sign()? First, there are some calculus reasons for this that we won’t discuss. Second, for intuition, this will weight observations that are farther from the mean a bit more than ones close to the mean.

The formula for this average looks like the following:

\[\frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})\times(y_i - \bar{y})\]

Take a second to stare at this… Doesn’t it look familiar? It would look exactly like the formula for variance if we replaced \(x_i\) with \(y_i\):

\[\frac{1}{n}\sum_{i = 1}^n (y_i - \bar{y})\times(y_i - \bar{y}) = \frac{1}{n}\sum_{i = 1}^n (y_i - \bar{y})^2\]

This formula (with \(x_i\)) is called covariance. A variable’s covariance with itself is equal to its own variance.

In both formulas, I am dividing by \(n\) for illustration purposes. However, the correct sample formulas use \(n-1\) in the denominator.

Covariance

What is the covariance of these two different variables?
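A sketch of the calculation, using the built-in cov() function and the formula written out by hand:

Code
# Built-in covariance (uses the n - 1 denominator)
cov(df$hs_gpa, df$sat_sum)
# The same calculation by hand, using the demeaned variables from before
sum(df$gpa_0 * df$sat_0) / (nrow(df) - 1)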

Unfortunately, there is not a very intuitive interpretation in words for what this number means. Rather, we mostly interpret the sign of the number. Positive means that the variables move together and negative means the variables move in opposite ways.

Let’s see if we can turn this number into something that is interpretable. Let’s work with the variance formula to start. Suppose we have data on some distance, and our variance is 36 miles2. To make this number more interpretable, we would take the square root, which would give us a standard deviation of 6 miles. Instead of taking the square root, we also could have divided the variance by the standard deviation which would have given us the same answer. For example: \(\sqrt{\sigma^2} = \frac{\sigma^2}{\sigma} = \sigma\). Let’s do the same thing for our GPA and SAT covariance.

Currently, we have 3.3337488 as our covariance, and the units are GPA units \(\times\) SAT units. To scale this, we can divide by the standard deviation of either variable. But which variable should we divide by? Why not both?

Going back to variance, if we divide variance (\(\sigma^2\)) by its standard deviation (\(\sigma\)) twice, we would get \(\frac{\sigma^2}{\sigma \times \sigma} = \frac{\sigma^2}{\sigma^2} = 1\). Remember, the covariance of any variable with itself is equal to its variance. There will be no other variable in the world with a stronger relationship to variable \(y\) than \(y\) itself! Therefore, a value close to 1 indicates that two variables are almost identical. On the other hand, imagine the covariance of \(y\) and \(-y\). In this case, when \(y\) is above its mean, \(-y\) must be below its mean. Calculating the covariance of \(y\) and \(-y\) will result in the same magnitude as the variance of \(y\), but the sign will be negative. Therefore, a value close to -1 indicates that two variables are nearly complete opposites.

So, +1 is the maximum value and indicates the strongest positive relationship possible.

Then, -1 is the minimum value and indicates the strongest negative relationship possible.

Let’s scale the covariance of hs_gpa and sat_sum similarly to how we scaled the variance of \(y\): divide by the standard deviation of both variables.

Before, we calculated \(\frac{cov(y, y)}{\sigma_y \times \sigma_y}\). Now, we’ll calculate \(\frac{cov(\text{GPA}, \text{SAT})}{\sigma_{\text{GPA}} \times \sigma_{\text{SAT}}}\).
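As a sketch:

Code
# Covariance scaled by both standard deviations
cov(df$hs_gpa, df$sat_sum) / (sd(df$hs_gpa) * sd(df$sat_sum))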

This number is a unitless measurement of the strength and direction of the relationship between two variables. This number is called the correlation coefficient, denoted rho (\(\rho\)).

\[ \rho_{x,y} = \frac{cov(x, y)}{\sigma_x \sigma_y}\]

Correlation

Below is a visualization of the previous scenarios (\(y\) vs \(y\), GPA vs SAT, \(y\) vs \(-y\)):

Code
par(mfrow = c(1, 3))
plot(df$hs_gpa, df$hs_gpa, xlab = "GPA", ylab = "GPA",
     main = c("Correlation:", round(cor(df$hs_gpa, df$hs_gpa), 3)),
     col = alpha("dodgerblue", 0.2), pch = 19)
abline(v = mean(df$hs_gpa), h = mean(df$hs_gpa))
plot(df$hs_gpa, df$sat_sum, xlab = "GPA", ylab = "SAT",
     main = c("Correlation:", round(cor(df$hs_gpa, df$sat_sum), 3)),
     col = alpha("mediumseagreen", 0.2), pch = 19)
abline(v = mean(df$hs_gpa), h = mean(df$sat_sum))
plot(df$hs_gpa, -df$hs_gpa, xlab = "GPA", ylab = "-GPA",
     main = c("Correlation:", round(cor(df$hs_gpa, -df$hs_gpa), 3)),
     col = alpha("tomato", 0.2), pch = 19)
abline(v = mean(df$hs_gpa), h = mean(-df$hs_gpa))
par(mfrow = c(1, 1))
Plot

The word “correlation” is one of the few terms that has managed to sneak out of statistics textbooks and find its way into ordinary, everyday vocabulary. I’m sure you’ve heard or said “this is correlated with that”, or “correlation does not mean causation”, etc. Correlation is a very specifically defined term, and hopefully, you now have a deeper understanding of what it really means.

If two variables (\(A\) and \(B\)) are correlated, it is because at least one of the following is true:

  1. \(A \rightarrow B\) (\(A\) causes \(B\))
  2. \(B \rightarrow A\)
  3. \(C \rightarrow A\) and \(C \rightarrow B\) (some third factor causes both)
  4. Sometimes, things are correlated by random chance.

Each time you find a correlation between two variables, you should think deeply about all four of these possible reasons.

In our example about GPAs and SAT scores, it is likely that there is a third factor, or quite a few other factors, that determine both someone’s GPA and SAT score. Therefore, the correlation we identified should not be interpreted as causal.

Take a look at this website of spurious correlations for examples of things that are correlated due to chance.

If we are after causation, what is the best way to rule out all of the ways we could find a correlation between \(A\) and \(B\)? Randomization!

Suppose we wanted to know if a new sunscreen prevented sunburn. The best way to answer this question would be to take a group of people and randomly split them into two groups. The first group gets the sunscreen and the other group gets lotion.

Randomization would:

  • Remove the possibility of reverse causality (Story 2). In other words, without randomization, maybe people only use sunscreen once they get a sunburn. This would make it look like sunscreen is causing sunburns, even though it’s the other way around.
  • Imply the treatment and control group differ only by treatment, which rules out the story of a third factor causing the correlation (Story 3). Without randomization, since younger people are generally more risky, they are less likely to use sunscreen. Additionally, younger people are also more likely to spend time outside which leads to more sunburns. Therefore, we would find that sunscreen leads to less sunburn, but this would be due to age influencing both variables.
  • Make it very unlikely that any correlation is driven by statistical chance, provided the sample size is large enough (Story 4).

Calculating correlations in R can be done with the cor() function. This outputs a correlation coefficient when you enter two vectors. Moreover, you can feed cor() a data.frame, and it will calculate a correlation matrix. The matrix contains the correlation coefficients calculated from every combination of variables.
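For example, a short sketch (column names follow the dataset’s documentation):

Code
# Correlation between two vectors
cor(df$hs_gpa, df$sat_sum)
# Correlation matrix for every pair of columns in a data.frame
cor(df[, c("hs_gpa", "sat_v", "sat_m", "sat_sum")])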

Footnotes

  1. This is not to say that increases in quantity demanded can influence price.↩︎

  2. I calculated this via: length(unique(df$hs_gpa))↩︎