Module 2.2: Dispersion
Old Dominion University
The middle of some data is an important summary point. However, a second important question is: how much does the data vary? Are the observations in the data close to the mean, or are they far away? A measure like this would start to address the issue with outliers, and it would also provide additional, helpful information.
This idea is called dispersion. Common measures of dispersion are the variance and the standard deviation.
Let’s try to build an equation for dispersion that we can use. Remember: we are interested in quantifying how far away observations are from the mean (\(\mu\) or \(\bar{y}\)).
Since we cannot observe \(\mu\) (because it is a population parameter), we have to use \(\bar{y}\) as our measure of the mean. So, to calculate distance from the mean, we can simply use subtraction.
\[y_i - \bar{y}\]
Note: I am using a subscript \(i\) to denote the \(i\)th observation. In other words, if there are \(n\) observations, we will calculate:
\[y_1 - \bar{y}, \ y_2 - \bar{y}, \ ..., \ y_i - \bar{y}, \ ..., \ y_n - \bar{y}\]
If we have the distance of each observation from the mean, why don’t we try taking the mean of this? This would give us, in words, an average distance from the mean. In math:
\[\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})\]
To test our measure, let’s use some example data from the notes on Central Tendency: the salaries of UNC Geography majors from 1985 and from 1986 (Michael Jordan’s graduating class). If our measure makes sense, we should expect the 1985 mean to be less than the 1986 mean, and the 1985 dispersion to be lower than the 1986 dispersion as well. Here are the two datasets:
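The original vectors are not reproduced in this copy of the notes, so here is a minimal sketch with hypothetical values. The names `salary_1985` and `salary_1986` and every number below are made up for illustration; the 1986 vector ends with one very large value standing in for Jordan’s salary.

```r
# Hypothetical annual salaries (in dollars) for UNC Geography majors.
# These values are illustrative, NOT the original data from the notes.
salary_1985 <- c(22000, 24000, 25000, 21000, 26000, 23000)
salary_1986 <- c(23000, 25000, 26000, 22000, 27000, 845000) # last value: a Jordan-sized outlier
```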
Let’s calculate the two sample means and save them.
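A sketch, using the hypothetical vectors above:

```r
# Sample means, saved for later use
ybar_1985 <- mean(salary_1985)
ybar_1986 <- mean(salary_1986)
ybar_1985  # modest average
ybar_1986  # pulled far upward by the outlier
```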
Now, we need to code up our formula for dispersion and test it with both sets of data.
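A minimal sketch of that formula, applied to the hypothetical vectors above (`dispersion_v1` is just an illustrative name):

```r
# First attempt at dispersion: the average deviation from the mean,
# (1/n) * sum(y_i - ybar)
dispersion_v1 <- function(y) {
  mean(y - mean(y))
}
dispersion_v1(salary_1985)
dispersion_v1(salary_1986)
```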
Not only are the two numbers the same, they’re both zero! There must be a mistake either in the math or in the code. Hint: the mistake is not in the code. To see what went wrong, let’s generate more output as we go this time.
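One way that extra output might look, again as a sketch with an illustrative function name:

```r
# Same calculation, but print each deviation along the way
dispersion_verbose <- function(y) {
  deviations <- y - mean(y)
  print(deviations)  # some positive, some negative
  mean(deviations)
}
dispersion_verbose(salary_1985)
```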
Since some of the values are above the mean and others are below, the sum of the positive and negative numbers nets out to zero. Let’s see if this is true in the algebra.
\[\begin{aligned}\sum_{i=1}^{n} (y_i - \bar{y}) &= (y_1 - \bar{y}) + (y_2 - \bar{y}) + ... + (y_n - \bar{y}) \\ &= (y_1 + ... + y_n) - (\bar{y} + ... + \bar{y}) \\ &= \sum_{i=1}^{n} y_i - (n \times \bar{y}) \\ &= \sum_{i=1}^{n} y_i - \left(n \times \frac{1}{n}\sum_{i=1}^{n} y_i\right) \\ &= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} y_i \\ &= 0\end{aligned}\]
The conclusion from both the math and the code is that, no matter what, \(\sum_{i=1}^n(y_i - \bar{y})\) is always equal to 0.
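We can confirm this numerically as well. (In practice, R may print a tiny number like 1e-12 rather than an exact 0, because of floating-point rounding.)

```r
sum(salary_1985 - mean(salary_1985))  # 0, up to floating-point error
sum(salary_1986 - mean(salary_1986))
```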
So, it seems like we need another equation. What if we transformed \(y_i - \bar{y}\) so that it was never negative? We can square it! So, instead of \(y_i - \bar{y}\), we will use \((y_i - \bar{y})^2\). Then our formula looks like:
\[\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2\]
Let’s try to build this equation in R and test it on our data.
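A sketch of the squared version, reusing the hypothetical data (`dispersion_v2` is again an illustrative name):

```r
# Average squared deviation from the mean:
# (1/n) * sum((y_i - ybar)^2)
dispersion_v2 <- function(y) {
  mean((y - mean(y))^2)
}
dispersion_v2(salary_1985)
dispersion_v2(salary_1986)  # far larger, thanks to the outlier
```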
These numbers are huge! They are not interpretable as is, because we squared each distance. In words, this is the average squared distance from the mean. This is what statisticians call variance, which is often labeled \(\sigma^2\).
To get a more interpretable number, we have to “undo” the squaring. We can use the square root function to do this. Taking the square root of the variance gives the standard deviation, often labeled \(\sigma\).
\[\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}\]
And in R:
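Again as a sketch, taking the square root of the variance calculation (`st_dev_n` is an illustrative name):

```r
# Square root of the average squared deviation
st_dev_n <- function(y) {
  sqrt(mean((y - mean(y))^2))
}
st_dev_n(salary_1985)
st_dev_n(salary_1986)
```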
The interpretation of this number is, loosely, the average distance from the mean. In other words, suppose we randomly pick an element out of a vector; on average, the element we pick will be about \(\sigma\) units from the mean.
Of course, R has built-in functions for variance (`var()`) and standard deviation (`sd()`). There is a slight difference between what we coded and what R will output: the denominator R uses is \(n-1\) instead of just \(n\), but the reason for this is beyond the scope of this course. However, for completeness and comparison:
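A sketch of that comparison. (The printed values below come from the notes’ original data, so running this on the hypothetical vectors above will give different numbers.)

```r
y <- salary_1985
n <- length(y)
sqrt(sum((y - mean(y))^2) / n)        # our formula: divide by n
sqrt(sum((y - mean(y))^2) / (n - 1))  # divide by n - 1 instead
sd(y)                                 # R's default, which uses n - 1
```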
St. Dev. (with n): 1500
St. Dev. (with n-1): 1643.168
St. Dev. (R Default): 1643.168
As a final note, population variance is usually denoted by the Greek letter sigma (squared), or \(\sigma^2\). Sample variance is denoted by \(s^2\). Therefore, population and sample standard deviations are generally \(\sigma\) and \(s\).
ECON 311: Economics, Causality, and Analytics