Module 2.2: Dispersion
All materials can be found at alexcardazzi.github.io.
The middle of some data is an important summary point. However, a second important measure is: how much does the data vary? Are the observations in the data close to the mean, or are they far? A measure like this would start to address the issue with outliers, but also provide additional information that could be helpful.
This idea is called dispersion. Two common measures of dispersion are variance and standard deviation.
Let’s try to build an equation for dispersion that we can use. Remember: we are interested in quantifying how far away observations are from the mean (\(\mu\) or \(\bar{y}\)).
Since we cannot observe \(\mu\) (because it is a population parameter), we have to use \(\bar{y}\) as our measure of the mean. So, to calculate distance from the mean, we can simply use subtraction.
\[y_i - \bar{y}\]
Note: I am using a subscript \(i\) to denote the \(i\)th observation. In other words, if there are \(n\) observations, we will calculate:
\[y_1 - \bar{y}, \ y_2 - \bar{y}, \ ..., \ y_i - \bar{y}, \ ..., \ y_n - \bar{y}\]
If we have the distance of each observation from the mean, why don’t we try taking the mean of this? This would give us, in words, an average distance from the mean. In math:
\[\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})\]
To test our measure, let’s use some example data from the notes on Central Tendency. Suppose we have two sets of salary data from UNC Geography majors: one from 1985 and one from 1986 (Michael Jordan’s graduating class). If our measure makes sense, we should expect the 1985 mean to be less than the 1986 mean, and the 1985 dispersion to be lower as well. Here are the two datasets:
salaries85 <- c(32000, 33000, 36000, 33000, 32000, 35000)
salaries86 <- c(31000, 34000, 32000, 34000, 33000, 20000000)
Let’s calculate the two sample means and save them.
mean85 <- mean(salaries85)
mean86 <- mean(salaries86)
cat("Average Salary in 1985:", mean85, "\n")
cat("Average Salary in 1986:", mean86)
Average Salary in 1985: 33500
Average Salary in 1986: 3360667
Now, we need to code up our formula for dispersion and test it with both sets of data.
# add up the differences
numerator <- sum(salaries85 - mean85)
denominator <- length(salaries85)
numerator / denominator
[1] 0
# add up the differences
numerator <- sum(salaries86 - mean86)
denominator <- length(salaries86)
numerator / denominator
[1] 0
Not only are the two numbers the same, but they’re zero! There is a mistake either in the math or in the code. Hint: the mistake is not in the code. Still, let’s generate more output as we go this time to see what is happening.
# Calculate the differences
numerator <- salaries85 - mean85
cat("Numbers to Add:", numerator, "\n")
cat("Sum of Numbers to Add:", sum(numerator))
Numbers to Add: -1500 -500 2500 -500 -1500 1500
Sum of Numbers to Add: 0
Since some of the values are above the mean and others are below, the sum of the positive and negative numbers nets out to zero. Let’s see if this is true in the algebra.
Note: I am dropping the \(\frac{1}{n}\) because the issue is clearly in the numerator, not the denominator.
\[\begin{aligned}\sum_{i=1}^{n} (y_i - \bar{y}) &= (y_1 - \bar{y}) + (y_2 - \bar{y}) + ... + (y_n - \bar{y}) \\ &=(y_1 + ... + y_n) - (\bar{y} + ... + \bar{y})\\ &= \sum{y_i} - (n \times \bar{y})\\ &= \sum{y_i} - (n \times \frac{1}{n}\sum{y_i}) \\ &= \sum{y_i} - (\sum{y_i}) \\ &= 0\end{aligned}\]
The conclusion from both the math and the code is that, no matter what the data are, \(\sum_{i=1}^n(y_i - \bar{y})\) is always equal to 0.
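As a quick sanity check on this result, here is a minimal sketch that repeats the calculation on made-up data. The vector `y` is hypothetical (random draws, not anything from these notes), and the tolerance of `1e-6` is an arbitrary choice. Because computers store decimals with limited precision, the sum may come out as a tiny number rather than exactly 0.

```r
# Hypothetical example data: 1,000 random draws (any numeric vector would do)
set.seed(42)
y <- rnorm(1000, mean = 50000, sd = 10000)

# Distance of each observation from the sample mean
deviations <- y - mean(y)

# The positive and negative deviations cancel out,
# so the sum is zero up to floating-point rounding error
total <- sum(deviations)
cat("Sum of deviations:", total, "\n")
cat("Effectively zero?", abs(total) < 1e-6, "\n")
```

Changing the seed, the sample size, or the distribution should not change the conclusion, which is what the algebra predicts.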
So, it seems like we need another equation. What if we tried to transform \(y_i - \bar{y}\) so that it was never negative? We can use a square term! So, instead of \(y_i - \bar{y}\), we can use \((y_i - \bar{y})^2\). Then, our formula would look like:
\[\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2\]
You might have thought about absolute values here. This is a good thought, and I am hoping you arrived at this. Unfortunately, absolute value is a less convenient function mathematically (for example, it is not differentiable at zero). We won’t get into it here, but I thought I should note it.
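For readers curious about the absolute-value alternative mentioned in the note, here is a minimal sketch of what it might look like. The helper name `mean_abs_dev` is made up for illustration and is not a base R function.

```r
salaries85 <- c(32000, 33000, 36000, 33000, 32000, 35000)
salaries86 <- c(31000, 34000, 32000, 34000, 33000, 20000000)

# Hypothetical helper: the average absolute distance from the sample mean
mean_abs_dev <- function(y) {
  mean(abs(y - mean(y)))
}

cat("Mean Abs. Deviation in 1985:", mean_abs_dev(salaries85), "\n")
cat("Mean Abs. Deviation in 1986:", mean_abs_dev(salaries86), "\n")
```

Like the squared version, this measure does not cancel out to zero, and it is much larger for 1986 than for 1985. The rest of these notes stay with the squared formula.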
Let’s try to build this equation in R and test it on our data.
# add up the squared differences
numerator <- sum((salaries85 - mean85)^2)
denominator <- length(salaries85)
numerator / denominator
[1] 2250000
# add up the squared differences
numerator <- sum((salaries86 - mean86)^2)
denominator <- length(salaries86)
numerator / denominator
[1] 5.537348e+13
These numbers are huge! These are not interpretable as is, and it’s because we squared each distance. In words, this is the average squared distance from the mean. This is what statisticians call variance, which is often labeled \(\sigma^2\).
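To see where the 1985 figure comes from, the arithmetic can be written out by hand using the six differences from the mean computed earlier (-1500, -500, 2500, -500, -1500, 1500):

\[\begin{aligned}\frac{1}{6}\sum_{i=1}^{6}(y_i - \bar{y})^2 &= \frac{(-1500)^2 + (-500)^2 + (2500)^2 + (-500)^2 + (-1500)^2 + (1500)^2}{6} \\ &= \frac{2250000 + 250000 + 6250000 + 250000 + 2250000 + 2250000}{6} \\ &= \frac{13500000}{6} \\ &= 2250000\end{aligned}\]

Since each difference is measured in dollars, each squared difference is measured in dollars squared, which is part of why the result is hard to read as a salary figure.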
To get a more interpretable number, we have to “undo” the squared term. We can use the square root function to do this. Taking the square root of variance gives standard deviation, often labeled \(\sigma\).
\[\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}\]
And in R:
# add up the squared differences, then take the square root of their average
numerator <- sum((salaries85 - mean85)^2)
denominator <- length(salaries85)
sqrt(numerator / denominator)
[1] 1500
# add up the squared differences, then take the square root of their average
numerator <- sum((salaries86 - mean86)^2)
denominator <- length(salaries86)
sqrt(numerator / denominator)
[1] 7441336
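The interpretation of this number is, roughly, the typical distance from the mean. In other words, suppose we randomly picked an element out of a vector. The element we pick will typically be about \(\sigma\) units from the mean. Strictly speaking, \(\sigma\) is the square root of the average squared distance, so large deviations count for more than they would in a simple average of distances.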
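The steps above can be bundled into small reusable functions. This is only a sketch: the names `pop_var` and `pop_sd` are hypothetical (they are not part of base R), and both divide by \(n\), matching the formulas in these notes.

```r
salaries85 <- c(32000, 33000, 36000, 33000, 32000, 35000)
salaries86 <- c(31000, 34000, 32000, 34000, 33000, 20000000)

# Hypothetical helper: average squared distance from the mean (divides by n)
pop_var <- function(y) {
  mean((y - mean(y))^2)
}

# Hypothetical helper: square root of the variance above
pop_sd <- function(y) {
  sqrt(pop_var(y))
}

cat("Variance in 1985:", pop_var(salaries85), "\n")
cat("St. Dev. in 1985:", pop_sd(salaries85), "\n")
cat("St. Dev. in 1986:", pop_sd(salaries86), "\n")
```

Using `mean()` on the squared differences is the same as summing them and dividing by `length(y)`, so these helpers should reproduce the numbers computed step by step above.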
Of course, R has built-in functions for variance (var()) and standard deviation (sd()). There is a slight difference between what we coded and what R will output. The denominator R uses is \(n-1\) instead of just \(n\), but the reason for this is beyond the scope of this course. However, for completeness and comparison:
numerator <- sum((salaries85 - mean85)^2)
denominator <- length(salaries85)
cat("St. Dev. (with n):", sqrt(numerator / denominator), "\n")
cat("St. Dev. (with n-1):", sqrt(numerator / (denominator-1)), "\n")
cat("St. Dev. (R Default):", sd(salaries85))
St. Dev. (with n): 1500
St. Dev. (with n-1): 1643.168
St. Dev. (R Default): 1643.168
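Because the two versions differ only in the denominator, one can be converted into the other by rescaling. The sketch below assumes the 1985 data from above; the variable names `var_n` and `sd_n` are made up for illustration.

```r
salaries85 <- c(32000, 33000, 36000, 33000, 32000, 35000)
n <- length(salaries85)

# var() and sd() divide by n - 1.
# Multiplying the variance by (n - 1) / n gives the divide-by-n version;
# the standard deviation is rescaled by the square root of that factor.
var_n <- var(salaries85) * (n - 1) / n
sd_n <- sd(salaries85) * sqrt((n - 1) / n)

cat("Variance (R default, n-1):", var(salaries85), "\n")
cat("Variance (rescaled to n):", var_n, "\n")
cat("St. Dev. (rescaled to n):", sd_n, "\n")
```

The rescaling factor \((n-1)/n\) gets closer to 1 as \(n\) grows, so the two versions differ noticeably only in small samples like this one.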
As a final note, population variance is usually denoted by the Greek letter sigma (squared), or \(\sigma^2\). Sample variance is denoted by \(s^2\). Therefore, population and sample standard deviations are generally denoted \(\sigma\) and \(s\), respectively.