One Y Variable

Module 2.2: Dispersion

Alex Cardazzi

Old Dominion University

Dispersion

The middle of some data is an important summary point. However, a second important question is: how much does the data vary? Are the observations close to the mean, or are they far from it? A measure like this would start to address the issue of outliers, while also providing additional information that could be helpful.

This idea is called dispersion. The most common measures of dispersion are variance and standard deviation.
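To see why a measure of the middle is not enough on its own, consider that two vectors can share the same mean while differing wildly in spread. A quick illustration (the two vectors here are made up for this example):

```r
# Both vectors have a mean of 10, but the second is far more spread out
close_to_mean <- c(9, 10, 11, 10, 9, 11)
far_from_mean <- c(0, 20, 5, 15, 1, 19)
mean(close_to_mean)  # 10
mean(far_from_mean)  # 10
```

A summary based only on the mean cannot tell these two vectors apart; a measure of dispersion can.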

Building an Equation

Let’s try to build an equation for dispersion that we can use. Remember: we are interested in quantifying how far away observations are from the mean (\(\mu\) or \(\bar{y}\)).

Since we cannot observe \(\mu\) (because it is a population parameter), we have to use \(\bar{y}\) as our measure of the mean. So, to calculate distance from the mean, we can simply use subtraction.

\[y_i - \bar{y}\]

Note: I am using a subscript \(i\) to denote the \(i\)th observation. In other words, if there are \(n\) observations, we will calculate:

\[y_1 - \bar{y}, \ y_2 - \bar{y}, \ ..., \ y_i - \bar{y}, \ ..., \ y_n - \bar{y}\]
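In R, all \(n\) of these distances can be computed with a single vectorized subtraction. For instance, with a small made-up vector:

```r
y <- c(2, 4, 9)   # made-up example data
y_bar <- mean(y)  # the sample mean, 5
y - y_bar         # each observation's distance from the mean: -3 -1 4
```

R "recycles" the single number `y_bar` across the whole vector `y`, so no loop is needed.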

Building an Equation

If we have the distance of each observation from the mean, why don’t we try taking the mean of these distances? This would give us, in words, the average distance from the mean. In math:

\[\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})\]

To test our measure, let’s use some example data from the notes on Central Tendency. Suppose we have two sets of salary data for UNC Geography majors: one from 1985 and one from 1986 (Michael Jordan’s graduating class). If our measure makes sense, we should expect the 1985 mean to be less than the 1986 mean, and the 1985 dispersion to be smaller as well, since the 1986 data contains one massive outlier. Here are the two datasets:

Code
salaries85 <- c(32000, 33000, 36000, 33000, 32000, 35000)
salaries86 <- c(31000, 34000, 32000, 34000, 33000, 20000000)

Building an Equation

Let’s calculate the two sample means and save them.

Code
mean85 <- mean(salaries85)
mean86 <- mean(salaries86)
cat("Average Salary in 1985:", mean85, "\n")
cat("Average Salary in 1986:", mean86)
Output
Average Salary in 1985: 33500 
Average Salary in 1986: 3360667

Now, we need to code up our formula for dispersion and test it with both sets of data.

Code
# add up the differences
numerator <- sum(salaries85 - mean85)
denominator <- length(salaries85)
numerator / denominator
Output
[1] 0
Code
# add up the differences
numerator <- sum(salaries86 - mean86)
denominator <- length(salaries86)
numerator / denominator
Output
[1] 0

Building an Equation

Not only are the two numbers the same, but they’re both zero! There must be a mistake in either the math or the code. Hint: the mistake is not in the code. To track it down, let’s generate more output as we go this time.

Code
# Calculate the differences
numerator <- salaries85 - mean85
cat("Numbers to Add:", numerator, "\n")
cat("Sum of Numbers to Add:", sum(numerator))
Output
Numbers to Add: -1500 -500 2500 -500 -1500 1500 
Sum of Numbers to Add: 0

Since some of the values are above the mean and others are below, the sum of the positive and negative numbers nets out to zero. Let’s see if this is true in the algebra.

Building an Equation

\[\begin{aligned}\sum_{i=1}^{n} (y_i - \bar{y}) &= (y_1 - \bar{y}) + (y_2 - \bar{y}) + ... + (y_n - \bar{y}) \\ &= (y_1 + ... + y_n) - (\bar{y} + ... + \bar{y}) \\ &= \sum{y_i} - (n \times \bar{y}) \\ &= \sum{y_i} - \left(n \times \frac{1}{n}\sum{y_i}\right) \\ &= \sum{y_i} - \sum{y_i} \\ &= 0\end{aligned}\]

Building an Equation

The conclusion to both the math and the code is that, no matter what, \(\sum_{i=1}^n(y_i - \bar{y})\) is always equal to 0.

So, it seems like we need another equation. What if we tried to transform \(y_i - \bar{y}\) so that it was always a positive number? We can use a square term! So, instead of \(y_i - \bar{y}\), we can use \((y_i - \bar{y})^2\). Then, our formula would look like:

\[\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2\]

Building an Equation

Let’s try to build this equation in R and test it on our data.

Code
# add up the squared differences
numerator <- sum((salaries85 - mean85)^2)
denominator <- length(salaries85)
numerator / denominator
Output
[1] 2250000
Code
# add up the squared differences
numerator <- sum((salaries86 - mean86)^2)
denominator <- length(salaries86)
numerator / denominator
Output
[1] 5.537348e+13

These numbers are huge! They are not interpretable as is, and that is because we squared each distance. In words, this quantity is the average squared distance from the mean. This is what statisticians call variance, which is often labeled \(\sigma^2\).

Building an Equation

To get a more interpretable number, we have to “undo” the squaring. We can use the square root function to do this. Taking the square root of the variance gives the standard deviation, often labeled \(\sigma\).

\[\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}\]

And in R:

Code
# add up the squared differences
numerator <- sum((salaries85 - mean85)^2)
denominator <- length(salaries85)
sqrt(numerator / denominator)
Output
[1] 1500
Code
# add up the squared differences
numerator <- sum((salaries86 - mean86)^2)
denominator <- length(salaries86)
sqrt(numerator / denominator)
Output
[1] 7441336

Building an Equation

The interpretation of this number is, roughly, the typical distance of an observation from the mean. In other words, suppose we randomly picked an element out of a vector. On average, the element we pick will be about \(\sigma\) units from the mean. (Strictly speaking, \(\sigma\) is the square root of the average squared distance, which weights large deviations more heavily than an ordinary average would.)
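We can check this interpretation against the 1985 data: the literal average absolute distance from the mean comes out close to, but not exactly equal to, the standard deviation of 1500, because squaring emphasizes the larger deviations.

```r
salaries85 <- c(32000, 33000, 36000, 33000, 32000, 35000)
mean85 <- mean(salaries85)
# average absolute distance from the mean, for comparison with sigma = 1500
mean(abs(salaries85 - mean85))  # 1333.333
```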

Building an Equation

Of course, R has built-in functions for variance (var()) and standard deviation (sd()). There is a slight difference between what we coded and what R will output: the denominator R uses is \(n-1\) instead of \(n\). The reason for this is beyond the scope of this course. However, for completeness and comparison:

Code
numerator <- sum((salaries85 - mean85)^2)
denominator <- length(salaries85)
cat("St. Dev. (with n):", sqrt(numerator / denominator), "\n")
cat("St. Dev. (with n-1):", sqrt(numerator / (denominator-1)), "\n")
cat("St. Dev. (R Default):", sd(salaries85))
Output
St. Dev. (with n): 1500 
St. Dev. (with n-1): 1643.168 
St. Dev. (R Default): 1643.168
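The same comparison holds for the variance: our by-hand formula divides by \(n\), while R’s var() divides by \(n-1\).

```r
salaries85 <- c(32000, 33000, 36000, 33000, 32000, 35000)
mean85 <- mean(salaries85)
numerator <- sum((salaries85 - mean85)^2)
n <- length(salaries85)
cat("Variance (with n):", numerator / n, "\n")
cat("Variance (with n-1):", numerator / (n - 1), "\n")
cat("Variance (R Default):", var(salaries85))
```

As with the standard deviation, R’s default matches the \(n-1\) version.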

Building an Equation

As a final note, population variance is usually denoted by the Greek letter sigma, squared: \(\sigma^2\). Sample variance is denoted by \(s^2\). Accordingly, the population and sample standard deviations are generally written \(\sigma\) and \(s\), respectively.