One Y Variable

Module 2.3: Distributions

Author

Affiliation

Alex Cardazzi

Old Dominion University

All materials can be found at alexcardazzi.github.io.

Distributions

So far, we have characterized certain “moments” of data (i.e., mean and variance).

While these two numbers are helpful in understanding the data, they do not tell the full story. As an example, let’s consider the following data:

Code

df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/anscombe.csv")
cat("Summary Statistics for x1. Mean:", mean(df$x1), "; Variance:", var(df$x1), "\n")
cat("Summary Statistics for x4. Mean:", mean(df$x4), "; Variance:", var(df$x4))

Output

Summary Statistics for x1. Mean: 9 ; Variance: 11 
Summary Statistics for x4. Mean: 9 ; Variance: 11

Both of these columns have the same average and variance (and thus standard deviation), so you might think that they must in fact look like similar. Let’s take a look at the data and see for ourselves:

Code

df[,c("x1", "x4")]

Output

Clearly, even those data have the same mean and variance, they have very different distributions. The first column had many values ranging from 4 to 14, but the second column had one 19 and the rest 8s. Note that this is the inverse of the case of Michael Jordan and UNC. There, the distributions were presumably very similar with the exception of Michael Jordan’s single outlier salary.

Before we can keep talking about distributions, though, we need to discuss probability briefly. Put a pin in this idea of distributions for a moment.

Probability

A value between 0 and 1 inclusive that represents the likelihood a particular event happens.

The probability that a coin is flipped and it lands on tails is \(\frac{1}{2}\) or \(0.5\) or \(50\%\). What about the probabilities and outcomes of flipping two coins and counting the number of tails? Well, since flipping this coin is a random variable, it has a distribution. That distribution is defined by all possible outcomes and the associated probabilities:

Percentages are not really probabilities, but they are often associated with them.

Probability Distribution of Two Coin Flips
Number of Tails, \(y\)	Probability, \(P(y)\)
0	0.25
1	0.50
2	0.25

All distributions have a mean and a variance. So, how do we find the mean and variance of a probability distribution?

In fact, not all distributions have defined means and/or variances, but these types of distributions are outside the scope of this course.

\[ \mu = \sum_{i = 1}^g P(y_i) \times y_i \]

\[ \sigma^2 = \sum_{i = 1}^g P(y_i) \times (y_i - \mu)^2 \]

These formulas should look a lot like formulas for weighted averages!

To practice, let’s use the previous distribution of the number of heads from flipping two coins.

\[ \mu = \sum_{i = 1}^g P(y_i) \times y_i = (0.25\times0) + (0.5\times1) + (0.25\times2) = 1 \]

\[ \begin{aligned}\sigma^2 &= \sum_{i = 1}^g P(y_i) \times (y_i - \mu)^2 \\ &= (0.25\times(0-1)^2) + (0.5\times(1-1)^2) + (0.25\times(2-1)^2) \\ &= (0.25\times(-1)^2) + (0.5\times(0)^2) + (0.25\times(1)^2) \\ &= 0.25 + 0 + 0.25 = 0.5\end{aligned} \]

Flipping coins is an example of a discrete distribution. In other words, individual events have non-zero probabilities of occurring. However, when distributions are continuous, the probability of any individual event occurring is precisely zero. What? How can that be true?

Consider the lifespan of a tire on a car. Suppose the average tire lasts about 54,000 miles. However, what is the probability that your tire lasts exactly 54,000 miles? Essentially 0%.

This is because your tire could last 54,000.1 miles, 54,000.01 miles, and so on. There are an infinite number of miles that your tire could last, so the probability of any specific mile being the one where your tire goes flat is zero.

When dealing with continuous probabilities, instead of asking for the \(P(y) = M\) (e.g., the probability that your tire lasts exactly \(M\) miles), people ask for the \(P(y) < M\), \(P(y) > m\), or \(m < P(y) < M\) (e.g., the probability that your tire lasts more or less than some number of miles, or the probability that your tire lasts between two numbers of miles.)

In the interest of being concrete, suppose tires last anywhere between 40,000 and 68,000 miles. Suppose further that the distribution is uniform, meaning that the probability of your tire going flat is the same from 40,000 to 68,000 miles. A question from a statistics class might be: “what is the probability that your tire lasts more than 60,000 miles?”

To answer this, you need to know what fraction of tires last above 60,000. Well, since all tires go flat between 40 and 68, we can say the baseline is 68-40 = 28. Then, 68-60 is the range we’re interested in. So, the probability would be \(\frac{68-60}{68-40} = 0.286\).

The Normal Distribution

The above math highlights how you would deal with probabilities and continuous distributions. However, not all continuous distributions are “uniform” like we assumed above. Perhaps the most famous and widely studied distribution is the normal distribution.

The normal distribution:

is bell-shaped with it’s peak in the center
is symmetric, meaning the mean is the same as the median, and it asymptotically approaches the x-axis (meaning it gets close but never touches it!)
is completely characterized by its mean and standard deviation, and is written like \(N(\mu, \sigma)\)

Below is an example of what a normal distribution looks like:

Plot

Example of a Normal Distribution

Normal distributions take on different shapes depending on their means and standard deviations.

A increasing the mean, while holding the standard deviation constant, results in the distribution sliding horizontally to the right along the x-axis.
An increase in the standard deviation, while holding the mean constant, results in a shorter, wider distribution.

Below are some examples of what these changes would do to a standard normal distribution:

Plot

Example of Multiple Normal Distributions

Normal Distribution Application

To learn the most incredible feature of the normal distribution, you’ll have to wait for later in the module. However, while we wait, let’s discuss how to compare two different normal distributions.

Two particularly famous distributions that are normally distributed are scores on the SAT and ACT. The maximum score for each is 1600 and 36, respectively. The SAT has an average score of 1050 with a standard deviation of 200. The ACT has an average score of 19 with a standard deviation of 6.

So, let’s say someone scores a 1,300 on the SAT and a 29 on the ACT. Which test did they do better on?

Well, let’s first visualize the two distributions and the two scores.

Plot

Unfortunately, this visualization is not so helpful. Since the two distributions have different means and standard deviations, we cannot really compare them. A first step to getting a better comparison would be to subtract off the mean from both distributions. Then, both distributions will have a mean of zero.

Plot

Still, this is not helpful. Now, the variances are distorting the comparison. To fix this, we can divide each by their respective variances. This way, both distributions have a mean of zero and a variance of 1.

Plot

Now, it should be quite obvious which score was “better”. The SAT score currently has a value of 1.25 while the ACT score has a value of 1.6666667.

What do these numbers mean, though? In words, this is the number of standard deviations away from the mean the original score is. This actually has a name, and it is a \(z\)-score. The formula for a \(z\)-score is:

\[ z = \frac{y - \mu}{\sigma} \]

\(z\)-scores come from what is called a standard normal distribution, which is just a normal distribution with a mean of 0 and standard deviation (and variance) of 1, or \(N(0, 1)\). We know a lot about the standard normal distribution. For example, we know the probability that a random draw from a normal distribution is greater than some number, less than some number, or between two numbers. In most statistics courses, this is where you’d be shown a \(z\)-score table. Rather than spending time on this table, here are some takeaways:

Since the distribution is symmetric, the probability of a random draw being larger (or smaller) than 0 is \(\frac{1}{2}\).
There is a 15.87% chance that a random draw is larger (smaller) than +1 (-1).
There is a 2.28% chance that a random draw is larger (smaller) than +2 (-2).
- This means that there is a 95.44% chance that the random draw is between -2 and +2.

Let me add some words to all of this math. Remember, if we picked a random element out of a vector, we would expect that element to be \(\sigma\) units away from the mean. What is the \(z\)-score doing? It is scaling the distance from some value \(y\) to the mean \(\mu\) by a measure of the average distance from the mean, \(\sigma\). Therefore, values of \(z\) that are greater in absolute value than 1 indicate that \(y\) is farther than the average element, and values less then one indicate \(y\) is closer than average. Being more than two times the average distance from the mean, for example, is pretty rare – there is only a 2.28% chance. A first important feature of \(z\)-scores is that they allow for the comparison of values from different distributions (e.g. SAT and ACT). A second important feature is that they allow us to calculate probabilities for every normal distribution, regardless of the mean, variance, and/or units.

Normal Distributions in R

R has some pre-built functions to work with normal distributions.

pnorm() gives \(P(Z < y)\) when lower.tail = TRUE. When lower.tail = FALSE, this gives \(P(Z > y)\)
- pnorm(0, lower.tail = TRUE): 0.5
- pnorm(2, lower.tail = TRUE): 0.9772499.
- pnorm(2, lower.tail = FALSE): 0.0227501.
qnorm() is the opposite of pnorm() in that it converts probabilities into \(z\)-scores.
- qnorm(0.9772499, lower.tail = TRUE): 2.0000006
rnorm() generates random draws from a standard normal distribution with a given mean and standard deviation. The default mean is 0 and the default standard deviation is 1. In your first HW, you used a function called sample() that generated random data. This is a function that does a similar task.

Arbitrary Distributions

What about arbitrary distributions that we observe in data? We know how to calculate the mean and variance of a random variable, but it’s helpful to visualize the distribution.

If your data is sufficiently discrete, I prefer to use table(). Of course, “sufficiently discrete” is difficult to define. If your data has many unique values, you may need to transform the data via rounding before using table(). Otherwise, you can use hist() to generate a histogram (which is effectively what table() is doing) or density() to create a kernel density plot. You can think of kernel density plots as smoothed out histograms.

Let’s take a look at some real estate transaction data as an example.

Code

# https://vincentarelbundock.github.io/Rdatasets/doc/AER/HousePrices.html
data_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/AER/HousePrices.csv"
df <- read.csv(data_url)
# Keep only columns 2 through 6.
# These columns are the ones that contain numeric information.
# What this step below is doing is sub-setting the columns,
#   then saving over the original.
df <- df[,2:6]
head(df)

Output

  price lotsize bedrooms bathrooms stories
1 42000    5850        3         1       2
2 38500    4000        2         1       1
3 49500    3060        3         1       1
4 60500    6650        3         1       2
5 61000    6360        2         1       1
6 66000    4160        3         1       1

Remember the function table()? This function returned the number of times each unique value appeared in the data. This is, definitionally, a distribution! Let’s try combining table() and plot(), starting with bedrooms:

Code

plot(table(df$bedrooms),
     xlab = "Number of Bedrooms",
     ylab = "Frequency")

Plot

From this, we can see that the distribution is somewhat symmetric with the most density at 3. For bedrooms in a home, this seems reasonable. Also note that there are no decimal values, because we cannot have 3.5 bedrooms (though you can have 3.5 bathrooms!).

Next, let’s try to plot the distribution of sale price.

Code

plot(table(df$price), xlab = "Sale Price", ylab = "Frequency")

Plot

This is not a very informative plot as there’s no discernible pattern. Rather, let’s divide by 10,000, round the data, and re-multiply by 10,000.

Code

plot(table(round(df$price/10000)*10000), xlab = "Sale Price", ylab = "Frequency")

Plot

This is a much better plot. We can see that were are some outliers that are over 150,000 dollars, and the modal sale is about 60,000 dollars.

Finally, let’s say we want a way to calculate multiple summary statistics (e.g. mean, median, minimum, maximum) for multiple variables at once. Luckily for us, we can use R’s summary() function to help us.

Code

summary(df)

Output

     price           lotsize         bedrooms       bathrooms    
 Min.   : 25000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
 1st Qu.: 49125   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
 Median : 62000   Median : 4600   Median :3.000   Median :1.000  
 Mean   : 68122   Mean   : 5150   Mean   :2.965   Mean   :1.286  
 3rd Qu.: 82000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
 Max.   :190000   Max.   :16200   Max.   :6.000   Max.   :4.000  
    stories     
 Min.   :1.000  
 1st Qu.:1.000  
 Median :2.000  
 Mean   :1.808  
 3rd Qu.:2.000  
 Max.   :4.000

However, summary() isn’t our best option for creating a nice and neat table. We are going to use our first R package for this task. Using an R package requires two steps. First, you will need to install the package (download the package from the internet onto your computer), and then call the package (loading the package into the current R session). You only need to install the package once, but you need to call the package every time you use it. Installing/loading packages is something students do incorrectly all the time. Take note of this!

Below, I am installing the modelsummary package (which only needs to be done once!) and then loading it into our session (which needs to be done every time you want to use it!). You can find the documentation for modelsummary here.

Code

# only use install.packages() once per package
install.packages("modelsummary")
# use this every time you want to use 'modelsummary'
library("modelsummary")
# add this line when loading 'modelsummary', too
# note that you may need to install 'kableExtra' as well
options("modelsummary_factory_default" = "kableExtra")

This package contains a few different functions, but we are going to focus on two: datasummary() and datasummary_skim(). You can find the documentation for datasummary here. To start, let’s see the output of datasummary_skim().

Code

# Other arguements you might want to use:
#   type = "numeric"
#   fmt = fmt_significant(2)
datasummary_skim(df)

	Unique	Mean	SD	Min	Median	Max
price	219	68121.6	26702.7	25000.0	62000.0	190000.0
lotsize	284	5150.3	2168.2	1650.0	4600.0	16200.0
bedrooms	6	3.0	0.7	1.0	3.0	6.0
bathrooms	4	1.3	0.5	1.0	1.0	4.0
stories	4	1.8	0.9	1.0	2.0	4.0

While this very quickly provides us with a lot of useful information, it lacks customization. On the other hand, datasummary() provides us with much more versatility though requires a bit more effort. To start, take a look at the below example. We are summarizing price, lotsize, and bedrooms from the dataset df. For each variable, we are going to calculate length (sample size), mean, sd, min, and max.

Code

datasummary(price + lotsize + bedrooms ~ length + mean + sd + min + max,
            data = df, fmt = fmt_significant(2))

	length	mean	sd	min	max
price	546	68122	26703	25000	190000
lotsize	546	5150	2168	1650	16200
bedrooms	546	3	0.74	1	6

To further complicate this, we can include many other options. However, for simplicity, we are going to focus on the most important ones. First, we are going give the table a title. Second, we are going to rename the variables. Lastly, we are going to format some of the numbers. Take a look:

Code

title <- "This is the Table's Caption/Title"
frmla <- (`Price` = price) + (`Lot Size` = lotsize) + (`Bedrooms` = bedrooms) ~
  (`N` = length) + Mean + (`St. Dev.` = sd) + (Min = min) + (Max = max)
datasummary(frmla, data = df, title = title,
            fmt = fmt_significant(2))

This is the Table's Caption/Title
	N	Mean	St. Dev.	Min	Max
Price	546	68122	26703	25000	190000
Lot Size	546	5150	2168	1650	16200
Bedrooms	546	3	0.74	1	6