So far, we have characterized certain “moments” of data (i.e., mean and variance).
While these two numbers are helpful in understanding the data, they do not tell the full story. As an example, let’s consider the following data:
Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/anscombe.csv")
cat("Summary Statistics for x1. Mean:", mean(df$x1), "; Variance:", var(df$x1), "\n")
cat("Summary Statistics for x4. Mean:", mean(df$x4), "; Variance:", var(df$x4))
Output
Summary Statistics for x1. Mean: 9 ; Variance: 11
Summary Statistics for x4. Mean: 9 ; Variance: 11
Both of these columns have the same mean and variance (and thus standard deviation), so you might think that they must look similar. Let’s take a look at the data and see for ourselves:
Clearly, even though these data have the same mean and variance, they have very different distributions. The first column has values ranging from 4 to 14, while the second column has a single 19 and the rest 8s. Note that this is the inverse of the case of Michael Jordan and UNC. There, the distributions were presumably very similar with the exception of Michael Jordan’s single outlier salary.
Before we can keep talking about distributions, though, we need to discuss probability briefly. Put a pin in this idea of distributions for a moment.
Probability
Probability
A value between 0 and 1 inclusive that represents the likelihood a particular event happens.
The probability that a flipped coin lands on tails is \(\frac{1}{2}\), or \(0.5\), or \(50\%\). What about the probabilities and outcomes of flipping two coins and counting the number of tails? Well, since the number of tails is a random variable, it has a distribution. That distribution is defined by all possible outcomes and their associated probabilities:
Percentages are not really probabilities, but they are often associated with them.
Probability Distribution of Two Coin Flips

| Number of Tails, \(y\) | Probability, \(P(y)\) |
|---|---|
| 0 | 0.25 |
| 1 | 0.50 |
| 2 | 0.25 |
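We can check these probabilities by simulation. Below is a minimal sketch; the number of simulations, `n_sims`, is an arbitrary choice for illustration:

```r
# Simulate flipping two fair coins many times and count the tails each time.
set.seed(1)  # for reproducibility
n_sims <- 100000
tails <- replicate(n_sims, sum(sample(c("H", "T"), size = 2, replace = TRUE) == "T"))

# Empirical probabilities should be close to 0.25, 0.50, and 0.25
prop.table(table(tails))
```

With enough simulations, the empirical proportions converge to the theoretical probabilities in the table above.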
All distributions have a mean and a variance. So, how do we find the mean and variance of a probability distribution?
In fact, not all distributions have defined means and/or variances, but these types of distributions are outside the scope of this course.
Flipping coins is an example of a discrete distribution. In other words, each individual outcome has a non-zero probability of occurring. However, when distributions are continuous, the probability of any individual outcome occurring is precisely zero. What? How can that be true?
Consider the lifespan of a tire on a car. Suppose the average tire lasts about 54,000 miles. However, what is the probability that your tire lasts exactly 54,000 miles? Essentially 0%.
This is because your tire could last 54,000.1 miles, 54,000.01 miles, and so on. There are an infinite number of miles that your tire could last, so the probability of any specific mile being the one where your tire goes flat is zero.
When dealing with continuous probabilities, instead of asking for \(P(y = M)\) (e.g., the probability that your tire lasts exactly \(M\) miles), people ask for \(P(y < M)\), \(P(y > m)\), or \(P(m < y < M)\) (e.g., the probability that your tire lasts more or less than some number of miles, or the probability that your tire lasts between two numbers of miles).
In the interest of being concrete, suppose tires last anywhere between 40,000 and 68,000 miles. Suppose further that the distribution is uniform, meaning that the probability of your tire going flat is the same from 40,000 to 68,000 miles. A question from a statistics class might be: “what is the probability that your tire lasts more than 60,000 miles?”
To answer this, you need to know what fraction of tires last beyond 60,000 miles. Since all tires go flat between 40,000 and 68,000 miles, the baseline range (in thousands of miles) is \(68 - 40 = 28\). The range we’re interested in is \(68 - 60 = 8\). So, the probability is \(\frac{68-60}{68-40} = \frac{8}{28} \approx 0.286\).
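This arithmetic can also be done with R’s built-in punif() function for uniform distributions; this sketch is just one way to check the calculation above:

```r
# P(tire lasts more than 60,000 miles), assuming lifespans are
# uniform between 40,000 and 68,000 miles
punif(60000, min = 40000, max = 68000, lower.tail = FALSE)

# Same answer by hand: (68 - 60) / (68 - 40)
(68 - 60) / (68 - 40)
```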
The Normal Distribution
The above math highlights how you would deal with probabilities and continuous distributions. However, not all continuous distributions are “uniform” like we assumed above. Perhaps the most famous and widely studied distribution is the normal distribution.
The normal distribution:
is bell-shaped with its peak in the center
is symmetric, meaning the mean is the same as the median, and it asymptotically approaches the x-axis (meaning it gets close but never touches it!)
is completely characterized by its mean and standard deviation, and is written like \(N(\mu, \sigma)\)
Below is an example of what a normal distribution looks like:
Plot
Normal distributions take on different shapes depending on their means and standard deviations.
Increasing the mean, while holding the standard deviation constant, slides the distribution horizontally to the right along the x-axis.
An increase in the standard deviation, while holding the mean constant, results in a shorter, wider distribution.
Below are some examples of what these changes would do to a standard normal distribution:
Plot
Normal Distribution Application
To learn the most incredible feature of the normal distribution, you’ll have to wait for later in the module. However, while we wait, let’s discuss how to compare two different normal distributions.
Two particularly famous distributions that are normally distributed are scores on the SAT and ACT. The maximum score for each is 1600 and 36, respectively. The SAT has an average score of 1050 with a standard deviation of 200. The ACT has an average score of 19 with a standard deviation of 6.
So, let’s say someone scores a 1,300 on the SAT and a 29 on the ACT. Which test did they do better on?
Well, let’s first visualize the two distributions and the two scores.
Plot
Unfortunately, this visualization is not so helpful. Since the two distributions have different means and standard deviations, we cannot really compare them. A first step toward a better comparison is to subtract the mean from both distributions. Then, both distributions will have a mean of zero.
Plot
Still, this is not helpful. Now, the variances are distorting the comparison. To fix this, we can divide each by its respective standard deviation. This way, both distributions have a mean of zero and a variance (and standard deviation) of 1.
Plot
Now, it should be quite obvious which score was “better”. The SAT score has a value of 1.25, while the ACT score has a value of about 1.67.
What do these numbers mean, though? In words, this is the number of standard deviations away from the mean the original score is. This actually has a name, and it is a \(z\)-score. The formula for a \(z\)-score is:
\[
z = \frac{y - \mu}{\sigma}
\]
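Applying this formula to the SAT and ACT scores from above, here is a small sketch (the helper function `z_score` is just for illustration):

```r
# z-score: how many standard deviations a value y sits from the mean mu
z_score <- function(y, mu, sigma) (y - mu) / sigma

z_score(1300, mu = 1050, sigma = 200)  # SAT: 1.25
z_score(29, mu = 19, sigma = 6)        # ACT: about 1.67
```

The ACT score is farther above its mean, in standard-deviation units, than the SAT score.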
\(z\)-scores come from what is called a standard normal distribution, which is just a normal distribution with a mean of 0 and standard deviation (and variance) of 1, or \(N(0, 1)\). We know a lot about the standard normal distribution. For example, we know the probability that a random draw from a normal distribution is greater than some number, less than some number, or between two numbers. In most statistics courses, this is where you’d be shown a \(z\)-score table. Rather than spending time on this table, here are some takeaways:
Since the distribution is symmetric, the probability of a random draw being larger (or smaller) than 0 is \(\frac{1}{2}\).
There is a 15.87% chance that a random draw is larger (smaller) than +1 (-1).
There is a 2.28% chance that a random draw is larger (smaller) than +2 (-2).
This means that there is a 95.44% chance that the random draw is between -2 and +2.
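These takeaways can be verified in R with pnorm(), which returns standard normal probabilities:

```r
# Probability a standard normal draw is larger than 0, 1, and 2
pnorm(0, lower.tail = FALSE)  # 0.5
pnorm(1, lower.tail = FALSE)  # about 0.1587
pnorm(2, lower.tail = FALSE)  # about 0.0228

# Probability a draw lands between -2 and +2
pnorm(2) - pnorm(-2)          # about 0.9544
```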
Let me add some words to all of this math. Remember, if we picked a random element out of a vector, we would expect that element to be roughly \(\sigma\) units away from the mean. What is the \(z\)-score doing? It is scaling the distance from some value \(y\) to the mean \(\mu\) by a measure of the average distance from the mean, \(\sigma\). Therefore, values of \(z\) that are greater than 1 in absolute value indicate that \(y\) is farther from the mean than the average element, and values less than 1 indicate that \(y\) is closer than average. Being more than two times the average distance from the mean, for example, is pretty rare: there is only a 2.28% chance. A first important feature of \(z\)-scores is that they allow for the comparison of values from different distributions (e.g., SAT and ACT). A second important feature is that they allow us to calculate probabilities for every normal distribution, regardless of the mean, variance, and/or units.
Normal Distributions in R
R has some pre-built functions to work with normal distributions.
pnorm() gives \(P(Z < y)\) when lower.tail = TRUE. When lower.tail = FALSE, this gives \(P(Z > y)\)
pnorm(0, lower.tail = TRUE): 0.5
pnorm(2, lower.tail = TRUE): 0.9772499.
pnorm(2, lower.tail = FALSE): 0.0227501.
qnorm() is the opposite of pnorm() in that it converts probabilities into \(z\)-scores.
qnorm(0.9772499, lower.tail = TRUE): 2.0000006
rnorm() generates random draws from a normal distribution with a given mean and standard deviation. The default mean is 0 and the default standard deviation is 1 (i.e., the standard normal). In your first HW, you used a function called sample() that generated random data; rnorm() does a similar task.
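As a quick sketch of rnorm() in action (the sample size and parameters here are arbitrary choices), the mean and standard deviation of many draws should land close to the values we ask for:

```r
set.seed(1)  # for reproducibility
draws <- rnorm(100000, mean = 10, sd = 2)

mean(draws)  # close to 10
sd(draws)    # close to 2
```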
Arbitrary Distributions
What about arbitrary distributions that we observe in data? We know how to calculate the mean and variance of a random variable, but it’s helpful to visualize the distribution.
If your data is sufficiently discrete, I prefer to use table(). Of course, “sufficiently discrete” is difficult to define. If your data has many unique values, you may need to transform the data via rounding before using table(). Otherwise, you can use hist() to generate a histogram (which is effectively what table() is doing) or density() to create a kernel density plot. You can think of kernel density plots as smoothed out histograms.
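Here is a sketch of the rounding idea using simulated data (the vector `x` is hypothetical, just for illustration):

```r
set.seed(1)
x <- rnorm(1000)   # 1,000 continuous draws: every value is unique,
length(unique(x))  # so table(x) on its own would be useless

# Round first, then tabulate: now table() shows the shape of the distribution
table(round(x))
```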
# https://vincentarelbundock.github.io/Rdatasets/doc/AER/HousePrices.html
data_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/AER/HousePrices.csv"
df <- read.csv(data_url)

# Keep only columns 2 through 6.
# These columns are the ones that contain numeric information.
# What this step below is doing is sub-setting the columns,
# then saving over the original.
df <- df[, 2:6]
head(df)
Remember the function table()? This function returned the number of times each unique value appeared in the data. This is, definitionally, a distribution! Let’s try combining table() and plot(), starting with bedrooms:
Code
plot(table(df$bedrooms),
     xlab = "Number of Bedrooms",
     ylab = "Frequency")
Plot
From this, we can see that the distribution is somewhat symmetric with the most density at 3. For bedrooms in a home, this seems reasonable. Also note that there are no decimal values, because we cannot have 3.5 bedrooms (though you can have 3.5 bathrooms!).
Next, let’s try to plot the distribution of sale price.
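One way to do this is with a kernel density plot via density(). This sketch re-reads the HousePrices data so it is self-contained; it assumes the URL from earlier is still reachable:

```r
data_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/AER/HousePrices.csv"
df <- read.csv(data_url)

# Kernel density plot: a smoothed histogram of sale prices
d <- density(df$price)
plot(d, main = "Distribution of Sale Price", xlab = "Price (dollars)")
```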
This is a much better plot. We can see that there are some outliers over 150,000 dollars, and the modal sale price is about 60,000 dollars.
Finally, let’s say we want a way to calculate multiple summary statistics (e.g. mean, median, minimum, maximum) for multiple variables at once. Luckily for us, we can use R’s summary() function to help us.
Code
summary(df)
Output
price lotsize bedrooms bathrooms
Min. : 25000 Min. : 1650 Min. :1.000 Min. :1.000
1st Qu.: 49125 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
Median : 62000 Median : 4600 Median :3.000 Median :1.000
Mean : 68122 Mean : 5150 Mean :2.965 Mean :1.286
3rd Qu.: 82000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
Max. :190000 Max. :16200 Max. :6.000 Max. :4.000
stories
Min. :1.000
1st Qu.:1.000
Median :2.000
Mean :1.808
3rd Qu.:2.000
Max. :4.000
However, summary() isn’t our best option for creating a nice and neat table. We are going to use our first R package for this task. Using an R package requires two steps. First, you need to install the package (download it from the internet onto your computer), and then call the package (load it into the current R session). You only need to install the package once, but you need to call the package every time you use it. Installing/loading packages is something students do incorrectly all the time. Take note of this!
Below, I am installing the modelsummary package (which only needs to be done once!) and then loading it into our session (which needs to be done every time you want to use it!). You can find the documentation for modelsummary here.
Code
# only use install.packages() once per package
install.packages("modelsummary")

# use this every time you want to use 'modelsummary'
library("modelsummary")

# add this line when loading 'modelsummary', too
# note that you may need to install 'kableExtra' as well
options("modelsummary_factory_default" = "kableExtra")
This package contains a few different functions, but we are going to focus on two: datasummary() and datasummary_skim(). You can find the documentation for datasummary here. To start, let’s see the output of datasummary_skim().
Code
# Other arguments you might want to use:
# type = "numeric"
# fmt = fmt_significant(2)
datasummary_skim(df)
| | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| price | 219 | 0 | 68121.6 | 26702.7 | 25000.0 | 62000.0 | 190000.0 |
| lotsize | 284 | 0 | 5150.3 | 2168.2 | 1650.0 | 4600.0 | 16200.0 |
| bedrooms | 6 | 0 | 3.0 | 0.7 | 1.0 | 3.0 | 6.0 |
| bathrooms | 4 | 0 | 1.3 | 0.5 | 1.0 | 1.0 | 4.0 |
| stories | 4 | 0 | 1.8 | 0.9 | 1.0 | 2.0 | 4.0 |
While this very quickly provides us with a lot of useful information, it lacks customization. On the other hand, datasummary() provides much more versatility, though it requires a bit more effort. To start, take a look at the example below. We are summarizing price, lotsize, and bedrooms from the dataset df. For each variable, we are going to calculate length (sample size), mean, sd, min, and max.
Code
datasummary(price + lotsize + bedrooms ~ length + mean + sd + min + max,
            data = df, fmt = fmt_significant(2))
| | length | mean | sd | min | max |
|---|---|---|---|---|---|
| price | 546 | 68122 | 26703 | 25000 | 190000 |
| lotsize | 546 | 5150 | 2168 | 1650 | 16200 |
| bedrooms | 546 | 3 | 0.74 | 1 | 6 |
To further customize this, we can include many other options. However, for simplicity, we are going to focus on the most important ones. First, we are going to give the table a title. Second, we are going to rename the variables. Lastly, we are going to format some of the numbers. Take a look:
Code
title <- "This is the Table's Caption/Title"

frmla <- (`Price` = price) + (`Lot Size` = lotsize) + (`Bedrooms` = bedrooms) ~
  (`N` = length) + Mean + (`St. Dev.` = sd) + (Min = min) + (Max = max)

datasummary(frmla, data = df, title = title, fmt = fmt_significant(2))