Foundations
Module 1.1: Types of Data
All materials can be found at alexcardazzi.github.io.
Data, Data, Data
Nowadays, data is (are?) everywhere. Data has become a buzzword for those in business, policy, government, industry, science, etc. Understanding how to work with data, therefore, is becoming an increasingly valuable and marketable skill. Employers want to hire people who can use data.
However, everyone “knows” Excel. Everyone “knows” what a correlation is. So, how can you differentiate yourself from others? Put differently, how can you send a credible signal to employers that you really know data?
This course will (re-)introduce you to statistics, econometrics, and R.
Population vs Sample
A statistic is a measurement from some data.
- Statistics can come from either populations or samples:
- Population: The entire possible group of observations. For example, the population of students in the college of business.
- Sample: Any subset of observations from the population. For example, Economics majors or business school juniors.
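To make the distinction concrete, here is a minimal Python sketch that computes the same statistic (mean height) once from a population and once from a sample. The six students, their majors, and their heights are made up for illustration, and the field names are hypothetical.

```python
from statistics import mean

# Hypothetical population: every student in a (tiny) college of business.
# Names, majors, and heights are invented for illustration.
population = [
    {"name": "Ana",  "major": "Economics",  "height_in": 64},
    {"name": "Ben",  "major": "Finance",    "height_in": 70},
    {"name": "Cara", "major": "Economics",  "height_in": 67},
    {"name": "Dev",  "major": "Marketing",  "height_in": 72},
    {"name": "Elle", "major": "Accounting", "height_in": 62},
    {"name": "Finn", "major": "Economics",  "height_in": 69},
]

# A sample is any subset of the population; here, the Economics majors.
sample = [s for s in population if s["major"] == "Economics"]

# The same statistic, measured on the population and on the sample.
population_mean = mean(s["height_in"] for s in population)
sample_mean = mean(s["height_in"] for s in sample)

print(f"Population size: {len(population)}, mean height: {population_mean:.2f} in")
print(f"Sample size:     {len(sample)}, mean height: {sample_mean:.2f} in")
```

The two means differ slightly, which is typical: a statistic from a sample is only an approximation of the same statistic from the population.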
Types of Statistics
Descriptive statistics are meant to organize, summarize, and present data in ways that help people understand key facts about the data.
Inferential statistics are descriptive statistics that are used to estimate properties of a population based on a sample.
Random Samples
Suppose I want to know the average height of ODU students. Asking all 20,000+ students how tall they are would be very expensive (time + cost + effort), so I need to do something else.
- It’s very easy to find heights of ODU athletes online. Most rosters include heights.
- For example, perhaps the average height of the men’s basketball team is 6’5”. This is an interesting descriptive statistic, but a poor inferential statistic since basketball players are notoriously taller than average.
- Maybe I could ask everyone in this course how tall they are.
- The average height of you all would be a much better inferential statistic.
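The height example can be simulated. The following Python sketch builds a hypothetical population of 20,000 heights and compares a biased sample (the tallest students, standing in for a basketball roster) with a random sample. The normal distribution, its parameters (mean 67 inches, standard deviation 4), and the sample sizes are all assumptions chosen for illustration, not real ODU data.

```python
import random
from statistics import mean

random.seed(311)  # fixed seed so the sketch is reproducible

# Hypothetical population: 20,000 heights in inches (assumed normal, mean 67, sd 4).
population = [random.gauss(67, 4) for _ in range(20_000)]
true_mean = mean(population)

# Biased sample: the 15 tallest students, standing in for a basketball roster.
biased_sample = sorted(population)[-15:]

# Random sample: 50 students, each equally likely to be chosen.
random_sample = random.sample(population, 50)

biased_mean = mean(biased_sample)
random_mean = mean(random_sample)

print(f"Population mean:    {true_mean:.2f}")
print(f"Biased sample mean: {biased_mean:.2f}")
print(f"Random sample mean: {random_mean:.2f}")
```

Under these assumptions, the random sample's mean lands close to the population mean while the roster-style sample overshoots badly, even though both are perfectly valid descriptive statistics of their own groups.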
Qualitative vs Quantitative
Statistics can be qualitative or quantitative.
Qualitative statistics describe characteristics or traits of something that are not naturally quantifiable. Examples include eye color, nationality, etc.
Quantitative statistics are numerical properties of some data. Examples include height, income, etc.
Types of Measurement
Statistics can also be broken down into what type of measurements they are. Four important distinctions are:
- Nominal
- Ordinal
- Interval
- Ratio
Nominal: Nominal data is represented by labels or names. In addition, there is no natural order to these data. An example would be “language spoken”, and responses might be “English”, “Spanish”, “Japanese”, “Italian”, etc. Nominal data is therefore always qualitative.
Ordinal: Ordinal data is recorded in reference to some relative ranking. There is indeed an order, unlike nominal data, but the numbers representing the order do not have any other meaning. For example: a ranked list of the fastest 100 meter dash times. The gap in time between #1 and #2 is not necessarily the same as the gap between #8 and #9.
Interval: For interval data, the distance between numbers is indeed meaningful, and there must be units that accompany the measurements. However, zero is usually “meaningless”. For example: Fahrenheit or dress sizes. In both cases, the difference between any two adjacent values is the same. At the same time, zero degrees Fahrenheit or a size zero dress does not mean an absence of temperature or fabric.
Ratio: Ratio measurements are interval measurements, except zero has a natural meaning. When zero has a natural meaning, the ratio of two measurements also has meaning. For example, consider income. Someone with $50 has twice as much as someone with $25. For Fahrenheit, 50 degrees is not twice as hot as 25 degrees. In addition, having $0 means you have no money (and therefore negatives mean something too).
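The difference between interval and ratio measurements can be checked with a little arithmetic. The Python sketch below compares the income ratio with the Fahrenheit "ratio", and then converts both temperatures to Kelvin. Kelvin is not mentioned above; it is brought in here only because it is a temperature scale with a true zero. The function name is hypothetical.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale, true zero)."""
    return (deg_f - 32) * 5 / 9 + 273.15

# Ratio data: $0 means no money, so the ratio of two incomes is meaningful.
income_a, income_b = 50, 25
income_ratio = income_a / income_b  # 2.0: twice as much money

# Interval data: the naive ratio of two Fahrenheit readings is also 2.0,
# but zero Fahrenheit is arbitrary, so this number has no physical meaning.
temp_a, temp_b = 50, 25
naive_ratio = temp_a / temp_b

# On a scale with a true zero, the same two temperatures differ by about 5%.
kelvin_ratio = fahrenheit_to_kelvin(temp_a) / fahrenheit_to_kelvin(temp_b)

print(f"Income ratio:           {income_ratio:.2f}")
print(f"Naive Fahrenheit ratio: {naive_ratio:.2f}")
print(f"Kelvin ratio:           {kelvin_ratio:.3f}")

# Differences, unlike ratios, are meaningful on an interval scale:
# the gap from 25 to 50 degrees is the same size as the gap from 50 to 75.
same_gap = (50 - 25) == (75 - 50)
print(f"Equal 25-degree gaps:   {same_gap}")
```

So 50 degrees Fahrenheit is only about 5 percent "hotter" than 25 degrees once temperature is measured from a true zero, while $50 really is twice $25.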
Discrete vs Continuous
When measurement is quantitative, the measurement can be either discrete or continuous.
Discrete: Discrete measurements do not allow for fractions or decimals. Only things that can be measured as integers (1, 2, 3, …) qualify as discrete. For example, the number of followers you have on social media is discrete because you cannot have half of a follower.
Continuous: Continuous data can be subdivided into non-whole numbers. For example: time is a continuous measurement because you can have 15.623941004 seconds. Income can be considered continuous even though dollars only go to two decimal places – it is close enough that people would consider it continuous.
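A short Python sketch of why income is "close enough" to continuous. The follower count and the $100,000 upper bound are arbitrary values chosen for illustration.

```python
# Discrete: a follower count moves in whole steps, so an integer is enough.
followers = 1_204

# Continuous: elapsed time can take any fractional value.
race_time_seconds = 15.623941004

# Income is recorded in whole cents, so strictly speaking it is discrete.
# Counting the possible values between $0.00 and $100,000.00 (an arbitrary
# range) shows why it is usually treated as continuous anyway.
max_income_dollars = 100_000
possible_incomes = max_income_dollars * 100 + 1  # one value per cent, including $0.00

print(f"{possible_incomes:,} possible income values")
```

With roughly ten million possible values in that range, the steps between neighboring incomes are too small to matter in practice.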
In reality, data sets come with rows and columns. Often, each column will contain a different type of measurement, whether it be qualitative or quantitative. In econometrics, there are three main ways to organize data. Usually, we have units (states, firms, individuals, etc.) and time (year, month, quarter, day).
- Cross Section: data collected for many units at the same (or similar) time.
- Time Series: data collected for one unit over many time periods.
- Panel: data collected for many units over many time periods.
- When all units are observed for each time period (50 states over 20 years), the panel is said to be balanced.
- When some units are observed more often than others, this is an unbalanced panel.
- Sometimes, people observe totally different, unique units each time period. For example, observing the students in Econ 311 over time. This is called a pooled cross section. Essentially, the data contains many cross sections all pooled together.
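The structures above can be told apart mechanically from the (unit, time) pairs in a data set. The following Python sketch is one possible way to do that; the function name `describe_structure` and the example data are hypothetical, and the rules are a simplification of the definitions above (for instance, a single observation is reported as a cross section).

```python
def describe_structure(rows):
    """Classify a list of (unit, period) observations by data structure."""
    units = {unit for unit, _ in rows}
    periods = {period for _, period in rows}

    if len(periods) == 1:
        return "cross section"   # many units, one time period
    if len(units) == 1:
        return "time series"     # one unit, many time periods

    # Pooled cross section: each unit is observed in only one period.
    periods_seen = {u: {p for unit, p in rows if unit == u} for u in units}
    if all(len(seen) == 1 for seen in periods_seen.values()):
        return "pooled cross section"

    # Panel: balanced only if every unit appears in every period.
    observed = set(rows)
    balanced = all((u, p) in observed for u in units for p in periods)
    return "balanced panel" if balanced else "unbalanced panel"


# Made-up examples: states observed by year, and students observed by year.
cross_section = [("VA", 2020), ("NC", 2020), ("MD", 2020)]
time_series = [("VA", 2018), ("VA", 2019), ("VA", 2020)]
balanced_panel = [(s, y) for s in ("VA", "NC") for y in (2019, 2020)]
unbalanced_panel = balanced_panel[:-1]  # drop the last observation (NC in 2020)
pooled = [("student_1", 2019), ("student_2", 2019),
          ("student_3", 2020), ("student_4", 2020)]

for name, data in [("cross_section", cross_section),
                   ("time_series", time_series),
                   ("balanced_panel", balanced_panel),
                   ("unbalanced_panel", unbalanced_panel),
                   ("pooled", pooled)]:
    print(f"{name}: {describe_structure(data)}")
```

In practice the unit and time columns would come from a real data set; the check itself only needs to know which units were observed in which periods.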
R and RStudio
In this course, students will learn to program in the language R. This will be new for most students, but it is a powerful tool that will help you stand out on the job market.
Many jobs also ask for applicants to be knowledgeable in Excel. Of course, this is a widely used software, and strong knowledge is definitely important. As a challenge, however, let’s try to not open Excel. If you can put that constraint on yourself, learning R will be easier.
You should think of learning R as learning a foreign language. Immersion is the best way to learn a foreign language, and R is no different. If you cut out Excel, you will be forced to use R, and you will improve faster. Or, if you have to do something in Excel, try to replicate it in R.