One Y Variable

Module 2.4: Functions and Loops

Author: Alex Cardazzi

Affiliation: Old Dominion University

All materials can be found at alexcardazzi.github.io.

Functions

In this portion of the module, we’re going to learn more about R. First, we’re going to learn how to write our own functions. R is open source, meaning that people can write software for others (or themselves) to use. For example, modelsummary was written (and a lot of the data we’ll use are provided) by Vincent Arel-Bundock.

Functions in R, and in programming languages more broadly, take some input (called arguments), do stuff to that input, and then return some output. This is the same as a mathematical function. Consider \(Y = mX + b\). In this function, \(X\) is the input, the “stuff” is multiplying by \(m\) and adding \(b\), and the output is \(Y\).

Let’s write our first function, which we’ll call plus2. The purpose of plus2 is to add 2 to any number we provide. Try playing with the code below:
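A minimal definition consistent with the calls shown next, assuming plus2 takes a single argument named x (the name that appears in the error messages):

```r
# plus2: add 2 to whatever number (or numeric vector) is supplied
plus2 <- function(x){
  
  return(x + 2)
}

plus2(5)        # a single number
plus2(c(1:5))   # works element-by-element on a vector
```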

Solution
Code
plus2("alex")
Error in x + 2: non-numeric argument to binary operator
Code
plus2()
Error in plus2(): argument "x" is missing, with no default
Code
plus2(c(1:5))
Output
[1] 3 4 5 6 7

Notice how plus2 accepts a single argument (x), and returns a value of x + 2. We can define a slightly more complex function that accepts two different arguments. We’ll call this function plusY, and it will accept x and y arguments that it will then add together. Try building this function below:

Solution
Code
plusY <- function(x, y){
  
  return(x + y)
}

plusY(10, 6)
plusY(c(1:5), c(6:10))
plusY(c(1:4), c(100, 200))
Output
[1] 16
[1]  7  9 11 13 15
[1] 101 202 103 204

What happens if you only give plusY() a single argument instead of two? This would generate an error – check for yourself. There are a few ways we can fix this. One way is to set default values for the arguments in our function. Below, edit plusY() so y = 0 in the function, which will set zero as y’s default.

Solution
Code
plusY <- function(x, y = 0){
  
  return(x + y)
}
plusY(10, 2)
plusY(10)
Output
[1] 12
[1] 10

Of course, we can set both defaults to zero. Then, we could technically call plusY() without any inputs, and it will return 0.
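As a quick sketch of that idea, here is plusY() with both defaults set to zero:

```r
# Both arguments now default to zero
plusY <- function(x = 0, y = 0){
  
  return(x + y)
}

plusY()        # no inputs: 0 + 0
plusY(10)      # only x supplied: 10 + 0
plusY(y = 6)   # only y supplied, by name: 0 + 6
```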

What if it doesn’t make sense to set a default? Or, maybe you want to make sure the two arguments “make sense”. For example:

Code
plusY("alex", 2)
Error in x + y: non-numeric argument to binary operator

Sometimes, we want a way to “check” the values of arguments so things like the above don’t happen. This is an instance where you could use R’s built-in function if(). This will allow for your program to “make decisions” based on pre-programmed logic.

I will demonstrate how to use if() in our plusY() function:

Code
plusY <- function(x, y = 0){
  
  # && compares single TRUE/FALSE values, which is what if() expects
  if(is.numeric(x) && is.numeric(y)){
    
    # if both are numeric, continue to the sum:
    return(x + y)
  } else { # note: if() does not NEED an else.
    
    # otherwise, print the issue:
    print("One of either x or y are non-numeric.")
  }
}

plusY(10, 2)
plusY("alex", 2)
Output
[1] 12
[1] "One of either x or y are non-numeric."

Let’s practice by making a function called summary2(). We want this function to calculate sample size, mean, standard deviation, minimum, and maximum. We also want functionality like na.rm, except we want it to always be TRUE. In addition, the function should return the number of observations it had to drop because of missingness. Our “practice” vector will be c(6, NA, 8, 2, NA, NA, 9, 5, 1, 4).

Our first step should be to figure out which observations are missing. We can do this with is.na(). This will return a Boolean vector indicating which elements are missing.

Code
y <- c(6, NA, 8, 2, NA, NA, 9, 5, 1, 4)
is.na(y)
Output
 [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

Now that we know which are missing, calculate how many are missing, and then remove them.

Solution
Code
n_missing <- sum(is.na(y))
# subset y to only the non-NA observations
#   then save over y
y <- y[!is.na(y)]

Now we can focus on returning the statistical information we’re interested in.

Code
# Here, I am naming each element in the vector.
c("Sample Size" = length(y),
  "NA Obs." = n_missing,
  "Mean" = mean(y),
  "St. Dev." = sd(y),
  "Min." = min(y),
  "Max." = max(y))
Output
Sample Size     NA Obs.        Mean    St. Dev.        Min.        Max. 
    7.00000     3.00000     5.00000     2.94392     1.00000     9.00000 

Try putting all of this together into a function:

Solution
Code
summary2 <- function(x){
  
  n_missing <- sum(is.na(x))
  x <- x[!is.na(x)]
  c("Sample Size" = length(x),
    "NA Obs." = n_missing,
    "Mean" = mean(x),
    "St. Dev." = sd(x),
    "Min." = min(x),
    "Max." = max(x)) -> result
  return(result)
}

y <- c(6, NA, 8, 2, NA, NA, 9, 5, 1, 4)
summary2(x = y)
# generate 10,000 random draws from c(1:9, NA)
y <- sample(c(1:9, NA), 10000, replace = TRUE)
summary2(x = y)
Output
Sample Size     NA Obs.        Mean    St. Dev.        Min.        Max. 
    7.00000     3.00000     5.00000     2.94392     1.00000     9.00000 
Sample Size     NA Obs.        Mean    St. Dev.        Min.        Max. 
8992.000000 1008.000000    4.980761    2.595344    1.000000    9.000000 

Loops

Next, we will introduce how to write loops in R.

Suppose we are interested in executing the same code over and over and over. As an example, suppose I wanted to write some code to print every student name and grade. Of course, I could write the following:

Code
cat("Name:", "Alex", "... Grade:", "B", "\n")
cat("Name:", "Brooke", "... Grade:", "A", "\n")
cat("Name:", "Carlos", "... Grade:", "A", "\n")
cat("Name:", "Dasia", "... Grade:", "B", "\n")
cat("Name:", "Enzo", "... Grade:", "C")
Output
Name: Alex ... Grade: B 
Name: Brooke ... Grade: A 
Name: Carlos ... Grade: A 
Name: Dasia ... Grade: B 
Name: Enzo ... Grade: C

This code works, but there are some problems. First, writing this out so many times makes it prone to typos, even if just copying and pasting. Second, almost anything that is repetitive or has an identifiable pattern is easier for a computer than for a human. Lastly, what if there were 100 names instead of 5? What if 10,000? This is where loops come in.

Our first step is going to be creating vectors of names and grades. Ideally, you’d have these data in a spreadsheet/csv and could easily read them in using read.csv().

Code
namez <- c("Alex", "Brooke", "Carlos", "Dasia", "Enzo")
gradez <- c("B", "A", "A", "B", "C")

Second, we’re going to write our loop. The loop needs two things:

  1. an iterator. Usually, people use i but it can be anything, of course. This is the only thing that will change during each repetition.
  2. loop bounds. Since we have 5 names and grades, we’re going to loop over elements 1 through 5. We’ll use 1:5, or 1:length(namez) to be even more flexible.
Code
# for(iterator in bounds)
# everything between { and } will be looped
for(i in 1:length(namez)){
  
  
}

The iterator and bounds in for loops are similar to the iterator and bounds in \(\sum_{i = 1}^n\).
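To make the analogy concrete, here is a sketch that computes \(\sum_{i = 1}^n i\) with a for loop. The names n and total are chosen only for illustration:

```r
n <- 5
total <- 0          # running sum, starts at zero

for(i in 1:n){
  
  # add the current value of the iterator to the running sum
  total <- total + i
}

print(total)        # same value as sum(1:n)
```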

Let’s just fill the loop with a simple printing statement to illustrate what the loop does.
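A minimal sketch of such a loop, which simply prints the iterator on each pass (namez is redefined here so the snippet runs on its own):

```r
namez <- c("Alex", "Brooke", "Carlos", "Dasia", "Enzo")

for(i in 1:length(namez)){
  
  # i takes the values 1, 2, ..., length(namez), one per repetition
  print(i)
}
```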

Remember, to access the first name in namez, we would use the following: namez[1]. Similarly, namez[2] would return the second element, and namez[length(namez)] would give the last element. Instead of putting a specific element in the square brackets, we can put the index variable there. Then, we can put this into our loop:
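One way this could look, using i inside the square brackets (again redefining namez so the snippet is self-contained):

```r
namez <- c("Alex", "Brooke", "Carlos", "Dasia", "Enzo")

for(i in 1:length(namez)){
  
  # namez[i] is the i-th name, so each pass prints a different element
  print(namez[i])
}
```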

Finally, we can put this all together and generate our initial output.
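A sketch of the complete loop, reusing the cat() pattern from the repetitive version and the namez and gradez vectors defined earlier:

```r
namez <- c("Alex", "Brooke", "Carlos", "Dasia", "Enzo")
gradez <- c("B", "A", "A", "B", "C")

for(i in 1:length(namez)){
  
  # the i-th name is paired with the i-th grade
  cat("Name:", namez[i], "... Grade:", gradez[i], "\n")
}
```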

Now, consider if we had many more names and grades. We could have thousands of names and grades and our little for loop would remain the same!

We are not limited to just numeric iterators/bounds. Sometimes, using non-numeric ones is helpful too:
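As an illustration, the loop can run over the names themselves rather than over positions. The iterator name (name) and the greeting text are hypothetical choices for this sketch:

```r
namez <- c("Alex", "Brooke", "Carlos", "Dasia", "Enzo")

# Here the iterator takes each value in namez directly,
# so there is no need to index with square brackets.
for(name in namez){
  
  cat("Hello,", name, "\n")
}
```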

We can also use loops to create or modify data. As an example, we’re going to write a function that returns the nth element in the Fibonacci Sequence.

Element \(n\) in the Fibonacci Sequence is simply a sum of the previous two elements. Explicitly, \(F_n = F_{n-1} + F_{n-2}\). Usually, people start the sequence with 0 and 1, which makes the third element equal to 1 (1 + 0), the fourth element equal to 2 (1 + 1), the fifth element equal to 3 (2 + 1), and so on. Our function should accept some number \(n\) and return that element within the sequence.

To do this, let’s set up the function:
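A possible skeleton, assuming the function is called fibonacci and stores the sequence in a vector v. The loop that fills in the sequence is left as a placeholder comment:

```r
fibonacci <- function(n){
  
  # Start with the typical 0 and 1...
  v <- c(0, 1)
  
  # some code here...
  
  # Return the nth element of v
  return(v[n])
}
```

At this stage, only the first two elements exist, so asking for any later element returns NA.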

Now, we need to fill in the “some code here...” part of the function. Again, the formula for any element \(n\) is just \(F_n = F_{n-1} + F_{n-2}\). So, to calculate \(F_n\), we need these other two numbers. However, to get \(F_{n-2}\), for example, we need \(F_{n-3}\) and \(F_{n-4}\). Obviously, this continues back until we arrive at \(F_1\) and \(F_2\). This suggests that we’ll need to calculate each element in the sequence until we arrive at \(F_{n}\).

Let’s start with the third element. Since we have v <- c(0, 1) already, we can write v[3] <- v[2] + v[1]. This can also be written as: v[3] <- v[3-1] + v[3-2]. Once we have v[3] established, we can calculate v[4] as v[3] + v[2], or v[4] <- v[4-1] + v[4-2]. We would repeat this for 5, 6, 7, and all the way until \(n\). Hopefully, you can see that we could also generalize this code to look like v[i] <- v[i-1] + v[i-2] inside of a for loop. Our loop bounds would start at 3 (since elements 1 and 2 are established as 0 and 1 already), and we would continue the loop until \(n\).

Solution
Code
n <- 10
v <- c(0, 1)
for(i in 3:n){
  
  v[i] <- v[i-1] + v[i-2]
}
print(v)
Output
 [1]  0  1  1  2  3  5  8 13 21 34

Once we have this information completed, we can add this into our function! Try below:

Solution
Code
fibonacci <- function(n){
  
  # Start with the typical 0 and 1...
  v <- c(0, 1)
  
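  # Only loop when n is at least 3: for smaller n, 3:n counts
  # DOWN (3, 2, ...) and the loop would fail
  if(n >= 3){
    for(i in 3:n){
  
      v[i] <- v[i-1] + v[i-2]
    }
  }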
  
  # Return the nth element of v
  return(v[n])
}

fibonacci(10)
Output
[1] 34

We can build “logic” into our loops by adding in if statements, too. Take a look at the following example using names and grades again. Here, I will change what gets printed based on the grade the individual obtained.

Code
for(i in 1:length(namez)){
  
  if(gradez[i] == "A"){
    
    cat("Name:", namez[i], "... crushin' it!", "\n")
  } else if(gradez[i] == "B"){
    
    cat("Name:", namez[i], "... you're doing a good job!", "\n")
  } else {
    
    cat("Name:", namez[i], "... keep studying!", "\n")
  }
}
Output
Name: Alex ... you're doing a good job! 
Name: Brooke ... crushin' it! 
Name: Carlos ... crushin' it! 
Name: Dasia ... you're doing a good job! 
Name: Enzo ... keep studying! 

If you remember, in Module 1 we colored points based on certain characteristics. We did this by first setting all colors to the same value, and then changing the color of certain observations based on their values. See below.

We can use loops to achieve a similar outcome. You can run the code snippets above and below to see for yourself that the results are the same.
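As a sketch of both approaches side by side, assume a hypothetical numeric vector vals, a cutoff of 5, and two arbitrary color names (none of these come from Module 1):

```r
# Hypothetical data for illustration (not the Module 1 data)
vals <- c(3, 8, 5, 10, 1, 7)

# Approach 1: set every color to the same value, then overwrite some
cols_vec <- rep("black", length(vals))
cols_vec[vals > 5] <- "tomato"

# Approach 2: the same outcome using a for loop with if()
cols_loop <- rep("black", length(vals))
for(i in 1:length(vals)){
  
  if(vals[i] > 5){
    
    cols_loop[i] <- "tomato"
  }
}

# Check that both approaches give the same colors
identical(cols_vec, cols_loop)

# Either vector could then be passed to plot(), e.g.:
# plot(vals, col = cols_loop, pch = 19)
```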

ifelse()

While both of these solutions arrive at the same answer, there is a better option.

First, in general, loops are relatively slow in R. It might not seem so when dealing with small samples, but it becomes noticeable as the data grow larger.
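One rough way to see this for yourself is to time a loop against the equivalent vectorized operation with system.time(). The task (adding 2 to each element) and the sample size are arbitrary choices for this sketch, and exact timings will differ across machines:

```r
x <- runif(1e5)   # 100,000 random numbers

# Loop version: add 2 to each element, one at a time
loop_time <- system.time({
  out_loop <- numeric(length(x))
  for(i in 1:length(x)){
    out_loop[i] <- x[i] + 2
  }
})

# Vectorized version: add 2 to the whole vector at once
vec_time <- system.time({
  out_vec <- x + 2
})

# Both versions produce the same numbers...
identical(out_loop, out_vec)

# ...but the elapsed times (in seconds) can be compared directly
loop_time["elapsed"]
vec_time["elapsed"]
```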

Second, as people often say, “lazy” programming is good programming! We should be writing as little as possible (see footnote 1), without sacrificing coherence/readability, to minimize mistakes, bugs, etc.

Introducing: ifelse(). This is a function that accepts three arguments:

  • test: an object which can be coerced to logical mode. In other words, some logical vector like v > 5
  • yes: return values for true elements of test. In other words, what should be the output when v > 5 is TRUE?
  • no: return values for false elements of test. In other words, what should be the output when v > 5 is FALSE?
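A small sketch of the three arguments in action, using a made-up vector v and arbitrary labels:

```r
v <- c(2, 7, 5, 9)

# test = v > 5, yes = "big", no = "small"
ifelse(v > 5, "big", "small")
# returns "small" "big" "small" "big"
```

Note that 5 itself is labeled "small", since 5 > 5 is FALSE.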

You can think of ifelse() as creating a for loop with if() statements inside of it. In fact, we can even nest ifelse() calls. For example, consider data on the U.S. Senate vote on the use of force against Iraq in 2002. For each observation, we want to assign some value (here, I chose to assign some text) by party and by vote.
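To illustrate the nesting, here is a sketch with two short, made-up vectors standing in for the Senate data. The names party, vote, and label, along with the codes and text used, are hypothetical:

```r
# Made-up stand-ins for the Senate data (not the real votes)
party <- c("D", "R", "D", "R", "I")
vote  <- c("Yea", "Yea", "Nay", "Nay", "Nay")

# The outer ifelse() splits by party;
# the inner ifelse() calls split each party by vote.
label <- ifelse(party == "D",
                ifelse(vote == "Yea", "Democrat, voted Yea", "Democrat, voted Nay"),
                ifelse(party == "R",
                       ifelse(vote == "Yea", "Republican, voted Yea", "Republican, voted Nay"),
                       "Other party"))

print(label)
```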

Footnotes

  1. Except for code comments! Always comment your code.