Strings

Strings are what the computer science community calls text data.

paste("thom", "yorke")
## [1] "thom yorke"
paste("thom", "yorke", sep = "-")
## [1] "thom-yorke"
paste0("thom", "yorke")
## [1] "thomyorke"
paste0(c("thom", "jonny"), "_", c("yorke", "greenwood"))
## [1] "thom_yorke"      "jonny_greenwood"
substr("thom yorke", 1, 4)
## [1] "thom"
radiohead <- c("thom yorke", "jonny greenwood")

gsub("o", "_", radiohead)
## [1] "th_m y_rke"      "j_nny greenw__d"
grepl("m", radiohead); grepl("y", radiohead)
## [1]  TRUE FALSE
## [1] TRUE TRUE
regexpr("m", radiohead); regexpr("y", radiohead)
## [1]  4 -1
## attr(,"match.length")
## [1]  1 -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [1] 6 5
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
unlist(strsplit(radiohead, " "))
## [1] "thom"      "yorke"     "jonny"     "greenwood"

These string operations are all relatively simple, but there is much more available to us.

Suppose we want to remove two patterns instead of just one. We can use |, which you can think of as the 'or' operator.

gsub("a|b", "", c("aaaa", "bbbb", "cccc", "abc", "babaababbba"))
## [1] ""     ""     "cccc" "c"    ""

Suppose now that you want to match a pattern only at the beginning or end of a string. Use ^ to anchor a pattern to the start of the string and $ to anchor it to the end.

gsub("a$|^b", "", c("apple", "banana", "beef", "Nokia"))
## [1] "apple" "anan"  "eef"   "Noki"

You can also use this same strategy to remove more general patterns like numbers ([[:digit:]]), letters ([[:alpha:]]), spaces (\\s or [[:space:]]), or punctuation ([[:punct:]]).
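As a rough sketch of these character classes in action, each call below strips one class of characters. The vector x is made up for illustration and is not part of the examples above.

```r
# Hypothetical example vector, chosen to contain letters, digits, spaces and punctuation
x <- c("abc 123!", "R2-D2", "no digits here.")

no_digits <- gsub("[[:digit:]]", "", x)  # "abc !"    "R-D"    "no digits here."
no_alpha  <- gsub("[[:alpha:]]", "", x)  # " 123!"    "2-2"    "  ."
no_punct  <- gsub("[[:punct:]]", "", x)  # "abc 123"  "R2D2"   "no digits here"
no_space  <- gsub("[[:space:]]", "", x)  # "abc123!"  "R2-D2"  "nodigitshere."
```

Note that the hyphen in "R2-D2" counts as punctuation, so [[:punct:]] removes it.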

There are some characters that are "special", which makes this a bit more tricky. For example, $ is a special character, as shown above. To match one of these characters literally, you need to escape it with two backslashes. For example:

gsub("$", "", c("$1.00", "1 dollar", "one $"))
## [1] "$1.00"    "1 dollar" "one $"
gsub("\\$", "", c("$1.00", "1 dollar", "one $"))
## [1] "1.00"     "1 dollar" "one "
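The same escaping idea applies to other special characters, such as the period. The sketch below uses a made-up prices vector and also shows the fixed = TRUE argument of gsub(), which treats the whole pattern as literal text; whether escaping or fixed = TRUE reads better is a matter of taste.

```r
# Hypothetical example vector
prices <- c("$1.00", "one $")

# An unescaped . matches every character, so everything is removed
all_gone <- gsub(".", "", prices)                       # "" ""

# Escaping the period removes only literal periods
no_period <- gsub("\\.", "", prices)                    # "$100" "one $"

# fixed = TRUE turns off special characters entirely, so "$" is a literal dollar sign
no_dollar <- gsub("$", "", prices, fixed = TRUE)        # "1.00" "one "
```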

Finally, if you are interested in even more general patterns, you can use the following: . (matches any character), + (matches the preceding item one or more times), and * (matches the preceding item zero or more times).

These allow for very general pattern matching. As an overly simple example, suppose you want to replace double spaces with a single space. Easy: gsub("  ", " ", vec). However, what if you don't know how many spaces in a row are possible? You can do this: gsub("\\s+", " ", vec). This will replace each run of whitespace with one space.

gsub("\\s+", " ", c("alex cardazzi", "   alex", " alex ", "alex        cardazzi"))
## [1] "alex cardazzi" " alex"         " alex "        "alex cardazzi"

Another, more general example: suppose you have a vector of people's names. Some people enter their first and last names; others enter their first, middle, and last. If you want to remove all middle names, you have to get a bit clever. You might want to gsub everything from one space to the next. Therefore, you need to identify the first space (\\s), tell R "anything else" (.), match "anything else" at least one time (+), and then a second space (\\s). How might you write code to do the opposite (only keep the middle name)?

gsub("\\s.+\\s", " ", c("alex john cardazzi", "jeremy william barbara", "nate crawford moon", "jordan scott elam", "julius randle"))
## [1] "alex cardazzi"  "jeremy barbara" "nate moon"      "jordan elam"   
## [5] "julius randle"
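One thing worth knowing about this pattern is that .+ is greedy: it grabs as much as it can, so the match runs from the first space to the last space in the string. A small sketch with a made-up four-part name shows the effect:

```r
# Hypothetical names: one with two middle names, one with none
long_names <- c("mary ann louise smith", "julius randle")

# Everything between the first and the last space is replaced by a single space
trimmed <- gsub("\\s.+\\s", " ", long_names)  # "mary smith" "julius randle"
```

So the same pattern removes any number of middle names, not just one.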

Finally, the difference between + and * is that + must match the preceding item at least once, while * may match it zero or more times. Note the difference below:

gsub("ab+", "", c("abs", "back", "arms", "lower abs"))
## [1] "s"       "back"    "arms"    "lower s"
gsub("ab*", "", c("abs", "back", "arms", "lower abs"))
## [1] "s"       "bck"     "rms"     "lower s"

For lots more, see R's built-in help page on regular expressions (run ?regex).

For Loop

Computers are really good at doing the same thing over and over again. Anything we tell the computer to repeat, we write as a loop. A trivial example is printing something over and over. We can write print("hello world") N times, once per line. Or, we can write it once and loop over it. Here is how we would do this:

for(i in 1:4){
  
  print("hello world")
}
## [1] "hello world"
## [1] "hello world"
## [1] "hello world"
## [1] "hello world"

We can also use the loop variable (i in the previous snippet) inside the loop body. It does not have to be a number; here it takes each value of a character vector in turn:

namez <- c("alex", "brad", "bryan", "adam")
for(name in namez){
  
  print(paste0("hello ", name, "!"))
}
## [1] "hello alex!"
## [1] "hello brad!"
## [1] "hello bryan!"
## [1] "hello adam!"
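The loop above walks over the values directly. If you also need the position of each element, one option is to loop over the indices instead. The sketch below uses seq_along(), which behaves like 1:length(x) but also does the right thing for an empty vector; the greetings vector is a name invented for this example.

```r
namez <- c("alex", "brad", "bryan", "adam")

# Pre-allocate a vector to hold one greeting per name
greetings <- character(length(namez))

for (i in seq_along(namez)) {
  # i is the position, namez[i] is the value at that position
  greetings[i] <- paste0(i, ": hello ", namez[i], "!")
}

print(greetings)
```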

For loops run a pre-specified number of times. Maybe not 1:4, but maybe 1:nrow(df) or 1:length(x). Even though you might not know the number of iterations before entering the loop, it is still predefined. While loops, by contrast, keep executing as long as some condition holds. This introduces the potential for an infinite loop! See below, but do not run:

i <- 0
while(i < 2){
  
  print("hello")
}

Since i is set to 0 and never changes inside the loop, it will ALWAYS be less than two. Instead, something you could do is this:

i <- 0
iters <- 0
set.seed(845)
while(i < 1.96){
  
  i <- rnorm(1) #create a random number
  iters <- iters + 1
  print(paste0(iters, ": ", round(i, 3)))
}
## [1] "1: -1.253"
## [1] "2: -1.33"
## [1] "3: -1.94"
## [1] "4: 0.293"
## [1] "5: -0.012"
## [1] "6: -0.158"
## [1] "7: -0.13"
## [1] "8: 1.286"
## [1] "9: 0.239"
## [1] "10: 0.3"
## [1] "11: 1.351"
## [1] "12: 0.003"
## [1] "13: -0.281"
## [1] "14: -1.381"
## [1] "15: 0.105"
## [1] "16: 0.355"
## [1] "17: -0.72"
## [1] "18: -0.958"
## [1] "19: -0.126"
## [1] "20: -1.538"
## [1] "21: 1.537"
## [1] "22: -1.612"
## [1] "23: -0.004"
## [1] "24: 1.695"
## [1] "25: 0.384"
## [1] "26: 1.131"
## [1] "27: 0.48"
## [1] "28: 0.166"
## [1] "29: -0.465"
## [1] "30: 0.571"
## [1] "31: 0.985"
## [1] "32: -0.69"
## [1] "33: -0.623"
## [1] "34: -0.876"
## [1] "35: 0.659"
## [1] "36: 0.049"
## [1] "37: 2.841"
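Two common ways to keep a while loop from running forever are sketched below. The first updates the variable the condition depends on; the second adds a cap on the number of iterations. The names draw and max_iters, and the cap of 1000, are assumptions made for this example.

```r
# 1) Update the value used in the condition so the loop can end
i <- 0
while (i < 2) {
  print("hello")
  i <- i + 1
}

# 2) Add a safety cap so a random stopping rule cannot run indefinitely
set.seed(1)
draw <- 0
iters <- 0
max_iters <- 1000
while (draw < 1.96 && iters < max_iters) {
  draw <- rnorm(1)
  iters <- iters + 1
}
```

The first loop prints "hello" twice and then stops, because i reaches 2.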

If statements

We can also program into our scripts some decision making logic. If some condition is met, we can make the computer do something different than if the condition is not met. We saw this in earlier tutorials where we chose colors based on specific conditions. Let's have the computer introduce itself to people who have names starting with "b", and ignore everyone else.

for(name in namez){
  
  if(substr(name, 1, 1) == "b"){
    
    print(paste0("Nice to meet you, ", name))
  }
}
## [1] "Nice to meet you, brad"
## [1] "Nice to meet you, bryan"

However, ignoring people isn't nice! What if we want the computer to also talk to people whose names don't start with "b"?

for(name in namez){
  
  if(substr(name, 1, 1) == "b"){
    
    print(paste0("Nice to meet you ", name))
  } else {
    
    print(paste0("hi ", name))
  }
}
## [1] "hi alex"
## [1] "Nice to meet you brad"
## [1] "Nice to meet you bryan"
## [1] "hi adam"

I can also compose multiple conditions, and have one part of code that always executes at the end:

for(name in namez){
  
  if(substr(name, 1, 1) == "b"){
    
    print(paste0("Nice to meet you ", name))
  } else if(name == "adam") {
    
    print(paste0("hi ", name))
  } else {
    
    print(paste0("bye ", name))
  }
  
  print("okay, i am going to meet the next person now")
}
## [1] "bye alex"
## [1] "okay, i am going to meet the next person now"
## [1] "Nice to meet you brad"
## [1] "okay, i am going to meet the next person now"
## [1] "Nice to meet you bryan"
## [1] "okay, i am going to meet the next person now"
## [1] "hi adam"
## [1] "okay, i am going to meet the next person now"

ifelse()

Loops and if statements are great, but they are notoriously "slow" in R. Instead, we have this nifty function called ifelse(). If you are familiar with using IF statements in Excel, this is exactly that, but applied to a whole vector at once. This is enormously helpful for all sorts of things. Calls to ifelse() can also be nested.

ifelse(substr(namez, 1, 1) == "b", 1, 0)
## [1] 0 1 1 0
#if first letter is b, 1, otherwise if last letter is m, 2, otherwise 0.
ifelse(substr(namez, 1, 1) == "b", 1, ifelse(substr(namez, nchar(namez), nchar(namez)) == "m", 2, 0))
## [1] 0 1 1 2

Sometimes loops are necessary, but try to avoid them if you can.
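To make the comparison concrete, here is a small sketch that computes the same 0/1 flag twice, once with a loop and an if statement and once with ifelse(). The variable names flag_loop and flag_vec are made up for this example.

```r
namez <- c("alex", "brad", "bryan", "adam")

# Loop version: check one name at a time
flag_loop <- numeric(length(namez))
for (i in seq_along(namez)) {
  if (substr(namez[i], 1, 1) == "b") {
    flag_loop[i] <- 1
  } else {
    flag_loop[i] <- 0
  }
}

# Vectorized version: one call handles the whole vector
flag_vec <- ifelse(substr(namez, 1, 1) == "b", 1, 0)

identical(flag_loop, flag_vec)  # TRUE
```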

Functions

R has many predefined functions that we can use. However, you can write your own functions too! Let's say you want to write a function that takes the average of a vector. You need to know the sum and the number of observations.

custom_mean <- function(x){
  
  num <- sum(x)
  denom <- length(x)
  
  return(num/denom)
}

my_vec <- c(1, 2, 3, 4, 5, 10)
custom_mean(my_vec)
## [1] 4.166667

Nice! What if there is a missing value?

my_vec <- c(1, 2, 3, 4, 5, 10, NA)
custom_mean(my_vec)
## [1] NA

Uh oh. We need to tell the function what to do with missing values! How might we fix this?

custom_mean <- function(x, handle_na = "warn"){
  
  if(handle_na == "ignore"){
    
    x <- x[!is.na(x)]
  } else if(handle_na == "zero"){
    
    x[is.na(x)] <- 0
  } else if(handle_na == "random"){
    
    x[is.na(x)] <- sample(min(x, na.rm = TRUE):max(x, na.rm = TRUE), sum(is.na(x)), replace = TRUE)
  } else if(handle_na == "warn"){
    
    print("dude, you have to use an argument of ignore, zero or random!")
    return(NA)
  }
  
  num <- sum(x)
  denom <- length(x)
  
  return(num/denom)
}

my_vec <- c(1, 2, 3, 4, 5, 10, NA)
custom_mean(my_vec)
## [1] "dude, you have to use an argument of ignore, zero or random!"
## [1] NA
custom_mean(my_vec, "ignore")
## [1] 4.166667
custom_mean(my_vec, "zero")
## [1] 3.571429
custom_mean(my_vec, "random")
## [1] 4.428571
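One gap in the version above is that an unrecognized option (say, "median") matches none of the branches. A possible variation, sketched here under the hypothetical name custom_mean2 and without the "random" option, uses match.arg() to reject unknown options and only complains when missing values are actually present. This is one way to do it, not the only one.

```r
# custom_mean2 is a hypothetical name, so the function above is left untouched
custom_mean2 <- function(x, handle_na = c("warn", "ignore", "zero")) {

  # match.arg() uses the first choice by default and errors on anything not listed
  handle_na <- match.arg(handle_na)

  if (handle_na == "ignore") {
    x <- x[!is.na(x)]
  } else if (handle_na == "zero") {
    x[is.na(x)] <- 0
  } else if (anyNA(x)) {
    # "warn": only complain when there really are missing values
    warning("x has missing values; use handle_na = \"ignore\" or \"zero\"")
    return(NA)
  }

  sum(x) / length(x)
}

my_vec <- c(1, 2, 3, 4, 5, 10, NA)
custom_mean2(my_vec, "ignore")  # 4.166667
custom_mean2(my_vec, "zero")    # 3.571429
# custom_mean2(my_vec, "median") would stop with an error raised by match.arg()
```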

What if we want to do something crazy? For example, we still want to take an average, but this time we want the harmonic mean (the number of observations divided by the sum of their reciprocals). Further, we want to square all negative values before doing anything else! How might we write something like this?

crazy_func <- function(x){
  
  x1 <- ifelse(x < 0, x*x, x) #square neg
  num <- length(x1)
  denom <- sum(1/x1)
  
  return(num/denom)
}
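A quick check of the function on two made-up inputs, with the arithmetic written out in the comments:

```r
# Same definition as above, repeated so this snippet runs on its own
crazy_func <- function(x){

  x1 <- ifelse(x < 0, x*x, x) #square neg
  num <- length(x1)
  denom <- sum(1/x1)

  return(num/denom)
}

# No negatives: 3 / (1/1 + 1/2 + 1/4) = 3 / 1.75
res_pos <- crazy_func(c(1, 2, 4))    # 1.714286

# -2 is squared to 4 first, so this is the harmonic mean of 4, 4, 4
res_neg <- crazy_func(c(-2, 4, 4))   # 4
```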

You might be noticing two things about this. The first is that I am labeling things within the function differently from my script! This is because I want the function to live by itself, completely independent from the rest of my script. Second, let's talk about scoping. If people tell you that = and <- are the same, they're right 99% of the time. But how terrible would it be to be wrong for that 1% when it might really matter? For the TLDR: use = when passing arguments to a function (ex: custom_mean(x = vec)) and <- when saving variables for good (ex: custom_mean <- function(...)). Here is an illustration:

length(my_vec)
## [1] 7
length(x)
## Error in eval(expr, envir, enclos): object 'x' not found
custom_mean(x = my_vec, "ignore")
## [1] 4.166667
length(x)
## Error in eval(expr, envir, enclos): object 'x' not found
custom_mean(x <- my_vec, "ignore")
## [1] 4.166667
length(x)
## [1] 7

Further, once you return() from the function, everything created inside it is deleted (ex: num and denom).
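A minimal sketch of this behavior, using the first version of custom_mean() and assuming a fresh session in which no objects called num or denom have been created at the top level:

```r
custom_mean <- function(x){

  num <- sum(x)       # exists only while the function is running
  denom <- length(x)  # same here

  return(num/denom)
}

result <- custom_mean(c(1, 2, 3))
result                             # 2

# inherits = FALSE restricts the search to the current environment
exists("num", inherits = FALSE)    # FALSE in a fresh session
exists("denom", inherits = FALSE)  # FALSE in a fresh session
```

Only the returned value survives, and only because it was saved to result.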

Homework

Use this vector: c("alex john cardazzi", "jeremy william barbara", "nate crawford moon", "jordan scott elam", "julius randle")

timeNow <- Sys.time()

## password cracker here

print(Sys.time() - timeNow)
pyramid(the_char = "a", size = 5)
## [1] "a"
## [1] "aa"
## [1] "aaa"
## [1] "aaaa"
## [1] "aaaaa"
## [1] "aaaa"
## [1] "aaa"
## [1] "aa"
## [1] "a"