Datasets often contain predictors of type character that hold useful information, such as addresses, telephone numbers, or email addresses. Regular expressions allow us to extract that information from strings, for example the postal code from an address.

Consider this toy dataset df_shop about purchasing behaviour of customers.

> df_shop
               description
1   Drew has 3 watermelons
2    Alex has 4 hamburgers
3    Karina has 12 tamales
4 Anna has 6 soft pretzels
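
The printout above can be reproduced with a few lines of R. This is a minimal sketch that assumes df_shop has a single character column named description, as the output suggests. The stringsAsFactors = FALSE argument only matters on R versions before 4.0.

```r
# Recreate the toy dataset (assumed construction; the text only shows the printout)
df_shop <- data.frame(
  description = c("Drew has 3 watermelons",
                  "Alex has 4 hamburgers",
                  "Karina has 12 tamales",
                  "Anna has 6 soft pretzels"),
  stringsAsFactors = FALSE  # keep the column as character on R < 4.0
)
df_shop
```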



grep & grepl

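Suppose we want to know who bought hamburgers. R has two functions to check whether a pattern is present in a character vector:
grep(): returns the indices of the matching elements, or the matching elements themselves if value = TRUE.
grepl(): returns a logical vector with one value per element.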

In this case, grepl() is most appropriate because we get a boolean value for each row, indicating whether the person bought hamburgers.

> grep("hamburger", df_shop$description, value = FALSE)
[1] 2
> grep("hamburger", df_shop$description, value = TRUE)
[1] "Alex has 4 hamburgers"
> grepl("hamburger", df_shop$description)
[1] FALSE  TRUE FALSE FALSE
> df_shop$has_hamburger <- grepl("hamburger", df_shop$description)
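
Because grepl() returns one logical value per row, its result can also be used directly to subset the data frame. The sketch below rebuilds the description column so that it runs on its own. The ignore.case = TRUE argument is an optional extra, not part of the example above, added here so that capitalised spellings would match as well.

```r
# Toy data, rebuilt so the snippet is self-contained
df_shop <- data.frame(
  description = c("Drew has 3 watermelons",
                  "Alex has 4 hamburgers",
                  "Karina has 12 tamales",
                  "Anna has 6 soft pretzels"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose description contains the pattern
bought_hamburger <- grepl("hamburger", df_shop$description, ignore.case = TRUE)

# Keep only the matching rows; drop = FALSE keeps the result a data frame
hamburger_rows <- df_shop[bought_hamburger, , drop = FALSE]
hamburger_rows$description
```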



regexec: extract single value

More interesting would be to extract information from the string. This can be done with the regexec() function.
regexec(): returns starting position and length of match.

Let’s say we want to extract the quantity of the purchase, using the regular expression "\\d+". That is, we are looking for a sequence of one or more digits. Note that the backslash in “\d” must itself be escaped inside an R string, which is why we write "\\d".

Next, we can extract the match with the regmatches() function.
regmatches(): extract match

> idx_regexec <- regexec(pattern = "\\d+", text = "Drew has 3 watermelons")
> match_regexec <- regmatches("Drew has 3 watermelons", idx_regexec)
> match_regexec
[[1]]
[1] "3"
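
To see what regmatches() works with, it can help to inspect the regexec() result directly. The sketch below pulls out the starting position and the "match.length" attribute and then extracts the match by hand with substring(). The variable names start, len and manual are only illustrative.

```r
text <- "Drew has 3 watermelons"
idx_regexec <- regexec(pattern = "\\d+", text = text)

# regexec() returns a list with one element per input string
start <- as.integer(idx_regexec[[1]])                        # starting position of the match
len   <- as.integer(attr(idx_regexec[[1]], "match.length"))  # length of the match

# Manual extraction, which should agree with regmatches()
manual <- substring(text, start, start + len - 1)
manual
```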

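We can apply this procedure to the entire column. Note that regmatches() returns a list with one character vector per row, so we use sapply() to pull the first element out of each. The result is a character vector, not an integer vector.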

> match <- regmatches(df_shop$description, regexec("\\d+", df_shop$description))
> df_shop$quantity <- sapply(match, `[`, 1)
> df_shop$quantity
[1] "3"  "4"  "12" "6" 
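
If the quantity is needed for arithmetic, the extracted strings can be converted with as.integer(). The sketch below also adds a hypothetical fifth description without any digits (not part of the original dataset) to illustrate that a row without a match ends up as NA.

```r
description <- c("Drew has 3 watermelons",
                 "Alex has 4 hamburgers",
                 "Karina has 12 tamales",
                 "Anna has 6 soft pretzels",
                 "Sam has no apples")  # made-up row without a number

match <- regmatches(description, regexec("\\d+", description))

# A row without a match yields character(0); indexing it with [1] gives NA
quantity_chr <- sapply(match, `[`, 1)

# Convert the character matches to integers; NA stays NA
quantity <- as.integer(quantity_chr)
sum(quantity, na.rm = TRUE)
```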



regexec: capture groups

If we want to extract multiple values from the string, we need to use capture groups. Let’s say we want to extract name, quantity and product. For that we can use the regex “(\w+)\s\w+\s(\d+)\s(\w+\s*\w*)”, in which each pair of parentheses marks one capture group. Try to identify how the different word and whitespace classes map to the purchasing string.

> idx_regexec <- regexec("(\\w+)\\s\\w+\\s(\\d+)\\s(\\w+\\s*\\w*)", "Drew has 3 watermelons")
> match <- regmatches("Drew has 3 watermelons", idx_regexec)
> match
[[1]]
[1] "Drew has 3 watermelons" "Drew" "3" "watermelons" 
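
One way to read the pattern is to assemble it piece by piece. The sketch below builds the same regex with paste0() and annotates each part, then tries it on the two-word product from the dataset. The variable name pattern is only illustrative.

```r
pattern <- paste0(
  "(\\w+)",         # group 1: the name, one or more word characters
  "\\s\\w+\\s",     # the verb ("has") between two whitespace characters, not captured
  "(\\d+)",         # group 2: the quantity, one or more digits
  "\\s",            # whitespace between quantity and product
  "(\\w+\\s*\\w*)"  # group 3: the product, one word optionally followed by a second word
)

text <- "Anna has 6 soft pretzels"
match <- regmatches(text, regexec(pattern, text))

# Element 1 is the full match, elements 2 to 4 are the capture groups
match[[1]]
```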

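Similarly, apply this to the entire column. The first element of each match is the full match, so the three capture groups are at positions 2, 3 and 4: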

> idx_regexec <- regexec("(\\w+)\\s\\w+\\s(\\d+)\\s(\\w+\\s*\\w*)", df_shop$description)
> match <- regmatches(df_shop$description, idx_regexec)
> df_shop$name <- sapply(match, `[`, 2)
> df_shop$quantity <- sapply(match, `[`, 3)
> df_shop$product <- sapply(match, `[`, 4)
> df_shop
               description has_hamburger quantity   name       product
1   Drew has 3 watermelons         FALSE        3   Drew   watermelons
2    Alex has 4 hamburgers          TRUE        4   Alex    hamburgers
3    Karina has 12 tamales         FALSE       12 Karina       tamales
4 Anna has 6 soft pretzels         FALSE        6   Anna soft pretzels
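
As an alternative to three separate sapply() calls, the list of matches can be stacked into a character matrix with do.call(rbind, ...). This is only a sketch of one possible approach, and it assumes that every row matches the pattern so that all list elements have the same length. The names m and parsed are illustrative.

```r
description <- c("Drew has 3 watermelons",
                 "Alex has 4 hamburgers",
                 "Karina has 12 tamales",
                 "Anna has 6 soft pretzels")
pattern <- "(\\w+)\\s\\w+\\s(\\d+)\\s(\\w+\\s*\\w*)"

match <- regmatches(description, regexec(pattern, description))

# Each element holds the full match plus 3 groups, so this is a 4 x 4 character matrix
m <- do.call(rbind, match)

# Column 1 is the full match; columns 2 to 4 are the capture groups
parsed <- data.frame(
  name     = m[, 2],
  quantity = as.integer(m[, 3]),
  product  = m[, 4],
  stringsAsFactors = FALSE
)
parsed
```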



Questions

Consider this toy dataframe on housing. Your goal is to extract the name, number of floors, and housing type.

df_house <- data.frame(description=c("Peter lives in a 2-story house", 
                                     "Eva lives in a 3-story appartment" , 
                                     "Justin lives in a 1-story bungalow"))
  1. Try to find a regular expression that matches the first string "Peter lives in a 2-story house". Store the result of the regexec() function in idx_regexec_try. Store the result of the regmatches() function in match_try.
  2. Now, use the regex to handle the entire column description. Store the result of the regexec() function in idx_regexec. Store the result of the regmatches() function in match.
  3. Use the sapply() function to get the information out of the nested list. You should add new columns name, floors, and housing_type (in this order).