Datasets often contain predictors of type character that hold useful information, such as addresses, telephone numbers, or email addresses. Regular expressions allow us to extract that information from strings, for example the postal code from an address.
Consider this toy dataset, df_shop, about the purchasing behaviour of customers.
> df_shop
description
1 Drew has 3 watermelons
2 Alex has 4 hamburgers
3 Karina has 12 tamales
4 Anna has 6 soft pretzels
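To follow along, the dataset can be recreated with a sketch like the one below. The construction call is an assumption, since the text only shows the printed data frame.

```r
# Recreate the toy dataset shown above.
# Assumption: the text only gives the printed output, not this call.
df_shop <- data.frame(
  description = c("Drew has 3 watermelons",
                  "Alex has 4 hamburgers",
                  "Karina has 12 tamales",
                  "Anna has 6 soft pretzels"),
  stringsAsFactors = FALSE  # keep the column as character, not factor
)
print(df_shop)
```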
Suppose we want to know who bought hamburgers. R has two functions to check whether a pattern is present in a character vector, grep() and grepl():
grep(value = FALSE): returns the indices of the matching elements.
grep(value = TRUE): returns the matching elements themselves.
grepl(): returns a logical vector with one entry per element.
In this case, grepl() is most appropriate because we get a logical value for each row, indicating whether the person bought hamburgers.
> grep("hamburger", df_shop$description, value = FALSE)
[1] 2
> grep("hamburger", df_shop$description, value = TRUE)
[1] "Alex has 4 hamburgers"
> grepl("hamburger", df_shop$description)
[1] FALSE TRUE FALSE FALSE
> df_shop$has_hamburger <- grepl("hamburger", df_shop$description)
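A logical vector from grepl() can also be used to filter rows or count matches. The following is a minimal sketch; the variable names is_hamburger, hamburger_rows and n_hamburger are made up for illustration, and the data frame is rebuilt so the snippet runs on its own.

```r
# Toy data, rebuilt here so the snippet is self-contained.
df_shop <- data.frame(
  description = c("Drew has 3 watermelons", "Alex has 4 hamburgers",
                  "Karina has 12 tamales", "Anna has 6 soft pretzels"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row.
is_hamburger <- grepl("hamburger", df_shop$description)

# Keep only the matching rows; drop = FALSE keeps the result a data frame.
hamburger_rows <- df_shop[is_hamburger, , drop = FALSE]

# TRUE counts as 1, so sum() gives the number of matching rows.
n_hamburger <- sum(is_hamburger)
```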
More interesting would be to extract information from the string. This can be done with the regexec() function.
regexec(): returns the starting position and length of the match.
Let’s say we want to extract the quantity of the purchase, using the regular expression "\\d+". That is, we are looking for a run of one or more digits. Note that in an R string literal the backslash itself must be escaped, so the regex \d+ is written as "\\d+".
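The effect of the doubled backslash can be checked directly. This is a small sketch; the test strings passed to grepl() are invented for illustration.

```r
pattern <- "\\d+"

# writeLines() prints the string as the regex engine receives it: \d+
writeLines(pattern)

# The string holds three characters: backslash, d, plus.
pattern_length <- nchar(pattern)

# One or more digits somewhere in the string (example inputs are made up).
has_digits <- grepl(pattern, c("12 tamales", "no digits here"))
```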
Next, we can extract the match with the regmatches() function.
regmatches(): extracts the matched substring.
> idx_regexec <- regexec(pattern = "\\d+", text = "Drew has 3 watermelons")
> match_regexec <- regmatches("Drew has 3 watermelons", idx_regexec)
> match_regexec
[[1]]
[1] "3"
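To see what regexec() actually returns, the position and length can be read off the result and used with substr(). This is a sketch for illustration; regmatches() does the same job more conveniently.

```r
text <- "Drew has 3 watermelons"
idx <- regexec("\\d+", text)

# regexec() returns a list with one element per input string.
start <- idx[[1]][1]                      # position of the first digit
len <- attr(idx[[1]], "match.length")[1]  # number of matched characters

# Cut the match out by hand with the position and length.
manual <- substr(text, start, start + len - 1)
```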
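We can apply this procedure to the entire column. regmatches() returns a list with one element per row, so we use sapply() to pick the first entry of each element. Note that the extracted quantities are still character strings, not integers.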
> match <- regmatches(df_shop$description, regexec("\\d+", df_shop$description))
> df_shop$quantity <- sapply(match, `[`, 1)
> df_shop$quantity
[1] "3" "4" "12" "6"
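Because the extracted values are character strings, a conversion is needed before doing arithmetic with them. A minimal sketch, assuming every row contains a number; the names quantity_chr, quantity_int and total are illustrative.

```r
df_shop <- data.frame(
  description = c("Drew has 3 watermelons", "Alex has 4 hamburgers",
                  "Karina has 12 tamales", "Anna has 6 soft pretzels"),
  stringsAsFactors = FALSE
)

match <- regmatches(df_shop$description, regexec("\\d+", df_shop$description))
quantity_chr <- sapply(match, `[`, 1)  # character vector

# Convert to integer so the values can be summed or compared numerically.
quantity_int <- as.integer(quantity_chr)
total <- sum(quantity_int)
```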
If we want to extract multiple values from the string, we need to use capture groups. Let’s say we want to extract name, quantity and product. For that we can use the regex (\w+)\s\w+\s(\d+)\s(\w+\s*\w*), in which each pair of parentheses captures one value. Try to identify how the different word and whitespace classes map to the purchasing string.
> idx_regexec <- regexec("(\\w+)\\s\\w+\\s(\\d+)\\s(\\w+\\s*\\w*)", "Drew has 3 watermelons")
> match <- regmatches("Drew has 3 watermelons", idx_regexec)
> match
[[1]]
[1] "Drew has 3 watermelons" "Drew" "3" "watermelons"
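The last capture group, (\w+\s*\w*), allows an optional second word. Running the same pattern on a two-word product shows this; the sketch below uses the fourth description from the dataset.

```r
pattern <- "(\\w+)\\s\\w+\\s(\\d+)\\s(\\w+\\s*\\w*)"
text <- "Anna has 6 soft pretzels"

match <- regmatches(text, regexec(pattern, text))

# Element 1 is the full match; elements 2 to 4 are the capture groups.
groups <- match[[1]]
product <- groups[4]
```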
Similarly, apply this to the entire column:
> idx_regexec <- regexec("(\\w+)\\s\\w+\\s(\\d+)\\s(\\w+\\s*\\w*)", df_shop$description)
> match <- regmatches(df_shop$description, idx_regexec)
> df_shop$name <- sapply(match, `[`, 2)
> df_shop$quantity <- sapply(match, `[`, 3)
> df_shop$product <- sapply(match, `[`, 4)
> df_shop
               description has_hamburger quantity   name       product
1   Drew has 3 watermelons         FALSE        3   Drew   watermelons
2    Alex has 4 hamburgers          TRUE        4   Alex    hamburgers
3    Karina has 12 tamales         FALSE       12 Karina       tamales
4 Anna has 6 soft pretzels         FALSE        6   Anna soft pretzels
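One thing worth checking is what happens when a row does not match the pattern. In the sketch below, the second string is a made-up example that contains no quantity: regmatches() returns an empty character vector for it, and indexing with `[` then yields NA.

```r
pattern <- "(\\w+)\\s\\w+\\s(\\d+)\\s(\\w+\\s*\\w*)"

# The second description is hypothetical and does not fit the pattern.
descriptions <- c("Drew has 3 watermelons", "no purchase recorded")

match <- regmatches(descriptions, regexec(pattern, descriptions))

# Number of extracted pieces per row: 4 for a match, 0 for no match.
n_pieces <- lengths(match)

# Indexing an empty vector gives NA, so non-matching rows become NA.
name <- sapply(match, `[`, 2)
```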
Consider this toy dataframe on housing. Your goal is to extract the name, number of floors, and housing type.
df_house <- data.frame(description = c("Peter lives in a 2-story house",
                                       "Eva lives in a 3-story apartment",
                                       "Justin lives in a 1-story bungalow"))
Try your regular expression on the first description, "Peter lives in a 2-story house". Store the result of the regexec() function in idx_regexec_try and the result of the regmatches() function in match_try.
Apply the regular expression to the entire description column. Store the result of the regexec() function in idx_regexec and the result of the regmatches() function in match.
Use the sapply() function to get the information out of the nested list. You should add the new columns name, floors, and housing_type (in this order).