A typical data analysis will often involve one or more conditional
operations. In Section 3.11 we described the ifelse function, which we
will use extensively in this book. In this
section we present two dplyr functions that provide further
functionality for performing conditional operations.
case_when
The case_when function is useful for vectorizing conditional
statements. It is similar to ifelse but can handle any number of
conditions, each with its own output value, as opposed to a single
test with just two outcomes. Here is an example splitting numbers
into negative, positive, and 0:
x <- c(-2, -1, 0, 1, 2)
case_when(x < 0 ~ "Negative",
          x > 0 ~ "Positive",
          TRUE ~ "Zero")
#> [1] "Negative" "Negative" "Zero" "Positive" "Positive"
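To make the relationship with ifelse concrete, the sketch below reproduces the example above with nested ifelse calls and checks that the two results agree. It also illustrates that case_when checks its conditions in order and uses the first one that is TRUE, so the order in which conditions are listed matters. The vector y and the labels "large", "medium", and "small" are made up for illustration and do not come from the text.

```r
library(dplyr)

x <- c(-2, -1, 0, 1, 2)

# case_when version, as in the example above
labels_cw <- case_when(x < 0 ~ "Negative",
                       x > 0 ~ "Positive",
                       TRUE ~ "Zero")

# Equivalent nested ifelse calls: each extra category needs another call
labels_ie <- ifelse(x < 0, "Negative",
                    ifelse(x > 0, "Positive", "Zero"))

identical(labels_cw, labels_ie)
#> [1] TRUE

# Conditions are checked in order; the first TRUE one sets the result.
# y and its labels are illustrative assumptions.
y <- c(5, 50, 500)
sizes <- case_when(y > 100 ~ "large",
                   y > 10 ~ "medium",
                   TRUE ~ "small")
sizes
```

With more than two or three categories, the nested ifelse form becomes hard to read, which is the main practical reason to prefer case_when.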
A common use for this function is to define categorical variables based
on existing variables. For example, suppose we want to compare the
murder rates in four groups of states: New England, West Coast,
South, and other. For each state, we first ask if it is in New
England; if it is not, we ask if it is on the West Coast; if not, we
ask if it is in the South; and if not, we assign other. Here is how we
use case_when to do this:
murders %>%
  mutate(group = case_when(
    abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England",
    abb %in% c("WA", "OR", "CA") ~ "West Coast",
    region == "South" ~ "South",
    TRUE ~ "Other")) %>%
  group_by(group) %>%
  summarize(rate = sum(total) / sum(population) * 10^5)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 2
#>   group        rate
#>   <chr>       <dbl>
#> 1 New England  1.72
#> 2 Other        2.71
#> 3 South        3.63
#> 4 West Coast   2.90
between
A common operation in data analysis is to determine if a value falls
inside an interval. We can check this using conditionals. For example,
to check if the elements of a vector x are between a and b we can
type
x >= a & x <= b
However, this can become cumbersome, especially within the tidyverse
approach. The between function performs the same operation.
between(x, a, b)
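As a minimal sketch of how this fits into a pipeline, the code below first checks that between matches the explicit comparison, including at the endpoints, and then uses it inside filter. The values of x, a, and b, and the data frame dat with its columns id and value, are hypothetical and chosen only for illustration.

```r
library(dplyr)

x <- c(1, 5, 10, 15)
a <- 5
b <- 10

# Both forms give the same logical vector; both endpoints are included
manual <- x >= a & x <= b
with_between <- between(x, a, b)
identical(manual, with_between)
#> [1] TRUE

# Hypothetical data, used only to show between inside filter
dat <- data.frame(id = c("A", "B", "C", "D"), value = x,
                  stringsAsFactors = FALSE)
kept <- dat %>% filter(between(value, a, b))
kept$id
```

Because the interval is closed on both sides, rows with value equal to a or b are kept. If an open or half-open interval is needed, the explicit comparison with < or > is the safer choice.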