R provides a powerful and convenient way of indexing vectors. We can, for example, subset a vector based on properties of another vector. In this section, we continue working with our US murders example, which we can load like this:
library(dslabs)
data("murders")
We previously computed the murder rate per 100,000 people using:
murder_rate <- murders$total / murders$population * 100000
Imagine you are moving from Italy where, according to an ABC news report, the murder rate is only 0.71 per 100,000. You would prefer to move to a state with a similar murder rate. Another powerful feature of R is that we can use logicals to index vectors. If we compare a vector to a single number, R performs the comparison for each entry and returns a logical vector. The following is an example related to the question above:
ind <- murder_rate < 0.71
If we instead want to know whether a value is less than or equal to 0.71, we can use:
ind <- murder_rate <= 0.71
Note that we get back a logical vector with TRUE for each entry
smaller than or equal to 0.71. To see which states these are, we can
leverage the fact that vectors can be indexed with logicals.
murders$state[ind]
#> [1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
#> [5] "Vermont"
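To make the mechanics easier to see, here is a minimal sketch on a small made-up example. The vectors state and rate below are hypothetical toy data, not part of the murders dataset.

```r
# Hypothetical toy data (not the murders dataset)
state <- c("A", "B", "C", "D")
rate <- c(0.5, 2.1, 0.71, 3.0)

# The comparison is applied to every entry and returns a logical vector
ind <- rate <= 0.71
ind
#> [1]  TRUE FALSE  TRUE FALSE

# Indexing with a logical vector keeps only the entries where ind is TRUE
state[ind]
#> [1] "A" "C"
```

The logical vector must line up entry by entry with the vector being indexed, which is why ind and state have the same length here.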
To count how many entries are TRUE, we can use the function sum, which
returns the sum of the entries of a vector. Logical vectors get coerced
to numeric, with TRUE coded as 1 and FALSE as 0. Thus we can count the
states using:
sum(ind)
#> [1] 5
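The coercion can be checked directly. The following is a small sketch using a hypothetical logical vector; it also shows that mean, applied to a logical vector, gives the proportion of TRUE entries.

```r
# Hypothetical logical vector for illustration
ind <- c(TRUE, FALSE, TRUE, FALSE)

# Explicit coercion: TRUE becomes 1 and FALSE becomes 0
as.numeric(ind)
#> [1] 1 0 1 0

# sum counts the TRUE entries
sum(ind)
#> [1] 2

# mean gives the proportion of TRUE entries
mean(ind)
#> [1] 0.5
```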
Suppose we like the mountains and we want to move to a safe state in the
western region of the country. We want the murder rate to be at most 1.
In this case, we want two different things to be true. Here we can use
the logical operator and, which in R is represented with &. This
operation results in TRUE only when both logicals are TRUE. To see
this, consider this example:
TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE
For our example, we can form two logicals:
west <- murders$region == "West"
safe <- murder_rate <= 1
and we can use & to get a vector of logicals that tells us which
states satisfy both conditions:
ind <- safe & west
murders$state[ind]
#> [1] "Hawaii" "Idaho" "Oregon" "Utah" "Wyoming"
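Because & works entry by entry, the result has one logical per position. A minimal sketch on hypothetical toy vectors (region and rate below are made up for illustration):

```r
# Hypothetical toy data (not the murders dataset)
region <- c("West", "South", "West", "West")
rate <- c(0.8, 0.6, 2.5, 1.0)

west <- region == "West"   # TRUE FALSE TRUE TRUE
safe <- rate <= 1          # TRUE TRUE FALSE TRUE

# TRUE only in positions where both conditions hold
both <- west & safe
both
#> [1]  TRUE FALSE FALSE  TRUE
```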
which
Suppose we want to look up California’s murder rate. For this type of
operation, it is convenient to convert vectors of logicals into indexes
instead of keeping long vectors of logicals. The function which tells
us which entries of a logical vector are TRUE. So we can type:
ind <- which(murders$state == "California")
murder_rate[ind]
#> [1] 3.37
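As a small sketch of what which returns, consider the hypothetical toy vectors below (state and rate are made up for illustration). Note that when no entry is TRUE, which returns an empty integer vector.

```r
# Hypothetical toy data (not the murders dataset)
state <- c("Alpha", "Beta", "Gamma")
rate <- c(1.2, 3.4, 0.9)

# Position of the single TRUE entry
ind <- which(state == "Beta")
ind
#> [1] 2

rate[ind]
#> [1] 3.4

# No entry matches, so the result has length 0
none <- which(state == "Delta")
length(none)
#> [1] 0
```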
match
If instead of just one state we want to find out the murder rates for
several states, say New York, Florida, and Texas, we can use the
function match. This function tells us which indexes of a second
vector match each of the entries of a first vector:
ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
#> [1] 33 10 44
Now we can look at the murder rates:
murder_rate[ind]
#> [1] 2.67 3.40 3.20
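One detail worth knowing is what happens when an entry has no match: match returns NA in that position. The sketch below uses hypothetical toy vectors (state and rate are made up for illustration).

```r
# Hypothetical toy data (not the murders dataset)
state <- c("Alpha", "Beta", "Gamma")
rate <- c(1.2, 3.4, 0.9)

# Results follow the order of the first argument; "Delta" is not found
ind <- match(c("Gamma", "Delta", "Alpha"), state)
ind
#> [1]  3 NA  1

# Indexing with NA yields NA in that position
rate[ind]
#> [1] 0.9  NA 1.2
```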
%in%
If rather than an index we want a logical that tells us whether or not
each element of a first vector is in a second, we can use the function
%in%. Let’s imagine you are not sure if Boston, Dakota, and Washington
are states. You can find out like this:
c("Boston", "Dakota", "Washington") %in% murders$state
#> [1] FALSE FALSE TRUE
Note that we will be using %in% often throughout the book.
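A common use of %in% is to build a logical index for subsetting. The following is a minimal sketch on hypothetical toy vectors (state and rate are made up for illustration):

```r
# Hypothetical toy data (not the murders dataset)
state <- c("Alpha", "Beta", "Gamma")
rate <- c(1.2, 3.4, 0.9)

# One logical per entry of state: is it among the names on the right?
keep <- state %in% c("Gamma", "Alpha", "Omega")
keep
#> [1]  TRUE FALSE  TRUE

# Names with no counterpart ("Omega") are simply ignored
rate[keep]
#> [1] 1.2 0.9
```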
Advanced: There is a connection between match and %in% through
which. To see this, notice that the following two lines produce the
same index (although in different order):
match(c("New York", "Florida", "Texas"), murders$state)
#> [1] 33 10 44
which(murders$state %in% c("New York", "Florida", "Texas"))
#> [1] 10 33 44
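The relationship can be checked on a small example. The sketch below uses hypothetical toy vectors (state and query are made up for illustration) and assumes every entry of query appears exactly once in state; with unmatched or duplicated names, the two results would no longer agree after sorting.

```r
# Hypothetical toy data (not the murders dataset)
state <- c("Alpha", "Beta", "Gamma", "Delta")
query <- c("Gamma", "Alpha")

# match follows the order of query
m <- match(query, state)
m
#> [1] 3 1

# which(... %in% ...) follows the order of state
w <- which(state %in% query)
w
#> [1] 1 3

# Same indexes once the order is ignored
identical(sort(m), w)
#> [1] TRUE
```

Which form to prefer depends on whether the results should follow the order of the names you asked for (match) or the order of the data (which with %in%).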