R provides a powerful and convenient way of indexing vectors. We can, for example, subset a vector based on properties of another vector. In this section, we continue working with our US murders example, which we can load like this:

library(dslabs)
data("murders")

Subsetting with logicals

We have now calculated the murder rate using:

murder_rate <- murders$total / murders$population * 100000 

Imagine you are moving from Italy where, according to an ABC News report, the murder rate is only 0.71 per 100,000. You would prefer to move to a state with a similar murder rate. Another powerful feature of R is that we can use logicals to index vectors. If we compare a vector to a single number, R performs the comparison for each entry. The following is an example related to the question above:

ind <- murder_rate < 0.71

If we instead want to know if a value is less than or equal to 0.71, we can use:

ind <- murder_rate <= 0.71

Note that we get back a logical vector with TRUE for each entry smaller than or equal to 0.71. To see which states these are, we can leverage the fact that vectors can be indexed with logicals.

murders$state[ind]
#> [1] "Hawaii"        "Iowa"          "New Hampshire" "North Dakota" 
#> [5] "Vermont"
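
To see the two steps in isolation, here is a minimal sketch on toy data. The vectors `rate` and `state` are hypothetical and are not taken from the murders dataset; they only illustrate that a comparison is applied to every entry and that the resulting logical vector can be used as an index.

```r
# Hypothetical toy data, not from the murders dataset
rate  <- c(0.5, 2.1, 0.71, 3.4)
state <- c("A", "B", "C", "D")

# Comparing a vector to a single number tests every entry
below <- rate <= 0.71
below
#> [1]  TRUE FALSE  TRUE FALSE

# Indexing with a logical vector keeps the entries where it is TRUE
state[below]
#> [1] "A" "C"
```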

To count how many entries are TRUE, we can use the function sum, which returns the sum of the entries of a vector. Logical vectors get coerced to numeric, with TRUE coded as 1 and FALSE as 0, so we can count the states using:

sum(ind)
#> [1] 5
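
The coercion can be checked directly. The sketch below uses a hypothetical logical vector `flags`. The call to `mean` is an addition not used in the text above; it follows from the same coercion and gives the proportion of TRUE entries.

```r
# Hypothetical logical vector for illustration
flags <- c(TRUE, FALSE, TRUE, TRUE)

# TRUE becomes 1 and FALSE becomes 0 when coerced to numeric
as.numeric(flags)
#> [1] 1 0 1 1

# So the sum counts the TRUE entries
sum(flags)
#> [1] 3

# And the mean is the proportion of TRUE entries
mean(flags)
#> [1] 0.75
```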

Logical operators

Suppose we like the mountains and we want to move to a safe state in the western region of the country. We want the murder rate to be at most 1. In this case, we want two different things to be true. Here we can use the logical operator and, which in R is represented with &. This operation results in TRUE only when both logicals are TRUE. To see this, consider this example:

TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE

For our example, we can form two logicals:

west <- murders$region == "West"
safe <- murder_rate <= 1

and we can use the & to get a vector of logicals that tells us which states satisfy both conditions:

ind <- safe & west
murders$state[ind]
#> [1] "Hawaii"  "Idaho"   "Oregon"  "Utah"    "Wyoming"
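
Like the comparison operators, & works entry by entry when applied to two vectors of the same length. A minimal sketch, assuming hypothetical toy vectors `region` and `rate` rather than the real data:

```r
# Hypothetical toy data, not from the murders dataset
region <- c("West", "South", "West", "West")
rate   <- c(0.8, 0.6, 2.5, 1.0)

west <- region == "West"   # TRUE FALSE TRUE TRUE
safe <- rate <= 1          # TRUE TRUE FALSE TRUE

# Each entry is TRUE only if both conditions hold for that entry
safe & west
#> [1]  TRUE FALSE FALSE  TRUE
```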

`which`

Suppose we want to look up California’s murder rate. For this type of operation, it is convenient to convert vectors of logicals into indexes instead of keeping long vectors of logicals. The function which tells us which entries of a logical vector are TRUE. So we can type:

ind <- which(murders$state == "California")
murder_rate[ind]
#> [1] 3.37
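
The same idea on a small scale, using hypothetical toy vectors `state` and `rate`: which returns the positions of the TRUE entries, so a long logical vector is reduced to a short vector of indexes.

```r
# Hypothetical toy data, not from the murders dataset
state <- c("A", "B", "C", "D")
rate  <- c(0.5, 2.1, 0.71, 3.4)

# Position of the single matching entry
ind <- which(state == "C")
ind
#> [1] 3
rate[ind]
#> [1] 0.71

# which returns every position where the condition is TRUE
which(rate > 1)
#> [1] 2 4
```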

`match`

If instead of just one state we want to find out the murder rates for several states, say New York, Florida, and Texas, we can use the function match. For each entry of a first vector, this function tells us the index at which that entry appears in a second vector:

ind <- match(c("New York", "Florida", "Texas"), murders$state)
ind
#> [1] 33 10 44

Now we can look at the murder rates:

murder_rate[ind]
#> [1] 2.67 3.40 3.20
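
Two properties of match are worth seeing on a small example: the result follows the order of the first vector, and an entry with no match yields NA. The vectors `state` and `rate` below are hypothetical toy data.

```r
# Hypothetical toy data, not from the murders dataset
state <- c("A", "B", "C", "D")
rate  <- c(0.5, 2.1, 0.71, 3.4)

# The indexes come back in the order of the first argument
ind <- match(c("D", "A"), state)
ind
#> [1] 4 1
rate[ind]
#> [1] 3.4 0.5

# An entry that is not found gives NA
match(c("B", "Z"), state)
#> [1]  2 NA
```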

`%in%`

If rather than an index we want a logical that tells us whether or not each element of a first vector is in a second, we can use the operator %in%. Let’s imagine you are not sure if Boston, Dakota, and Washington are states. You can find out like this:

c("Boston", "Dakota", "Washington") %in% murders$state
#> [1] FALSE FALSE  TRUE

Note that we will be using %in% often throughout the book.
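
A minimal sketch with a hypothetical toy vector `state` shows that the result of %in% always has the length of the vector on the left, so swapping the two sides answers a different question.

```r
# Hypothetical toy data, not from the murders dataset
state <- c("A", "B", "C", "D")

# One logical per entry on the left: is it found in state?
c("B", "Z", "D") %in% state
#> [1]  TRUE FALSE  TRUE

# Swapped: one logical per entry of state
state %in% c("B", "Z", "D")
#> [1] FALSE  TRUE FALSE  TRUE
```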

Advanced: There is a connection between match and %in% through which. To see this, notice that the following two lines produce the same indexes, although in a different order:

match(c("New York", "Florida", "Texas"), murders$state)
#> [1] 33 10 44
which(murders$state %in% c("New York", "Florida", "Texas"))
#> [1] 10 33 44
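
The relationship can be checked on toy data. This sketch uses hypothetical vectors `state` and `query`, and it assumes that every entry of `query` appears exactly once in `state`; under that assumption, sorting the result of match gives the same indexes as which combined with %in%.

```r
# Hypothetical toy data; each query entry appears exactly once in state
state <- c("A", "B", "C", "D")
query <- c("D", "A")

# match follows the order of query
match(query, state)
#> [1] 4 1

# which(... %in% ...) follows the order of state
which(state %in% query)
#> [1] 1 4

# After sorting, the two results agree
identical(sort(match(query, state)), which(state %in% query))
#> [1] TRUE
```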