Now that we have mastered some basic R knowledge, let’s try to gain some insights into the safety of different states in the context of gun murders.
sort
Say we want to rank the states from least to most gun murders. The
function sort sorts a vector in increasing order, so the largest
number of gun murders appears last in its output:
library(dslabs)
data(murders)
sort(murders$total)
#> [1] 2 4 5 5 7 8 11 12 12 16 19 21 22
#> [14] 27 32 36 38 53 63 65 67 84 93 93 97 97
#> [27] 99 111 116 118 120 135 142 207 219 232 246 250 286
#> [40] 293 310 321 351 364 376 413 457 517 669 805 1257
However, this does not give us information about which states have which murder totals. For example, we don’t know which state had 1257.
order
The function order is closer to what we want. It takes a vector as
input and returns the vector of indexes that sorts the input vector.
This may sound confusing so let’s look at a simple example. We can
create a vector and sort it:
x <- c(31, 4, 15, 92, 65)
sort(x)
#> [1] 4 15 31 65 92
Rather than sort the input vector, the function order returns the
index that sorts the input vector:
index <- order(x)
x[index]
#> [1] 4 15 31 65 92
This is the same output as that returned by sort(x). If we look at
this index, we see why it works:
x
#> [1] 31 4 15 92 65
order(x)
#> [1] 2 3 1 5 4
The second entry of x is the smallest, so order(x) starts with 2.
The next smallest is the third entry, so the second entry is 3 and so
on.
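As a quick check of the explanation above, the following sketch confirms that indexing x with order(x) reproduces sort(x). It also uses the decreasing argument of order, which the text has not introduced, to obtain the index from largest to smallest.

```r
x <- c(31, 4, 15, 92, 65)

# Indexing x with order(x) gives the same result as sort(x)
index <- order(x)
identical(x[index], sort(x))
#> [1] TRUE

# decreasing = TRUE returns the index from largest to smallest
index_desc <- order(x, decreasing = TRUE)
index_desc
#> [1] 4 5 1 3 2
x[index_desc]
#> [1] 92 65 31 15  4
```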
How does this help us order the states by murders? First, remember that
the entries of vectors you access with $ follow the same order as the
rows in the table. For example, these two vectors containing state names
and abbreviations, respectively, are matched by their order:
murders$state[1:6]
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
#> [6] "Colorado"
murders$abb[1:6]
#> [1] "AL" "AK" "AZ" "AR" "CA" "CO"
This means we can order the state abbreviations by their total murders. We first obtain the index that orders the murder totals and then use it to index the vector of abbreviations:
ind <- order(murders$total)
murders$abb[ind]
#> [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT"
#> [14] "WV" "NE" "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI"
#> [27] "DC" "OK" "KY" "MA" "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC"
#> [40] "MD" "OH" "MO" "LA" "IL" "GA" "MI" "PA" "NY" "FL" "TX" "CA"
According to the above, California had the most murders.
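The same index can reorder any column, or the whole table, because all columns share the row order. The sketch below uses a small hand-built data frame called toy (a hypothetical stand-in for murders, holding only four of the totals shown above) so that it runs without the dslabs package.

```r
# toy is an illustrative four-row stand-in for the murders table
toy <- data.frame(
  abb = c("CA", "VT", "TX", "FL"),
  total = c(1257, 2, 805, 669),
  stringsAsFactors = FALSE
)

# Index that orders the totals from most to least
ind <- order(toy$total, decreasing = TRUE)

# Reorder one column ...
toy$abb[ind]
#> [1] "CA" "TX" "FL" "VT"

# ... or every column at once by indexing the rows
toy_sorted <- toy[ind, ]
```

Indexing rows with toy[ind, ] keeps each abbreviation next to its own total, which is usually safer than reordering the columns one at a time.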
max and which.max
If we are only interested in the entry with the largest value, we can
use max for the value:
max(murders$total)
#> [1] 1257
and which.max for the index of the largest value:
i_max <- which.max(murders$total)
murders$state[i_max]
#> [1] "California"
For the minimum, we can use min and which.min in the same way.
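A minimal sketch of the minimum case follows, using two short illustrative vectors (state and total are stand-ins, not the full dataset). It also shows one detail worth keeping in mind: which.max and which.min return only the first position when the extreme value is tied, so the tied example uses which to recover all positions.

```r
# Illustrative stand-ins for murders$state and murders$total
state <- c("California", "Vermont", "Texas")
total <- c(1257, 2, 805)

min(total)
#> [1] 2
i_min <- which.min(total)
state[i_min]
#> [1] "Vermont"

# With ties, which.max reports only the first position
tied <- c(5, 9, 9, 1)
which.max(tied)
#> [1] 2
which(tied == max(tied))
#> [1] 2 3
```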
Does this mean California is the most dangerous state? In an upcoming
section, we argue that we should be considering rates instead of totals.
Before doing that, we introduce one last order-related function: rank.
rank
Although not as frequently used as order and sort, the function
rank is also related to order and can be useful. For any given vector
it returns a vector with the rank of the first entry, second entry,
etc., of the input vector. Here is a simple example:
x <- c(31, 4, 15, 92, 65)
rank(x)
#> [1] 3 1 2 5 4
To summarize, let’s look at the results of the three functions we have introduced:
| original | sort | order | rank |
|---|---|---|---|
| 31 | 4 | 2 | 3 |
| 4 | 15 | 3 | 1 |
| 15 | 31 | 1 | 2 |
| 92 | 65 | 5 | 5 |
| 65 | 92 | 4 | 4 |
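The summary table can be rebuilt directly in R, as sketched below. The sketch also illustrates how rank relates to order: when the vector has no ties, applying order twice gives the ranks. The name summary_tab is just an illustrative choice.

```r
x <- c(31, 4, 15, 92, 65)

# Rebuild the summary table column by column
summary_tab <- data.frame(
  original = x,
  sort = sort(x),
  order = order(x),
  rank = rank(x)
)

# Without ties, the order of the order is the rank
all(rank(x) == order(order(x)))
#> [1] TRUE

# Ranking from largest to smallest: rank the negated values
rank(-x)
#> [1] 3 5 4 1 2
```

With tied values, rank assigns averaged ranks by default, so the equality with order(order(x)) holds only when all entries are distinct.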
Another common source of unnoticed errors in R is the use of recycling. We saw that vectors are added elementwise. So if the vectors don’t match in length, it is natural to assume that we should get an error. But we don’t. Notice what happens:
x <- c(1,2,3)
y <- c(10, 20, 30, 40, 50, 60, 70)
x+y
#> Warning in x + y: longer object length is not a multiple of shorter
#> object length
#> [1] 11 22 33 41 52 63 71
We do get a warning, but no error. For the output, R has recycled the
numbers in x: the last digits of the output (1, 2, 3, 1, 2, 3, 1) show
x being reused in a cycle.
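Recycling is harder to spot when the longer length is an exact multiple of the shorter one, because R then gives no warning at all. The sketch below shows that case and one possible guard. The helper add_same_length is a hypothetical function written for this example, not part of R.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30, 40, 50, 60)

# 6 is a multiple of 3, so x is recycled silently: no warning
x + y
#> [1] 11 22 33 41 52 63

# Hypothetical helper: stop with an error unless lengths match
add_same_length <- function(a, b) {
  stopifnot(length(a) == length(b))
  a + b
}

add_same_length(c(1, 2, 3), c(10, 20, 30))
#> [1] 11 22 33
```

Calling add_same_length(x, y) with the vectors above stops with an error instead of recycling.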