Now that we have mastered some basic R knowledge, let’s try to gain some insights into the safety of different states in the context of gun murders.
sort
Say we want to rank the states from least to most gun murders. The function sort sorts a vector in increasing order. We can therefore see the largest number of gun murders by typing:
library(dslabs)
data(murders)
sort(murders$total)
#> [1] 2 4 5 5 7 8 11 12 12 16 19 21 22
#> [14] 27 32 36 38 53 63 65 67 84 93 93 97 97
#> [27] 99 111 116 118 120 135 142 207 219 232 246 250 286
#> [40] 293 310 321 351 364 376 413 457 517 669 805 1257
However, this does not give us information about which states have which murder totals. For example, we don’t know which state had 1257.
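As a brief aside, sort can also return the values from largest to smallest through its decreasing argument, which puts the maximum first. The sketch below uses a small toy vector rather than the murders data so that it runs without any packages.

```r
# A small toy vector, so the example runs without the dslabs package
x <- c(31, 4, 15, 92, 65)

# decreasing = TRUE is an argument of base R's sort()
x_desc <- sort(x, decreasing = TRUE)
x_desc
#> [1] 92 65 31 15  4

# The largest value is now the first entry
x_desc[1]
#> [1] 92
```

This still only returns the values themselves, so it does not tell us which state each total belongs to.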
order
The function order is closer to what we want. It takes a vector as input and returns the vector of indexes that sorts the input vector.
This may sound confusing so let’s look at a simple example. We can
create a vector and sort it:
x <- c(31, 4, 15, 92, 65)
sort(x)
#> [1] 4 15 31 65 92
Rather than sort the input vector, the function order returns the index that sorts the input vector:
index <- order(x)
x[index]
#> [1] 4 15 31 65 92
This is the same output as that returned by sort(x). If we look at this index, we see why it works:
x
#> [1] 31 4 15 92 65
order(x)
#> [1] 2 3 1 5 4
The second entry of x is the smallest, so order(x) starts with 2. The next smallest is the third entry, so the second entry is 3, and so on.
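To make the idea concrete, here is a minimal sketch that checks the equivalence directly and then applies the same index to a second vector. The labels vector is made up purely for illustration and is not part of the murders data.

```r
x <- c(31, 4, 15, 92, 65)
index <- order(x)

# Indexing x with order(x) reproduces sort(x)
identical(x[index], sort(x))
#> [1] TRUE

# The same index can reorder any other vector of the same length.
# `labels` is a hypothetical companion vector used only for this example.
labels <- c("a", "b", "c", "d", "e")
labels[index]
#> [1] "b" "c" "a" "e" "d"
```

The label "b" comes first because it sits in the same position as 4, the smallest entry of x.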
How does this help us order the states by murders? First, remember that the entries of vectors you access with $ follow the same order as the rows in the table. For example, these two vectors containing state names and abbreviations, respectively, are matched by their order:
murders$state[1:6]
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
#> [6] "Colorado"
murders$abb[1:6]
#> [1] "AL" "AK" "AZ" "AR" "CA" "CO"
This means we can order the state names by their total murders. We first obtain the index that orders the vectors according to murder totals and then index the state names vector:
ind <- order(murders$total)
murders$abb[ind]
#> [1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT"
#> [14] "WV" "NE" "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI"
#> [27] "DC" "OK" "KY" "MA" "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC"
#> [40] "MD" "OH" "MO" "LA" "IL" "GA" "MI" "PA" "NY" "FL" "TX" "CA"
According to the above, California had the most murders.
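If we would rather see the states with the most murders first, order accepts the same decreasing argument as sort. The sketch below uses a five-state subset so that it runs on its own; the totals are read off the two sorted outputs above by matching positions, and the vector names abb and total are chosen here for illustration.

```r
# Five-state subset; each total is matched to its abbreviation by position
# in the sorted outputs shown earlier in this section
abb <- c("TX", "VT", "CA", "ND", "FL")
total <- c(805, 2, 1257, 4, 669)

# Index that orders the totals from largest to smallest
ind <- order(total, decreasing = TRUE)
abb[ind]
#> [1] "CA" "TX" "FL" "ND" "VT"
```

With the full table, the analogous call would be murders$abb[order(murders$total, decreasing = TRUE)].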
max and which.max
If we are only interested in the entry with the largest value, we can use max for the value:
max(murders$total)
#> [1] 1257
and which.max for the index of the largest value:
i_max <- which.max(murders$total)
murders$state[i_max]
#> [1] "California"
For the minimum, we can use min and which.min in the same way.
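As a minimal sketch of the minimum case, the example below uses a five-state subset whose totals are taken from the sorted outputs earlier in this section; the vector names abb and total are chosen here for illustration.

```r
# Five-state subset of the murders data, entered by hand
abb <- c("TX", "VT", "CA", "ND", "FL")
total <- c(805, 2, 1257, 4, 669)

# Smallest value
min(total)
#> [1] 2

# Position of the smallest value, then the matching abbreviation
i_min <- which.min(total)
abb[i_min]
#> [1] "VT"
```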
Does this mean California is the most dangerous state? In an upcoming
section, we argue that we should be considering rates instead of totals.
Before doing that, we introduce one last order-related function: rank.
rank
Although not as frequently used as order and sort, the function rank is also related to order and can be useful. For any given vector it returns a vector with the rank of the first entry, second entry, etc., of the input vector. Here is a simple example:
x <- c(31, 4, 15, 92, 65)
rank(x)
#> [1] 3 1 2 5 4
To summarize, let’s look at the results of the three functions we have introduced:
original | sort | order | rank |
---|---|---|---|
31 | 4 | 2 | 3 |
4 | 15 | 3 | 1 |
15 | 31 | 1 | 2 |
92 | 65 | 5 | 5 |
65 | 92 | 4 | 4 |
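The table above can be rebuilt in one step by placing the three results side by side in a data frame. This is only a sketch, and the name tab is chosen here for illustration.

```r
x <- c(31, 4, 15, 92, 65)

# One column per function, all computed from the same input vector
tab <- data.frame(original = x,
                  sort = sort(x),
                  order = order(x),
                  rank = rank(x))
tab
```

Reading across a row shows the difference between the last two columns: order reports where the sorted values come from in x, while rank reports where each original value would land after sorting.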
Another common source of unnoticed errors in R is the use of recycling. We saw that vectors are added elementwise. So if the vectors don’t match in length, it is natural to assume that we should get an error. But we don’t. Notice what happens:
x <- c(1,2,3)
y <- c(10, 20, 30, 40, 50, 60, 70)
x+y
#> Warning in x + y: longer object length is not a multiple of shorter
#> object length
#> [1] 11 22 33 41 52 63 71
We do get a warning, but no error. For the output, R has recycled the numbers in x. Notice the last digit of each number in the output.
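Recycling is harder to spot when the longer length is an exact multiple of the shorter one, because R then gives no warning at all. The sketch below shows that case, followed by a simple length check; the check is one possible safeguard suggested here, not something the text prescribes.

```r
x <- c(1, 2)
y <- c(10, 20, 30, 40)

# x is recycled as 1, 2, 1, 2 and no warning is issued
z <- x + y
z
#> [1] 11 22 31 42

# Comparing lengths before combining vectors reveals the mismatch
length(x) == length(y)
#> [1] FALSE
```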