When examining a dataset, it is often convenient to sort the table by
the different columns. We know about the order and sort functions, but
for ordering entire tables, the dplyr function arrange is useful. For
example, here we order the states by population size:
murders %>%
arrange(population) %>%
head()
#>                  state abb        region population total   rate
#> 1              Wyoming  WY          West     563626     5  0.887
#> 2 District of Columbia  DC         South     601723    99 16.453
#> 3              Vermont  VT     Northeast     625741     2  0.320
#> 4         North Dakota  ND North Central     672591     4  0.595
#> 5               Alaska  AK          West     710231    19  2.675
#> 6         South Dakota  SD North Central     814180     8  0.983
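To connect arrange with the order function we already know, here is a minimal sketch. It assumes the dplyr package is installed and uses a small hand-built table, murders_small, which is a hypothetical name; its values are copied from the output above.

```r
library(dplyr)

# Hypothetical toy table; values copied from the output above
murders_small <- data.frame(
  state = c("Vermont", "Wyoming", "Alaska"),
  population = c(625741, 563626, 710231),
  rate = c(0.320, 0.887, 2.675),
  stringsAsFactors = FALSE
)

# dplyr: sort the whole table by the population column
by_arrange <- murders_small %>% arrange(population)

# base R: order() returns the row indices that sort the column,
# and we use them to reorder the rows ourselves
by_order <- murders_small[order(murders_small$population), ]

by_arrange$state
by_order$state
```

Both approaches put the rows in the same order. With arrange we do not need to repeat the table name or index the rows by hand.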
With arrange we get to decide which column to sort by. To see the
states by murder rate, from lowest to highest, we arrange by rate
instead:
murders %>%
arrange(rate) %>%
head()
#>           state abb        region population total  rate
#> 1       Vermont  VT     Northeast     625741     2 0.320
#> 2 New Hampshire  NH     Northeast    1316470     5 0.380
#> 3        Hawaii  HI          West    1360301     7 0.515
#> 4  North Dakota  ND North Central     672591     4 0.595
#> 5          Iowa  IA North Central    3046355    21 0.689
#> 6         Idaho  ID          West    1567582    12 0.766
Note that the default behavior is to sort in ascending order. In
dplyr, the function desc transforms a vector so that it is in
descending order. To sort the table in descending order, we can type:
murders %>%
arrange(desc(rate))
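Since the code above is shown without its output, here is a small sketch of what desc does to the sort. The table rates_small is a hypothetical toy example, with rates copied from the outputs in this section, and we assume dplyr is installed.

```r
library(dplyr)

# Hypothetical toy table; rates copied from the outputs in this section
rates_small <- data.frame(
  state = c("Vermont", "District of Columbia", "Louisiana"),
  rate = c(0.320, 16.453, 7.74),
  stringsAsFactors = FALSE
)

# Ascending order is the default
asc <- rates_small %>% arrange(rate)

# Wrapping the column in desc() reverses the sort
dsc <- rates_small %>% arrange(desc(rate))

asc$state
dsc$state
```

In this toy table, the descending version lists the District of Columbia first and Vermont last, the reverse of the ascending version.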
If we are ordering by a column with ties, we can use a second column to
break the tie. Similarly, a third column can be used to break ties
that remain after the first and second, and so on. Here we order by
region, then within region we order by murder rate:
murders %>%
arrange(region, rate) %>%
head()
#>           state abb    region population total  rate
#> 1       Vermont  VT Northeast     625741     2 0.320
#> 2 New Hampshire  NH Northeast    1316470     5 0.380
#> 3         Maine  ME Northeast    1328361    11 0.828
#> 4  Rhode Island  RI Northeast    1052567    16 1.520
#> 5 Massachusetts  MA Northeast    6547629   118 1.802
#> 6      New York  NY Northeast   19378102   517 2.668
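The tie-breaking logic can be sketched on a toy table in which each region appears twice. The name regions_small is hypothetical and the values are copied from the outputs in this section. One assumption to keep in mind: here region is stored as a character vector, so it is sorted alphabetically, while a factor column would be sorted by the order of its levels.

```r
library(dplyr)

# Hypothetical toy table; values copied from the outputs in this section
regions_small <- data.frame(
  state = c("Wyoming", "Vermont", "Hawaii", "New Hampshire"),
  region = c("West", "Northeast", "West", "Northeast"),
  rate = c(0.887, 0.320, 0.515, 0.380),
  stringsAsFactors = FALSE
)

# Sort by region first; rate only decides the order within a region
nested <- regions_small %>% arrange(region, rate)

# desc() can be applied to one column without affecting the others
nested_desc <- regions_small %>% arrange(region, desc(rate))

nested$state
nested_desc$state
```

In both results the Northeast states come before the West states. Only the order within each region changes between the two calls.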
In the code above, we have used the function head to avoid having the
page fill up with the entire dataset. If we want to see a larger
proportion, we can use the top_n function. This function takes a data
frame as its first argument, the number of rows to show in the second,
and the variable to filter by in the third. Here is an example of how to
see the top 5 rows:
murders %>% top_n(5, rate)
#>                  state abb        region population total  rate
#> 1 District of Columbia  DC         South     601723    99 16.45
#> 2            Louisiana  LA         South    4533372   351  7.74
#> 3             Maryland  MD         South    5773552   293  5.07
#> 4             Missouri  MO North Central    5988927   321  5.36
#> 5       South Carolina  SC         South    4625364   207  4.48
Note that rows are not sorted by rate, only filtered. If we want to
sort, we need to use arrange. Note that if the third argument is left
blank, top_n filters by the last column.
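To see the difference between filtering and sorting, here is a minimal sketch on a hypothetical toy table, top_small, with rates copied from the outputs in this section. It also shows slice_max, which the dplyr documentation recommends now that top_n is marked as superseded; we assume dplyr 1.0.0 or later so that slice_max is available.

```r
library(dplyr)

# Hypothetical toy table; rates copied from the outputs in this section
top_small <- data.frame(
  state = c("Louisiana", "Vermont", "District of Columbia", "Maryland"),
  rate = c(7.74, 0.320, 16.45, 5.07),
  stringsAsFactors = FALSE
)

# top_n() only filters: the two rows keep their original order
filtered <- top_small %>% top_n(2, rate)

# Adding arrange() sorts the filtered rows from highest to lowest rate
filtered_sorted <- top_small %>%
  top_n(2, rate) %>%
  arrange(desc(rate))

# slice_max() selects the same two rows (assumes dplyr >= 1.0.0)
via_slice <- top_small %>% slice_max(rate, n = 2)

filtered$state
filtered_sorted$state
via_slice$state
```

In the filtered result, Louisiana still appears before the District of Columbia because that is the order in the original table. Only after arrange does the highest rate come first.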