An important part of exploratory data analysis is summarizing data. The
average and standard deviation are two examples of widely used summary
statistics. More informative summaries can often be achieved by first
splitting data into groups. In this section, we cover two new dplyr
verbs that make these computations easier: summarize and group_by.
We learn to access resulting values using the pull function.
summarize
The summarize function in dplyr provides a way to compute summary
statistics with intuitive and readable code. We start with a simple
example based on heights. The heights dataset includes heights and sex
reported by students in an in-class survey.
library(dplyr)
library(dslabs)
data(heights)
The following code computes the average and standard deviation for females:
s <- heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height), standard_deviation = sd(height))
s
#> average standard_deviation
#> 1 64.9 3.76
This takes our original data table as input, filters it to keep only
females, and then produces a new summarized table with just the average
and the standard deviation of heights. We get to choose the names of the
columns of the resulting table. For example, above we decided to use
average and standard_deviation, but we could have used other names
just the same.
Because the resulting table stored in s is a data frame, we can access
the components with the accessor $:
s$average
#> [1] 64.9
s$standard_deviation
#> [1] 3.76
As with most other dplyr functions, summarize is aware of the variable
names and we can use them directly. So when inside the call to the
summarize function we write mean(height), the function is accessing the
column with the name “height” and then computing the average of the
resulting numeric vector. We can compute any other summary that operates
on vectors and returns a single value. For example, we can add the
median, minimum, and maximum heights like this:
heights %>%
  filter(sex == "Female") %>%
  summarize(median = median(height), minimum = min(height),
            maximum = max(height))
#> median minimum maximum
#> 1 65 51 79
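To illustrate that any function mapping a vector to a single value can be used inside summarize, here is a minimal sketch. The toy_heights data frame is made up for illustration and is not the dslabs heights data; n and IQR are standard dplyr and base R functions.

```r
library(dplyr)

# Toy data, made up for illustration (not the dslabs heights data)
toy_heights <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 70, 74)
)

# Each expression below reduces the height vector to one value
toy_summary <- toy_heights %>%
  filter(sex == "Female") %>%
  summarize(n = n(),                             # rows kept by filter
            iqr = IQR(height),                   # interquartile range
            spread = max(height) - min(height))  # custom expression
toy_summary
```

The result is still a one-row data frame, with one column per summary.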
We can obtain these three values with just one line using the quantile
function: for example, quantile(x, c(0, 0.5, 1)) returns the min (0th
percentile), median (50th percentile), and max (100th percentile) of the
vector x. However, if we attempt to use this function inside summarize,
we get the following:
heights %>%
  filter(sex == "Female") %>%
  summarize(range = quantile(height, c(0, 0.5, 1)))
#> range
#> 1 51.00000
#> 2 64.98031
#> 3 79.00000
All the values are in one column. If we want the same result as above, we have to add a column that contains the name of each statistic and then transform the data frame from long format to wide format with pivot_wider, a function from the tidyr package.
library(tidyr)  # provides pivot_wider
heights %>%
  filter(sex == "Female") %>%
  summarize(range = quantile(height, c(0, 0.5, 1)),
            statistic = c("min", "median", "max")) %>%
  pivot_wider(names_from = statistic, values_from = range)
In Section 4.8 we will learn how we can do the above with the do function.
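Another way to end up with one row and three columns is to call quantile once per output column, so that each expression returns a single value. The sketch below uses a made-up toy_heights data frame rather than the dslabs data. As a side note, dplyr 1.1.0 and later warn when summarize returns more than one row per group and provide reframe for that case; the approach here avoids the issue altogether.

```r
library(dplyr)

# Toy data, made up for illustration (not the dslabs heights data)
toy_heights <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 70, 74)
)

# One quantile() call per column; names = FALSE drops the "0%"-style labels
toy_range <- toy_heights %>%
  filter(sex == "Female") %>%
  summarize(min = quantile(height, 0, names = FALSE),
            median = quantile(height, 0.5, names = FALSE),
            max = quantile(height, 1, names = FALSE))
toy_range
```

This is more verbose than a single quantile call, but it needs no reshaping step.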
For another example of how we can use the summarize function, let’s
compute the average murder rate for the United States. Remember our data
table includes total murders and population size for each state and we
have already used dplyr to add a murder rate column:
data(murders)  # murders data table from dslabs
murders <- murders %>% mutate(rate = total / population * 100000)
Remember that the US murder rate is not the average of the state murder rates:
summarize(murders, mean(rate))
#> mean(rate)
#> 1 2.78
This is because, in the computation above, small states are given the same weight as large ones. The US murder rate is the total number of murders in the US divided by the total US population. So the correct computation is:
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
#> rate
#> 1 3.03
This computation weights each state in proportion to its population, which results in a larger value.
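The difference between the two computations can be checked on a small example. The sketch below uses a made-up two-state table (toy_states is hypothetical, not the murders data) and shows that the pooled rate equals the population-weighted mean of the state rates, while the plain mean does not.

```r
library(dplyr)

# Toy data with made-up numbers: one small state and one large state
toy_states <- data.frame(
  state = c("A", "B"),
  total = c(10, 300),
  population = c(1e6, 1e7)
) %>%
  mutate(rate = total / population * 100000)  # rates are 1 and 3

toy_rates <- toy_states %>%
  summarize(unweighted = mean(rate),                          # treats states equally
            weighted = weighted.mean(rate, w = population),   # weights by population
            pooled = sum(total) / sum(population) * 100000)   # total / total
toy_rates
```

In this toy table the pooled rate exceeds the unweighted mean because the larger state has the higher rate; with the opposite pattern it would be smaller.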
pull
Note: The video at the top of the page uses the dot operator instead of the pull function. The two approaches work very similarly, and we will introduce the dot operator in a later section.
The us_murder_rate object defined above represents just one number.
Yet we are storing it in a data frame:
class(us_murder_rate)
#> [1] "data.frame"
since, like most dplyr functions, summarize always returns a data frame.
This might be problematic if we want to use this result with functions
that require a numeric value. Here we show a useful trick for accessing
values stored in data frames when using pipes: when a data frame is
piped, its columns can be accessed using the pull function. To
understand what we mean, take a look at this line of code:
us_murder_rate %>% pull(rate)
#> [1] 3.03
This returns the value in the rate column of us_murder_rate, making
it equivalent to us_murder_rate$rate.
To get a number from the original data table with one line of code we can type:
us_murder_rate <- murders %>%
  summarize(rate = sum(total) / sum(population) * 100000) %>%
  pull(rate)
us_murder_rate
#> [1] 3.03
which is now a numeric:
class(us_murder_rate)
#> [1] "numeric"
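As a quick check of the equivalence described above, the following sketch compares pull with the two base R accessors on a made-up one-row data frame (toy_rate is hypothetical and simply reuses the value printed earlier).

```r
library(dplyr)

# Made-up one-row data frame standing in for the summarize() result
toy_rate <- data.frame(rate = 3.03)

by_pull <- toy_rate %>% pull(rate)   # pipe-friendly accessor
by_dollar <- toy_rate$rate           # base R $ accessor
by_brackets <- toy_rate[["rate"]]    # base R [[ accessor

# All three return the same plain numeric vector
identical(by_pull, by_dollar)
identical(by_pull, by_brackets)
class(by_pull)
```

The main practical advantage of pull is that it can sit at the end of a pipeline without storing an intermediate object.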
group_by
A common operation in data exploration is to first split data into
groups and then compute summaries for each group. For example, we may
want to compute the average and standard deviation for men’s and women’s
heights separately. The group_by function helps us do this.
If we type this:
heights %>% group_by(sex)
#> # A tibble: 1,050 x 2
#> # Groups: sex [2]
#> sex height
#> <fct> <dbl>
#> 1 Male 75
#> 2 Male 70
#> 3 Male 68
#> 4 Male 74
#> 5 Male 61
#> # … with 1,045 more rows
The result does not look very different from heights, except we see
Groups: sex [2] when we print the object. Although not immediately
obvious from its appearance, this is now a special data frame called a
grouped data frame, and dplyr functions, in particular summarize, will
behave differently when acting on this object.
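The grouping can also be inspected directly rather than read off the printed header. This is a small sketch on a made-up toy_heights data frame (not the dslabs data) using the dplyr helpers group_vars, n_groups, and ungroup.

```r
library(dplyr)

# Toy data, made up for illustration (not the dslabs heights data)
toy_heights <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 70, 74)
)

grouped <- toy_heights %>% group_by(sex)

class(grouped)        # includes "grouped_df" in addition to "data.frame"
group_vars(grouped)   # names of the grouping columns
n_groups(grouped)     # number of distinct groups

# ungroup() removes the grouping and returns a regular table again
ungrouped <- grouped %>% ungroup()
```

The rows and columns are unchanged by group_by and ungroup; only the grouping metadata differs.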
Conceptually, you can think of this table as many tables, with the same
columns but not necessarily the same number of rows, stacked together in
one object. When we summarize the data after grouping, this is what
happens:
heights %>%
  group_by(sex) %>%
  summarize(average = mean(height), standard_deviation = sd(height))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#> sex average standard_deviation
#> <fct> <dbl> <dbl>
#> 1 Female 64.9 3.76
#> 2 Male 69.3 3.61
The summarize function applies the summarization to each group
separately.
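One way to convince ourselves of this is to compare a grouped summary with the result of filtering to a single group first. The sketch below does so on a made-up toy_heights data frame (not the dslabs data).

```r
library(dplyr)

# Toy data, made up for illustration (not the dslabs heights data)
toy_heights <- data.frame(
  sex = c("Female", "Female", "Female", "Male", "Male"),
  height = c(60, 64, 68, 70, 74)
)

# One row per group
grouped_avg <- toy_heights %>%
  group_by(sex) %>%
  summarize(average = mean(height))

# The same summary computed for one group by filtering first
female_avg <- toy_heights %>%
  filter(sex == "Female") %>%
  summarize(average = mean(height))

# The Female row of the grouped result matches the filtered result
grouped_avg %>% filter(sex == "Female") %>% pull(average)
female_avg %>% pull(average)
```

Grouping therefore saves us from writing one filter and summarize pipeline per group.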
For another example, let’s compute the median murder rate in the four regions of the country:
murders %>%
  group_by(region) %>%
  summarize(median_rate = median(rate))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 4 x 2
#> region median_rate
#> <fct> <dbl>
#> 1 Northeast 1.80
#> 2 South 3.40
#> 3 North Central 1.97
#> 4 West 1.29
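The message printed above refers to the .groups argument of summarize. As a hedged sketch, assuming dplyr 1.0.0 or later, setting .groups = "drop" makes the ungrouping explicit and silences the message. The toy_murders table below is made up for illustration; the example also counts the rows per group with n and sorts the result with arrange.

```r
library(dplyr)

# Toy data with made-up rates (not the dslabs murders data)
toy_murders <- data.frame(
  region = c("South", "South", "South", "West", "West"),
  rate = c(2, 3, 7, 1, 2)
)

by_region <- toy_murders %>%
  group_by(region) %>%
  summarize(median_rate = median(rate),
            n_states = n(),            # rows in each group
            .groups = "drop") %>%      # return an ungrouped table
  arrange(desc(median_rate))           # highest median first
by_region
```

Reporting the group size next to each summary helps judge how much data each median is based on.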