Tidy data must be stored in data frames. We introduced the data frame in
Section 2.4 and have been using the murders data frame throughout the
book. In Section 4.4 we introduced the group_by function, which permits
stratifying data before computing summary statistics. But where is the
group information stored in the data frame?
murders %>% group_by(region)
#> # A tibble: 51 x 6
#> # Groups: region [4]
#>   state      abb   region population total  rate
#>   <chr>      <chr> <fct>       <dbl> <dbl> <dbl>
#> 1 Alabama    AL    South     4779736   135  2.82
#> 2 Alaska     AK    West       710231    19  2.68
#> 3 Arizona    AZ    West      6392017   232  3.63
#> 4 Arkansas   AR    South     2915918    93  3.19
#> 5 California CA    West     37253956  1257  3.37
#> # … with 46 more rows
Notice that there are no columns with this information. But if you look
closely at the output above, you see the line A tibble followed by the
dimensions. We can learn the class of the returned object using:
murders %>% group_by(region) %>% class()
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
The tbl, pronounced tibble, is a special kind of data frame. The
functions group_by and summarize always return this type of data frame.
The group_by function returns a special kind of tbl, the grouped_df. We
will say more about these later. For consistency, the dplyr manipulation
verbs (select, filter, mutate, and arrange) preserve the class of the
input: if they receive a regular data frame they return a regular data
frame, while if they receive a tibble they return a tibble. But tibbles
are the preferred format in the tidyverse and as a result tidyverse
functions that produce a data frame from scratch return a tibble. For
example, in Chapter 5 we will see that tidyverse functions used to
import data create tibbles.
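The class-preserving behavior described above can be checked directly. The following is a minimal sketch, assuming a recent version of dplyr is installed; the toy data frame `df` is hypothetical and stands in for murders.

```r
library(dplyr)

# Hypothetical toy data, used here instead of the murders dataset
df <- data.frame(state = c("A", "B", "C"), total = c(10, 20, 30))
tb <- as_tibble(df)

# A regular data frame goes in, a regular data frame comes out
from_df <- filter(df, total > 10)
class(from_df)

# A tibble goes in, a tibble comes out
from_tb <- filter(tb, total > 10)
class(from_tb)

# group_by returns a grouped tibble even when given a regular data frame
from_group <- group_by(df, state)
class(from_group)
```

Comparing the three class vectors shows which verbs keep the input class and which ones convert to a tibble.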
Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless, there are several important differences, which we describe next.
The print method for tibbles is more readable than that of a data frame.
To see this, compare the output of typing murders with the output of
murders after converting it to a tibble with as_tibble(murders). In
RStudio, the output for a tibble adjusts to your window size. To see
this, change the width of your R console and notice how more or fewer
columns are shown.
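The amount of output can also be controlled explicitly rather than by resizing the console. This is a small sketch, assuming the tibble package is installed; it uses the built-in mtcars dataset as a stand-in for murders, and the `n` and `width` arguments of the tibble print method.

```r
library(tibble)

# Built-in dataset, converted to a tibble (stand-in for murders)
cars_tbl <- as_tibble(mtcars)

# Show only the first 3 rows; columns that do not fit are summarized
print(cars_tbl, n = 3)

# Show the first 3 rows and all columns, regardless of console width
print(cars_tbl, n = 3, width = Inf)
```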
If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or scalar. For example:
class(murders[,4])
#> [1] "numeric"
The result is a numeric vector, not a data frame. With tibbles this does not happen:
class(as_tibble(murders)[,4])
#> [1] "tbl_df" "tbl" "data.frame"
This is useful in the tidyverse since tidyverse functions require data frames as input.
With tibbles, if you want to access the vector that defines a column,
and not get back a data frame, you need to use the accessor $:
class(as_tibble(murders)$population)
#> [1] "numeric"
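The subsetting rules above can be compared side by side. The sketch below assumes dplyr is installed and uses a hypothetical two-column data frame rather than murders. It also shows two alternatives to `$` that return a vector from a tibble: the `[[` operator and dplyr's `pull`.

```r
library(dplyr)

# Hypothetical toy data, used here instead of the murders dataset
df <- data.frame(state = c("A", "B"), population = c(100, 200))
tb <- as_tibble(df)

class(df[, 2])                # a vector: the data frame structure is dropped
class(df[, 2, drop = FALSE])  # drop = FALSE keeps a data frame
class(tb[, 2])                # still a tibble

# Ways to extract the underlying vector from a tibble
v_dollar <- tb$population
v_double <- tb[["population"]]
v_pull   <- pull(tb, population)
```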
A related feature is that tibbles will give you a warning if you try to
access a column that does not exist. If we accidentally write Population
instead of population, this:
murders$Population
#> NULL
returns a NULL with no warning, which can make it harder to debug. In
contrast, if we try this with a tibble we get an informative warning:
as_tibble(murders)$Population
#> Warning: Unknown or uninitialised column: `Population`.
#> NULL
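In a script, it can help to capture this warning so that the typo is caught right away. The following is a minimal sketch, assuming the tibble package is installed; the one-column data frame and the use of tryCatch are illustrative choices, not part of the original example.

```r
library(tibble)

# Hypothetical toy data with a lowercase column name
df <- data.frame(population = c(100, 200))
tb <- as_tibble(df)

# The regular data frame silently returns NULL for the misspelled name
silent <- df$Population
is.null(silent)

# With a tibble, the warning can be caught and its message inspected
msg <- tryCatch(tb$Population,
                warning = function(w) conditionMessage(w))
msg
```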
While data frame columns are typically vectors of numbers, strings, or logical values, tibbles make it easy to store more complex objects, such as lists or functions, in a column. For example, we can create a tibble with a column of functions:
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
#> # A tibble: 3 x 2
#>      id func
#>   <dbl> <list>
#> 1     1 <fn>
#> 2     2 <fn>
#> 3     3 <fn>
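A column of functions is stored as a list, so each entry can be extracted and called like any other function. The sketch below assumes the tibble package is installed; the vector `x` and the column names are hypothetical.

```r
library(tibble)

# Hypothetical numeric vector to summarize
x <- c(1, 2, 3, 10)

# A tibble whose second column is a list of functions
funcs <- tibble(name = c("mean", "median"),
                func = list(mean, median))

# The column is a list; each element is a function
is.list(funcs$func)

# Apply every stored function to x
results <- sapply(funcs$func, function(f) f(x))
results
```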
The function group_by returns a special kind of tibble: a grouped
tibble. This class stores information that lets you know which rows are
in which groups. The tidyverse functions, in particular the summarize
function, are aware of the group information.
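The group information can be inspected without looking at the printed output. This is a minimal sketch, assuming dplyr is installed; the small tibble `toy` is hypothetical and only borrows a few values from the output shown earlier.

```r
library(dplyr)

# Hypothetical toy data with one grouping column
toy <- tibble(region = c("South", "West", "West", "South"),
              total  = c(135, 19, 232, 93))

grouped <- group_by(toy, region)

# Names of the grouping variables
vars <- group_vars(grouped)
vars

# One row per group
keys <- group_keys(grouped)
keys

# summarize uses the stored groups: one result row per region
by_region <- summarize(grouped, total = sum(total))
by_region
```

Here group_vars and group_keys are dplyr helpers that report what the grouped tibble has stored, and summarize computes one sum per region because of it.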
tibble instead of data.frame
It is sometimes useful for us to create our own data frames. To create a
data frame in the tibble format, use the tibble function.
grades <- tibble(names = c("John", "Juan", "Jean", "Yao"),
                 exam_1 = c(95, 80, 90, 85),
                 exam_2 = c(90, 85, 85, 90))
Note that base R (without packages loaded) has an analogous function,
data.frame, that can be used to create a regular data frame rather than
a tibble. One other important difference is that, in versions of R
before 4.0.0, data.frame by default coerces characters into factors
without providing a warning or message. Since R 4.0.0 the default is
stringsAsFactors = FALSE, so the output below reflects the older
behavior:
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90))
class(grades$names)
#> [1] "factor"
To avoid this, we use the rather cumbersome argument stringsAsFactors:
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)
class(grades$names)
#> [1] "character"
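Because the default of stringsAsFactors depends on the R version, setting it explicitly gives the same result everywhere. The sketch below assumes the tibble package is installed and uses a hypothetical two-name vector. It contrasts both settings of data.frame with tibble, which keeps character vectors as characters.

```r
library(tibble)

# Hypothetical names, a shortened version of the grades example
nm <- c("John", "Juan")

# Explicitly request factor conversion
as_factor_col <- data.frame(names = nm, stringsAsFactors = TRUE)$names
class(as_factor_col)

# Explicitly keep characters
as_char_col <- data.frame(names = nm, stringsAsFactors = FALSE)$names
class(as_char_col)

# tibble never converts characters to factors
tibble_col <- tibble(names = nm)$names
class(tibble_col)
```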
To convert a regular data frame to a tibble, you can use the as_tibble function.
as_tibble(grades) %>% class()
#> [1] "tbl_df" "tbl" "data.frame"