We say that a data table is in tidy format if each row represents one
observation and columns represent the different variables available for
each of these observations. The murders dataset is an example of a tidy
data frame.
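The first rows shown below can be reproduced with a short snippet along these lines. It is a minimal sketch that assumes the dslabs package, which supplies the example datasets, is already installed:

```r
# Assumes dslabs is installed, e.g. via install.packages("dslabs")
library(dslabs)

# Load the murders data frame into the workspace
data(murders)

# Show the first six rows: one row per state, one column per variable
head(murders)
```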
#> state abb region population total
#> 1 Alabama AL South 4779736 135
#> 2 Alaska AK West 710231 19
#> 3 Arizona AZ West 6392017 232
#> 4 Arkansas AR South 2915918 93
#> 5 California CA West 37253956 1257
#> 6 Colorado CO West 5029196 65
Each row represents a state, and each of the five columns provides a different variable related to these states: name, abbreviation, region, population, and total murders.
To see how the same information can be provided in different formats, consider the following example:
#> country year fertility
#> 1 Germany 1960 2.41
#> 2 South Korea 1960 6.16
#> 3 Germany 1961 2.44
#> 4 South Korea 1961 5.99
#> 5 Germany 1962 2.47
#> 6 South Korea 1962 5.79
This dataset provides fertility rates for two countries over several years. It is tidy because each row presents one observation, with the three variables being country, year, and fertility rate. However, this dataset originally came in another format and was reshaped for the dslabs package. Originally, the data were in the following format:
#> country 1960 1961 1962
#> 1 Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79
The same information is provided, but there are two important
differences in the format: 1) each row includes several observations and
2) one of the variables, year, is stored in the header. To make the best
use of the tidyverse packages, data need to be reshaped into tidy
format, which you will learn to do in the Data Wrangling part of the
book. Until then, we will use example datasets that are already in tidy
format.
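As a rough preview of what such reshaping involves, here is a minimal base R sketch that types in the wide table above by hand and stacks its year columns into tidy rows. The object names wide, year_cols, and tidy are illustrative only, and this manual construction is not necessarily how the Data Wrangling part of the book approaches the task.

```r
# Wide format, typed in by hand from the table above.
# check.names = FALSE keeps the year column names as "1960", "1961", "1962".
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE
)

# Every column except country holds fertility values for one year
year_cols <- setdiff(names(wide), "country")

# Stack the year columns so that each row is one country-year observation
tidy <- data.frame(
  country = rep(wide$country, times = length(year_cols)),
  year = rep(as.integer(year_cols), each = nrow(wide)),
  fertility = unlist(wide[year_cols], use.names = FALSE)
)

tidy
```

The year, which was stored in the header of the wide table, becomes an ordinary column, and each row now holds a single observation.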
Although not immediately obvious, as you go through the book you will start to appreciate the advantages of working in a framework in which functions use tidy formats for both inputs and outputs. You will see how this permits the data analyst to focus on more important aspects of the analysis rather than the format of the data.