In this section we introduce the main tidyverse data-importing
functions. We will use the murders.csv file provided by the dslabs
package as an example. To simplify the illustration, we will copy the
file to our working directory using the following code:
filename <- "murders.csv"
dir <- system.file("extdata", package = "dslabs")
fullpath <- file.path(dir, filename)
file.copy(fullpath, "murders.csv")
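One detail worth knowing when rerunning the code above: file.copy returns a logical value and, by default, does not overwrite an existing file. The following is a minimal sketch of that behavior using throwaway files in a temporary directory. The file names here are hypothetical and chosen only for the demonstration.

```r
# Create a small source file in the temporary directory (hypothetical content)
src <- tempfile(fileext = ".csv")
writeLines("state,abb", src)

# Hypothetical destination; remove any leftover copy from a previous run
dest <- file.path(tempdir(), "copy-demo.csv")
unlink(dest)

first  <- file.copy(src, dest)                   # TRUE: the copy succeeds
second <- file.copy(src, dest)                   # FALSE: dest exists and overwrite = FALSE by default
third  <- file.copy(src, dest, overwrite = TRUE) # TRUE: explicitly replace the existing file
```

So a FALSE result the second time the copying code is run usually just means the file is already in place.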
The readr package includes functions for reading data stored in text file spreadsheets into R. readr is part of the tidyverse, so it is loaded along with the tidyverse package, or you can load it directly:
library(readr)
The following functions are available to read in spreadsheets:
Function | Format | Typical suffix |
---|---|---|
read_table | white space separated values | txt |
read_csv | comma separated values | csv |
read_csv2 | semicolon separated values | csv |
read_tsv | tab-delimited values | tsv |
read_delim | general text file format; the delimiter must be specified | txt |
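As an illustration of the last row, here is a minimal sketch of read_delim with an explicitly defined delimiter. The pipe-delimited file is a hypothetical one written to a temporary location, with values borrowed from the murders data. The col_types string is an optional extra that fixes the column types up front ("c" for character, "d" for double).

```r
library(readr)

# Write a small pipe-delimited file (hypothetical, for illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("state|abb|total",
             "Alabama|AL|135",
             "Alaska|AK|19"), tmp)

# read_delim needs to be told which character separates the fields
dat_delim <- read_delim(tmp, delim = "|", col_types = "ccd")
```

The same call with delim = "," or delim = "\t" would behave much like read_csv or read_tsv, respectively.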
Although the suffix usually tells us what type of file it is, there is
no guarantee that these always match. We can open the file to take a
look or use the function read_lines to look at a few lines:
read_lines("murders.csv", n_max = 3)
#> [1] "state,abb,region,population,total"
#> [2] "Alabama,AL,South,4779736,135"
#> [3] "Alaska,AK,West,710231,19"
This also shows that there is a header. Now we are ready to read the
data into R. From the .csv suffix and the peek at the file, we know to
use read_csv:
dat <- read_csv(filename)
#> Parsed with column specification:
#> cols(
#> state = col_character(),
#> abb = col_character(),
#> region = col_character(),
#> population = col_double(),
#> total = col_double()
#> )
Note that we receive a message letting us know what data types were used
for each column. Also note that dat is a tibble, not just a data
frame. This is because read_csv is a tidyverse parser. We can
confirm that the data has in fact been read in with:
View(dat)
Finally, note that we can also use the full path for the file:
dat <- read_csv(fullpath)
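The two points above, the column-type message and the tibble class, can also be checked without an interactive viewer. The following is a minimal sketch that assumes only the three lines of murders.csv shown earlier, written to a temporary file so the example stands on its own. Passing col_types explicitly is an optional step that states the types ourselves instead of letting readr guess them.

```r
library(readr)

# Recreate the first three lines of murders.csv in a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("state,abb,region,population,total",
             "Alabama,AL,South,4779736,135",
             "Alaska,AK,West,710231,19"), tmp)

# Declare the column types explicitly, matching the specification readr reported
dat_small <- read_csv(tmp, col_types = cols(
  state = col_character(),
  abb = col_character(),
  region = col_character(),
  population = col_double(),
  total = col_double()
))

# A tibble inherits from both "tbl_df" and "data.frame"
is_tibble <- inherits(dat_small, "tbl_df")
is_df <- is.data.frame(dat_small)
```

Printing dat_small or calling head(dat_small) is another quick way to inspect the result in a non-interactive session.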
You can load the readxl package using:
library(readxl)
The package provides functions to read Microsoft Excel formats:
Function | Format | Typical suffix |
---|---|---|
read_excel | auto detect the format | xls, xlsx |
read_xls | original format | xls |
read_xlsx | new format | xlsx |
The Microsoft Excel formats permit you to have more than one spreadsheet
in one file. These are referred to as sheets. The functions listed
above read the first sheet by default, but we can also read the others.
The excel_sheets function gives us the names of all the sheets in an
Excel file. These names can then be passed to the sheet argument in
the three functions above to read sheets other than the first.
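The workflow just described can be sketched as follows. Since this section does not provide an Excel file, the sketch assumes the example workbook datasets.xlsx that ships with readxl, located with the readxl_example helper. Any multi-sheet workbook of your own would work the same way.

```r
library(readxl)

# Path to an example workbook bundled with readxl (an assumption for this sketch)
path <- readxl_example("datasets.xlsx")

# List the names of all sheets in the file
sheets <- excel_sheets(path)

# By default the first sheet is read
first_sheet <- read_excel(path)

# Pass a sheet name to read a sheet other than the first
second_sheet <- read_excel(path, sheet = sheets[2])
```

The sheet argument also accepts a position, so sheet = 2 would be an alternative to passing the name.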