R-base also provides import functions. These have similar names to those in the tidyverse, for example read.table, read.csv, and read.delim. However, there are a couple of important differences. To show this, we read in the data with an R-base function:
dat2 <- read.csv(filename)
An important difference is that, in R versions prior to 4.0, characters are converted to factors by default:
class(dat2$abb)
#> [1] "factor"
class(dat2$region)
#> [1] "factor"
This can be avoided by setting the argument stringsAsFactors to FALSE.
dat <- read.csv(filename, stringsAsFactors = FALSE)
class(dat$state)
#> [1] "character"
In our experience this can be a cause for confusion since a variable that was saved as characters in the file is converted to a factor regardless of what the variable represents. In fact, we highly recommend setting stringsAsFactors = FALSE as your default approach when using the R-base parsers. You can easily convert the desired columns to factors after importing the data.
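For example, here is a minimal sketch, assuming we want region, and only region, as a factor:
dat$region <- as.factor(dat$region)  # convert just this column after import
class(dat$region)
#> [1] "factor"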
scan
When reading in spreadsheets many things can go wrong. The file might have a multiline header, be missing cells, or it might use an unexpected encoding. We recommend you read this post about common encoding issues: the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses
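Some of these problems can be handled through arguments to the R-base parsers. As a sketch, assuming a hypothetical file with two extra lines above the header and Latin-1 encoding:
dat <- read.csv(filename, skip = 2, fileEncoding = "latin1")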
With experience you will learn how to deal with different challenges. Carefully reading the help files for the functions discussed here will be useful. With scan you can read in each cell of a file. Here is an example:
path <- system.file("extdata", package = "dslabs")
filename <- "murders.csv"
x <- scan(file.path(path, filename), sep=",", what = "c")
x[1:10]
#> [1] "state" "abb" "region" "population" "total"
#> [6] "Alabama" "AL" "South" "4779736" "135"
Note that the tidyverse provides read_lines, a similarly useful function.
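For instance, a quick sketch of reading the file line by line rather than cell by cell:
library(readr)
lines <- read_lines(file.path(path, filename))
lines[1]  # the header row: state,abb,region,population,total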