The first step when importing data from a spreadsheet is to locate the file containing the data. Although we do not recommend it, you can locate the file much as you would in Microsoft Excel: click the RStudio “File” menu, click “Import Dataset”, then click through folders until you find the file. However, we want to be able to write code rather than use the point-and-click approach. The key concepts we need to learn to do this are described in detail in the Productivity Tools part of this book. Here we provide an overview of the very basics.
The main challenge in this first step is that we need to let the R functions doing the importing know where to look for the file containing the data. The simplest way to do this is to have a copy of the file in the folder in which the importing functions look by default. Once we do this, all we have to supply to the importing function is the filename.
A spreadsheet containing the US murders data is included as part of the dslabs package. Finding this file is not straightforward, but the following lines of code copy the file to the folder in which R looks by default. We explain how these lines work below.
filename <- "murders.csv"
dir <- system.file("extdata", package = "dslabs")
fullpath <- file.path(dir, filename)
file.copy(fullpath, "murders.csv")
This code does not read the data into R; it just copies a file. But once the file is copied, we can import the data with a single line of code. Here we use the read_csv function from the readr package, which is part of the tidyverse.
library(tidyverse)
dat <- read_csv(filename)
The data is imported and stored in dat. The rest of this section defines some important concepts and provides an overview of how we write code that tells R how to find the files we want to import.
You can think of your computer’s filesystem as a series of nested folders, each containing other folders and files. Data scientists refer to folders as directories. We refer to the folder that contains all other folders as the root directory. We refer to the directory in which we are currently located as the working directory. The working directory therefore changes as you move through folders: think of it as your current location.
The path of a file is a list of directory names that can be thought of as instructions on what folders to click on, and in what order, to find the file. If these instructions are for finding the file from the root directory we refer to it as the full path. If the instructions are for finding the file starting in the working directory we refer to it as a relative path.
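The distinction can be sketched in a few lines of R. The helper is_full_path below is hypothetical (base R has no function by that name), the /home/user/project directory is a made-up example, and the check assumes that full paths begin with a slash or, on Windows, a drive letter.

```r
# Hypothetical helper, not part of base R: TRUE when the path starts at the
# root ("/") or at a Windows drive letter such as "C:".
is_full_path <- function(path) {
  grepl("^(/|[A-Za-z]:)", path)
}

# file.path() joins directory and file names with slashes.
relative <- file.path("data", "murders.csv")
full <- file.path("/home/user/project", relative)  # made-up directory

relative                 # "data/murders.csv"
full                     # "/home/user/project/data/murders.csv"
is_full_path(relative)   # FALSE
is_full_path(full)       # TRUE
```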
To see an example of a full path on your system, type the following:
system.file(package = "dslabs")
The strings separated by slashes are the directory names. The first slash represents the root directory, and we know this is a full path because it starts with a slash (on Windows, a full path starts with a drive letter such as C: instead). If the first directory name appears without a slash in front, then the path is assumed to be relative. We can use the function list.files to see examples of relative paths.
dir <- system.file(package = "dslabs")
list.files(path = dir)
#> [1] "data" "DESCRIPTION" "extdata" "help"
#> [5] "html" "INDEX" "Meta" "NAMESPACE"
#> [9] "R" "script"
These relative paths give us the location of the files or directories if we start in the directory with the full path. For example, the full path to the help directory in the example above is /Library/Frameworks/R.framework/Versions/3.5/Resources/library/dslabs/help.
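As a sketch, assuming the dslabs package is installed, list.files can also return full paths directly through its full.names argument, which saves joining the directory and each relative path by hand. The object names below are our own.

```r
dir <- system.file(package = "dslabs")

# Default: names relative to dir
relative_names <- list.files(path = dir)

# full.names = TRUE prepends dir to each name
full_names <- list.files(path = dir, full.names = TRUE)

# Check that each full path ends with the corresponding relative name
all(basename(full_names) == relative_names)
```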
Note: You will probably not make much use of the system.file function in your day-to-day data analysis work. We introduce it in this section because it facilitates the sharing of spreadsheets by including them in the dslabs package. You will rarely have the luxury of data being included in packages you already have installed. However, you will frequently need to navigate full and relative paths and import spreadsheet-formatted data.
We highly recommend only writing relative paths in your code. The reason is that full paths are unique to your computer and you want your code to be portable. You can get the full path of your working directory without writing it out explicitly by using the getwd function.
wd <- getwd()
If you need to change your working directory, you can use the function setwd, or you can change it through RStudio by clicking on “Session”.
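The following sketch shows that a relative path is resolved against the working directory. It assumes you can write to R's temporary directory, and example.csv is a made-up file name. Because setwd returns the previous working directory, we can store it and switch back at the end.

```r
# Move to a temporary directory, remembering where we started
old_wd <- setwd(tempdir())

# Create a small file using a relative path
writeLines(c("x,y", "1,2"), "example.csv")

# The relative path and the full path built from getwd() name the same file
found_relative <- file.exists("example.csv")
found_full <- file.exists(file.path(getwd(), "example.csv"))
c(found_relative, found_full)

# Clean up and return to the original working directory
file.remove("example.csv")
setwd(old_wd)
```

Restoring the original working directory at the end keeps later code that relies on relative paths from breaking.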
Another example of obtaining a full path without writing it out explicitly was given above when we created the object fullpath like this:
filename <- "murders.csv"
dir <- system.file("extdata", package = "dslabs")
fullpath <- file.path(dir, filename)
The function system.file provides the full path of the folder containing all the files and directories relevant to the package specified by the package argument. By exploring the directories in dir, we find that the extdata directory contains the file we want:
dir <- system.file(package = "dslabs")
filename %in% list.files(file.path(dir, "extdata"))
#> [1] TRUE
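An equivalent check, sketched here under the assumption that dslabs is installed, asks directly whether the file exists at the expected path using the base R function file.exists. The object name candidate is our own.

```r
filename <- "murders.csv"
dir <- system.file(package = "dslabs")

# Build the path to the file inside extdata and test for its existence
candidate <- file.path(dir, "extdata", filename)
file.exists(candidate)
```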
The system.file function permits us to provide a subdirectory as a first argument, so we can obtain the full path of the extdata directory like this:
dir <- system.file("extdata", package = "dslabs")
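As a sketch, again assuming dslabs is installed: system.file accepts more than one path component, so the file name can be passed along with the subdirectory. When nothing is found, it returns an empty string rather than raising an error, which is worth checking before importing. The name no-such-file.csv is made up for illustration.

```r
# Subdirectory and file name in one call
direct <- system.file("extdata", "murders.csv", package = "dslabs")

# A file that is not in the package yields an empty string
not_found <- system.file("extdata", "no-such-file.csv", package = "dslabs")
not_found == ""

# nzchar() is TRUE for non-empty strings, so this tests whether the file was found
nzchar(direct)
```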
The function file.path is used to combine directory names to produce the full path of the file we want to import.
fullpath <- file.path(dir, filename)
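A short sketch of how file.path behaves, using made-up directory and file names: it joins its arguments with a slash, it accepts a vector of file names, and the base R functions dirname and basename split a path back into its parts.

```r
# Join any number of components
p <- file.path("extdata", "murders.csv")
p                      # "extdata/murders.csv"

# Vectorized over file names (a.csv and b.csv are hypothetical)
several <- file.path("extdata", c("a.csv", "b.csv"))
several                # "extdata/a.csv" "extdata/b.csv"

# Split a path back into directory and file name
dirname(p)             # "extdata"
basename(p)            # "murders.csv"
```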
The final line of code we used to copy the file into our working directory used the function file.copy. This function takes two arguments: the file to copy and the name to give it in the new directory.
file.copy(fullpath, "murders.csv")
#> [1] TRUE
If a file is copied successfully, the file.copy function returns TRUE. Note that we are giving the file the same name, murders.csv, but we could have named it anything. Also note that by not starting the string with a slash, R assumes this is a relative path and copies the file to the working directory.
You should be able to see the file in your working directory and can check by using:
list.files()
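One detail worth knowing, sketched below with temporary files so that nothing in your working directory is touched: by default, file.copy does not overwrite an existing file and returns FALSE in that case, so rerunning the copy code above returns FALSE the second time. The file names here are made up.

```r
# A small source file in R's temporary directory
src <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), src)

dest <- file.path(tempdir(), "copy-example.csv")
unlink(dest)  # make sure we start without a previous copy

first <- file.copy(src, dest)                     # destination is new
second <- file.copy(src, dest)                    # destination already exists
third <- file.copy(src, dest, overwrite = TRUE)   # explicitly allow overwriting

c(first, second, third)   # TRUE FALSE TRUE
```

A FALSE from file.copy therefore does not necessarily mean the file is missing; it may simply already be in place.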