For this exercise, the olive.csv file is available in the working directory.
Working in RStudio
If you want to do this exercise locally, you can download the file to your working directory using the following command:
download.file("https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/olive.csv", "olive.csv")
When we use the read_csv function from the readr package (part of the tidyverse) to read the olive.csv file, we notice a warning that mentions parsing issues.
read_csv("olive.csv")
# New names:
# ' ' > ...1
# Rows: 572 Columns: 11
# Column specification
# Delimiter: ","
# chr (2): Region, eicosenoic
# dbl (9): ...1, Area, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic
# A tibble: 572 × 11
# ...1 Region Area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 1 North-Apulia 1 1 1075 75 226 7823 672 36 60,29
# 2 2 North-Apulia 1 1 1088 73 224 7709 781 31 61,29
# 3 3 North-Apulia 1 1 911 54 246 8113 549 31 63,29
# 4 4 North-Apulia 1 1 966 57 240 7952 619 50 78,35
#Warning message:
#One or more parsing issues, see `problems()` for details
When we run problems() we see the following.
# problems()
#
# A tibble: 572 × 5
# row col expected actual file
# <int> <int> <chr> <chr> <chr>
# 1 2 12 11 columns 12 columns /User/path
# 2 3 12 11 columns 12 columns /User/path
# 3 4 12 11 columns 12 columns /User/path
This tells us that R expected 11 columns but actually received 12. To understand what is happening, we need to look at the original file. Go to this site, or open the raw file at the URL used in the download command above. There we see that 11 column names are specified in the first line. The first column name is empty, which is why we get the "New names" message: R created a name (...1) for that column. However, every other line contains 12 values. Because R only received 11 column names, it merged the last two values of each line into one column. As a result, the eicosenoic column incorrectly holds two values per row (for example 60,29) and is read as character instead of numeric.
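The mismatch between the header and the data lines can also be checked without opening the file in a browser. The following is a minimal sketch using base R's count.fields() on a small made-up file; the file contents and the names toy_path and field_counts are assumptions for illustration only, not the real olive data.

```r
# Toy file with the same kind of defect (values are made up):
# the header has 3 fields (the first one empty), each data line has 4.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(",group,x",
             "1,a,10,20",
             "2,b,30,40"),
           toy_path)

# count.fields() (utils package, attached by default) returns the number
# of separator-delimited fields found on each line of the file.
field_counts <- count.fields(toy_path, sep = ",")
print(field_counts)
```

The first element is the field count of the header and the remaining elements are the counts for the data lines, so a header that is one field short stands out immediately. The same call with "olive.csv" as the file should show the 11 versus 12 pattern described above.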
Read the help file for read_csv to figure out how to read in the olive.csv file without reading this header. By skipping the first line we avoid the issues mentioned above. Save the result to an object called dat.
Hint
You will need to add the skip and col_names parameters to the read_csv function call.
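To illustrate what these two parameters do, here is a hedged sketch on a small made-up file rather than on olive.csv itself. The file contents and the names toy_path and toy are assumptions for illustration; show_col_types = FALSE only silences the column specification message and assumes readr 2.0 or later.

```r
library(readr)

# Toy file (values are made up): 3 header fields, 4 values per data line.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(",group,x",
             "1,a,10,20",
             "2,b,30,40"),
           toy_path)

# skip = 1 drops the first line of the file; col_names = FALSE tells
# read_csv not to treat the next line as a header, so every remaining
# line is data and the columns get generic names (X1, X2, ...).
toy <- read_csv(toy_path, skip = 1, col_names = FALSE, show_col_types = FALSE)
print(dim(toy))
print(names(toy))
```

With the defective header out of the way, every value ends up in its own column, at the cost of losing the original column names.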
A problem with the previous approach is that we don’t receive the column names, because we skipped the header line. Run:
names(dat)
to see that the names are not informative. Use the read_lines function to read in just the first line of the olive.csv file. Store your result in header_line. (Hint: use the n_max parameter.) We can then use the header_line object to manually add column names, but this is not needed for this exercise.
Note: don’t forget to load the readr package.
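Although it is not required for the exercise, the idea of reusing the header line for column names can be sketched as follows. This is one possible approach on a made-up file, not the course's reference solution. The names toy_path, header_names and col_names_full are hypothetical, and so are the replacement names "row" and "y". Placing the missing name at the end is an assumption that holds for this toy file only; for real data you would need to inspect the values to decide which column lacks a name.

```r
library(readr)

# Toy file (values are made up): 3 header fields, 4 values per data line.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(",group,x",
             "1,a,10,20",
             "2,b,30,40"),
           toy_path)

# Read only the first line of the file as a single string.
header_line <- read_lines(toy_path, n_max = 1)

# Split the string on commas to get the individual header fields.
header_names <- strsplit(header_line, ",")[[1]]

# Give the empty first field a name ("row" is a hypothetical choice).
header_names[header_names == ""] <- "row"

# Append a name for the unnamed column ("y" is hypothetical, and putting
# it last is an assumption that is only valid for this toy file).
col_names_full <- c(header_names, "y")

# Skip the defective header and supply the repaired names instead.
toy <- read_csv(toy_path, skip = 1, col_names = col_names_full,
                show_col_types = FALSE)
print(names(toy))
```

Passing a character vector to col_names lets read_csv use those names directly, so the number of names has to match the number of values on each data line.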