For this exercise the olive.csv file is available in the working directory.
Working in RStudio
If you want to do this exercise locally, you can download the file to your working directory using the following command:
download.file("https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/olive.csv", "olive.csv")
When we use the read_csv function (from readr, part of the tidyverse) to read the olive.csv file, we notice a warning that mentions parsing issues.
read_csv("olive.csv")
# New names:
# • `` -> `...1`
# Rows: 572 Columns: 11
# Column specification
# Delimiter: ","
# chr (2): Region, eicosenoic
# dbl (9): ...1, Area, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic
# A tibble: 572 × 11
# ...1 Region Area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 1 North-Apulia 1 1 1075 75 226 7823 672 36 60,29
# 2 2 North-Apulia 1 1 1088 73 224 7709 781 31 61,29
# 3 3 North-Apulia 1 1 911 54 246 8113 549 31 63,29
# 4 4 North-Apulia 1 1 966 57 240 7952 619 50 78,35
# Warning message:
# One or more parsing issues, see `problems()` for details
When we run problems() we see the following.
# problems()
#
# A tibble: 572 × 5
# row col expected actual file
# <int> <int> <chr> <chr> <chr>
# 1 2 12 11 columns 12 columns /User/path
# 2 3 12 11 columns 12 columns /User/path
# 3 4 12 11 columns 12 columns /User/path
This tells us that read_csv expected 11 columns but found 12 values in each data row. To understand what is happening we need to look at the original file, for example by opening the raw CSV in a browser or text editor. There we see that the first line specifies 11 column names. The first column name is empty, which is why we get the "New names" message: readr created the name ...1 for that column. All other lines, however, contain 12 values. Because only 11 column names were supplied, readr merged the last two values into the final column, so eicosenoic now incorrectly holds two values (for example 60,29).
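The mismatch can also be checked from R without opening the file by hand. The following is a minimal sketch on a toy file that mimics the structure described above (11 header names, 12 values per data row). The temporary file, its contents, and the names toy_path and n_fields are illustrative assumptions, and the naive comma split only works because no field contains a comma inside quotes.

```r
# Toy file mimicking olive.csv: 11 names in the header, 12 values per data row.
# (Illustrative contents; not read from the real file.)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","Region","Area","palmitic","palmitoleic","stearic","oleic","linoleic","linolenic","arachidic","eicosenoic"',
  '"1","North-Apulia",1,1,1075,75,226,7823,672,36,60,29',
  '"2","North-Apulia",1,1,1088,73,224,7709,781,31,61,29'
), toy_path)

# Split every line on commas and count the pieces.
# Naive: assumes no commas appear inside quoted fields.
n_fields <- lengths(strsplit(readLines(toy_path), ",", fixed = TRUE))
n_fields
# [1] 11 12 12
```

The first count is for the header line and the remaining counts are for the data rows, which makes the 11 versus 12 discrepancy explicit.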
Read the help file for read_csv to figure out how to read in the olive.csv file without reading this header. By skipping the first line we avoid the issues mentioned above. Save the result to an object called dat.
Hint
You will need to add the skip and col_names parameters to the read_csv function call.
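To see how these two parameters behave, here is a small sketch on a hypothetical toy file, not on olive.csv itself. The file contents and the names toy_path and toy are assumptions made for illustration, and show_col_types assumes readr 2.0 or later.

```r
library(readr)

# Toy file with the same defect: 3 header names, 4 values per data row.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","a","b"',
  '"1","x",10,20',
  '"2","y",30,40'
), toy_path)

# skip = 1 drops the header line; col_names = FALSE tells readr that the
# first line it reads is data, so it generates the names X1, X2, ...
toy <- read_csv(toy_path, skip = 1, col_names = FALSE, show_col_types = FALSE)

names(toy)
# [1] "X1" "X2" "X3" "X4"
dim(toy)
# [1] 2 4
```

With the header out of the way, every value lands in its own column, at the cost of generic column names.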
A problem with the previous approach is that we don’t receive the column names as we skipped the header line. Run:
names(dat)
to see that the names are not informative. Use the read_lines function to read in just the first line from the olive.csv file. Store your result in header_line. (Hint: use the n_max parameter). We can then use the header_line object to manually add column names, but this is not needed for this exercise.
Note: don’t forget to load the readr package, for example with library(readr).
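As an illustration of n_max, the sketch below reads only the first line of a hypothetical toy file and then splits it into candidate column names. The file contents and the names toy_path, header_toy, and toy_names are assumptions for this example; the splitting step is one possible approach and is not required by the exercise.

```r
library(readr)

# Toy file: a header line followed by two data rows (illustrative contents).
toy_path <- tempfile(fileext = ".csv")
writeLines(c('"","a","b"', '"1","x",10,20', '"2","y",30,40'), toy_path)

# n_max = 1 stops reading after the first line.
header_toy <- read_lines(toy_path, n_max = 1)
header_toy
# [1] "\"\",\"a\",\"b\""

# Optional: split on commas and strip the quotes to get the individual names.
toy_names <- gsub('"', "", strsplit(header_toy, ",", fixed = TRUE)[[1]], fixed = TRUE)
toy_names
# [1] ""  "a" "b"
```

The empty first name mirrors the unnamed first column seen in the read_csv output above.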