For this exercise the olive.csv file is available in the working directory.

Working in RStudio

If you want to do this exercise locally, you can download the file to your working directory using the following command:
download.file("https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/olive.csv", "olive.csv")

When we use the read_csv function from the readr package (part of the tidyverse) to read the olive.csv file, we get a warning that mentions parsing issues.

read_csv("olive.csv")

# New names: 
# ' ' > ...1
# Rows: 572 Columns: 11
# Column specification
# Delimiter: ","
# chr (2): Region, eicosenoic
# dbl (9): ...1, Area, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic

# A tibble: 572 × 11
#    ...1 Region        Area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
#   <dbl> <chr>        <dbl>    <dbl>       <dbl>   <dbl> <dbl>    <dbl>     <dbl>     <dbl> <chr>     
# 1     1 North-Apulia     1        1        1075      75   226     7823       672        36 60,29     
# 2     2 North-Apulia     1        1        1088      73   224     7709       781        31 61,29     
# 3     3 North-Apulia     1        1         911      54   246     8113       549        31 63,29     
# 4     4 North-Apulia     1        1         966      57   240     7952       619        50 78,35     
 
#Warning message:
#One or more parsing issues, see `problems()` for details 

When we run problems() we see the following.

# problems()
#
# A tibble: 572 × 5
#     row   col expected   actual     file                                                                  
#   <int> <int> <chr>      <chr>      <chr>                                                             
# 1     2    12 11 columns 12 columns /User/path
# 2     3    12 11 columns 12 columns /User/path
# 3     4    12 11 columns 12 columns /User/path

This tells us that read_csv expected 11 columns but found 12 in each data row. To understand what is happening we need to look at the original file, for example by opening the URL used in the download command above. There we see that the first line specifies 11 column names. The first column name is empty, which is why we get the "New names" message: a name (...1) was created for that column. However, every other line contains 12 values. Because only 11 column names were supplied, the last two values of each row were merged into one column. As a result, the eicosenoic column was read as text and incorrectly holds two values per row, such as 60,29.
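The mismatch described above can be checked by counting the comma-separated fields on each raw line. The following is only a minimal sketch on a hypothetical three-line toy file (toy_path and its contents are made up for illustration and are not the real olive.csv); it uses base R so that it runs without any packages.

```r
# Hypothetical toy file that mimics the structure described above:
# the header has 3 fields (the first one is empty), but each data line has 4.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(",Region,eicosenoic",
             "1,North-Apulia,60,29",
             "2,North-Apulia,61,29"),
           toy_path)

# Read the raw text lines without any parsing.
raw_lines <- readLines(toy_path, n = 3)

# Split each line on commas and count the resulting fields.
n_fields <- lengths(strsplit(raw_lines, ",", fixed = TRUE))
n_fields
```

Here the first count (the header) is one smaller than the counts for the data lines, which is the same kind of mismatch that problems() reports for olive.csv. Splitting on commas is a rough check that assumes no quoted fields contain commas.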

Exercise

  1. Read the help file for read_csv to figure out how to read in the olive.csv file without reading this header. By skipping the first line we avoid the issues mentioned above. Save the result to an object called dat.

    Hint

    You will need to add the skip and col_names parameters to the read_csv function call.

  2. A problem with the previous approach is that we lose the column names because we skipped the header line. Run:

     names(dat)
    

    to see that the names are not informative. Use the read_lines function to read in just the first line from the olive.csv file. Store your result in header_line. (Hint: use the n_max parameter). We can then use the header_line object to manually add column names, but this is not needed for this exercise.


Note: don’t forget to load the readr package.