Although this book focuses almost exclusively on data analysis, data
management is also an important part of data science. As explained in
the introduction, we do not cover this topic. However, quite often data
analysts needs to collect data, or work with others collecting data, in
a way that is most conveniently stored in a spreadsheet. Although
filling out a spreadsheet by hand is a practice we highly discourage, we
instead recommend the process be automatized as much as possible,
sometimes you just have to do it. Therefore, in this section, we provide
recommendations on how to organize data in a spreadsheet. Although there
are R packages designed to read Microsoft Excel spreadsheets, we
generally want to avoid this format. Instead, we recommend Google Sheets
as a free software tool. Below we summarize the recommendations made in
paper by Karl Broman and Kara Woo. Please read the paper for
important details.
- Be Consistent - Before you commence entering data, have a plan.
Once you have a plan, be consistent and stick to it.
- Choose Good Names for Things - You want the names you pick for
objects, files, and directories to be memorable, easy to spell, and
descriptive. This is actually a hard balance to achieve and it does
require time and thought. One important rule to follow is do not
use spaces, use underscores
_
or dashes instead -
. Also, avoid
symbols; stick to letters and numbers.
- Write Dates as YYYY-MM-DD - To avoid confusion, we strongly
recommend using this global ISO 8601 standard.
- No Empty Cells - Fill in all cells and use some common code for
missing data.
- Put Just One Thing in a Cell - It is better to add columns to
store the extra information rather than having more than one piece
of information in one cell.
- Make It a Rectangle - The spreadsheet should be a rectangle.
- Create a Data Dictionary - If you need to explain things, such
as what the columns are or what the labels used for categorical
variables are, do this in a separate file.
- No Calculations in the Raw Data Files - Excel permits you to
perform calculations. Do not make this part of your spreadsheet.
Code for calculations should be in a script.
- Do Not Use Font Color or Highlighting as Data - Most import
functions are not able to import this information. Encode this
information as a variable instead.
- Make Backups - Make regular backups of your data.
- Use Data Validation to Avoid Errors - Leverage the tools in your
spreadsheet software so that the process is as error-free and
repetitive-stress-injury-free as possible.
- Save the Data as Text Files - Save files for sharing in comma or
tab delimited format.