The demand for skilled data science practitioners in industry, academia,
and government is rapidly growing. This book introduces concepts and
skills that can help you tackle real-world data analysis challenges. It
covers concepts from probability, statistical inference, linear
regression, and machine learning. It also helps you develop skills such
as R programming, data wrangling with **dplyr**, data visualization with
**ggplot2**, algorithm building with **caret**, file organization with
the UNIX/Linux shell, version control with Git and GitHub, and reproducible
document preparation with **knitr** and R Markdown. The book is divided
into six parts: **R**, **Data Visualization**, **Statistics with R**,
**Data Wrangling**, **Machine Learning**, and **Productivity Tools**.
Each part has several chapters, each meant to be presented as one
lecture, and includes dozens of exercises distributed across chapters.

Throughout the book, we use motivating case studies. In each case study, we try to realistically mimic a data scientist's experience. For each of the concepts covered, we start by asking specific questions and answer these through data analysis. We learn the concepts as a means to answer the questions. Examples of the case studies included in the book are:

Case Study | Concept |
---|---|
US murder rates by state | R Basics |
Student heights | Statistical Summaries |
Trends in world health and economics | Data Visualization |
The impact of vaccines on infectious disease rates | Data Visualization |
The financial crisis of 2007-2008 | Probability |
Election forecasting | Statistical Inference |
Reported student heights | Data Wrangling |
Money Ball: Building a baseball team | Linear Regression |
MNIST: Image processing of handwritten digits | Machine Learning |
Movie recommendation systems | Machine Learning |

This book is meant to be a textbook for a first course in Data Science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The statistical concepts used to answer the case study questions are only briefly introduced, so a Probability and Statistics textbook is highly recommended for in-depth understanding of these concepts. If you read and understand all the chapters and complete all the exercises, you will be well-positioned to perform basic data analysis tasks and you will be prepared to learn the more advanced concepts and skills needed to become an expert.

We start by going over the **basics of R** and the **tidyverse**. You
learn R throughout the book, but in the first part we go over the
building blocks needed to keep learning.

The growing availability of informative datasets and software tools has
led to increased reliance on **data visualizations** in many fields. In
the second part we demonstrate how to use **ggplot2** to generate graphs
and describe important data visualization principles.

In the third part we demonstrate the importance of statistics in data
analysis by answering case study questions using **probability,
inference, and regression** with R.

The fourth part uses several examples to familiarize the reader with
**data wrangling**. Among the specific skills we learn are web scraping,
using regular expressions, and joining and reshaping data tables. We do
this using **tidyverse** tools.

In the fifth part we present several challenges that lead us to
introduce **machine learning**. We learn to use the **caret** package to
build prediction algorithms including K-nearest neighbors and random
forests.

In the final part, we provide a brief introduction to the **productivity
tools** we use on a day-to-day basis in data science projects. These are
RStudio, UNIX/Linux shell, Git and GitHub, and **knitr** and R Markdown.

This book focuses on the data analysis aspects of data science. We therefore do not cover aspects related to data management or engineering. Although R programming is an essential part of the book, we do not teach more advanced computer science topics such as data structures, optimization, and algorithm theory. Similarly, we do not cover topics such as web services, interactive graphics, parallel computing, and streaming data processing. The statistical concepts are presented mainly as tools to solve problems, and in-depth theoretical descriptions are not included in this book.