For these exercises, we will use data from a survey conducted by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year, and they complete the health examination component of the survey. Part of the data is made available via the NHANES package. Once you have installed the NHANES package, you can load the data like this:

library(dplyr)
library(NHANES)
data(NHANES)

The NHANES dataframe has many missing values. The mean and sd functions in R will return NA if any of the entries of the input vector is an NA. Here is an example:

Example

library(dslabs)
data(na_example)
mean(na_example)
#> [1] NA
sd(na_example)
#> [1] NA

To ignore the NAs, many summary functions, including mean and sd, have an na.rm argument, which stands for "remove NA":

mean(na_example, na.rm = TRUE)
#> [1] 2.3
sd(na_example, na.rm = TRUE)
#> [1] 1.22
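
To see what na.rm does on data small enough to check by hand, here is a minimal sketch using a made-up vector x (a hypothetical example, not part of dslabs or NHANES). It also uses is.na, which is a convenient way to count how many entries are missing before deciding to drop them:

```r
# Hypothetical vector for illustration only (not from dslabs or NHANES)
x <- c(2, 4, NA, 6)

sum(is.na(x))          # number of missing entries: 1
mean(is.na(x))         # proportion of missing entries: 0.25

mean(x)                # NA, because one entry is missing
mean(x, na.rm = TRUE)  # 4, the mean of 2, 4 and 6
sd(x, na.rm = TRUE)    # 2, the standard deviation of 2, 4 and 6
```

Note that na.rm = TRUE silently drops the missing entries, so the result describes only the observed values.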

Let’s now explore the NHANES data.

Exercise

We will provide some basic facts about blood pressure. First let’s select a group to set the standard. We will use 20-to-29-year-old females.

  1. Make a summary dataframe called summary from the NHANES dataframe. The dataframe should contain a mean column and a sd column, holding respectively the mean and standard deviation of the systolic blood pressure (BPSysAve) of females (Gender) aged 20 to 29 (AgeDecade).

    Hint

    Use the filter and summarize functions, and pass the na.rm = TRUE argument when computing the mean and standard deviation.

    Pitfall

    Note that the categories in the AgeDecade column are coded like " 20-29", with a space in front!

    Note that in the Gender column, the values female and male are written without capital letters.

  2. Using a pipe, assign the mean extracted from the dataframe created in question 1 to a numeric variable ref_mean. You can do this using the pull function.
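
The pattern behind both questions can be sketched on a small made-up dataframe. Everything here (toy, group, value, toy_summary, toy_mean) is hypothetical and chosen for illustration; it is not the NHANES data and not the solution itself. The sketch shows why a level with a leading space must be matched exactly, and how pull turns a one-row summary column into a plain numeric value:

```r
library(dplyr)

# Hypothetical toy data for illustration only (not NHANES)
toy <- data.frame(
  group = factor(c(" a", " a", " a", " b", " b")),  # levels start with a space
  value = c(1, 3, NA, 10, 20)
)

# The level must be matched exactly, including the leading space
nrow(filter(toy, group == "a"))   # 0 rows: no level is exactly "a"
nrow(filter(toy, group == " a"))  # 3 rows

# Filter one group, then summarize while ignoring the NA
toy_summary <- toy %>%
  filter(group == " a") %>%
  summarize(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE))

# pull extracts a single column as a vector
toy_mean <- toy_summary %>% pull(mean)
toy_mean  # 2
```

If a filter unexpectedly returns zero rows, calling levels on the column (here levels(toy$group)) is a quick way to see exactly how the categories are spelled.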