For these exercises, we will use data from a survey conducted by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Since 1999, about 5,000 individuals of all ages have been interviewed each year, and they complete the health examination component of the survey. Part of the data is made available via the NHANES package. Once you install the NHANES package, you can load the data like this:
library(dplyr)
library(NHANES)
data(NHANES)
The NHANES dataframe has many missing values. The mean and sd
functions in R will return NA if any of the entries of the input
vector is an NA. Here is an example:
library(dslabs)
data(na_example)
mean(na_example)
#> [1] NA
sd(na_example)
#> [1] NA
To ignore the NAs, many summary functions, including mean and sd, have an na.rm argument, which stands for “remove NA”:
mean(na_example, na.rm = TRUE)
#> [1] 2.3
sd(na_example, na.rm = TRUE)
#> [1] 1.22
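To make the effect of na.rm concrete, here is a minimal sketch on a small made-up vector (the values in x are an assumption for illustration, not part of na_example). It counts the missing entries with is.na and compares the results with and without na.rm = TRUE:

```r
# Toy vector, made up for illustration, with two missing entries
x <- c(2, 4, NA, 6, NA, 8)

# is.na() returns TRUE for each missing entry, so sum() counts them
n_missing <- sum(is.na(x))

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(x)

# With na.rm = TRUE, the NAs are dropped before computing,
# so these match mean(c(2, 4, 6, 8)) and sd(c(2, 4, 6, 8))
mean_no_na <- mean(x, na.rm = TRUE)
sd_no_na <- sd(x, na.rm = TRUE)
```

Checking the number of missing entries first is a useful habit: it tells you how much data the summary is actually based on.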
Let’s now explore the NHANES data.
We will provide some basic facts about blood pressure. First let’s select a group to set the standard. We will use 20-to-29-year-old females.
Make a summary dataframe called summary from the NHANES dataframe. It should contain a mean and an sd column, holding respectively the mean and standard deviation of the systolic blood pressure (BPSysAve) of females (Gender) aged 20 to 29 years (AgeDecade).
Hint
Use the filter and summarize functions, and use the na.rm = TRUE argument when computing the average and standard deviation.
Pitfall
Note that the categories in the AgeDecade column are coded like " 20-29", with a leading space. Also note that in the Gender column, female and male are written in lowercase.
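The filter-then-summarize pattern, and the leading-space pitfall, can be sketched on a toy dataframe. Everything here (the toy dataframe and its grp, sex, and value columns) is hypothetical and stands in for NHANES only to show the mechanics. It is not the solution to the exercise:

```r
library(dplyr)

# Hypothetical toy data (not NHANES); labels carry a leading space
toy <- data.frame(
  grp   = c(" 20-29", " 20-29", " 20-29", " 30-39", " 20-29"),
  sex   = c("female", "female", "female", "female", "male"),
  value = c(100, NA, 110, 120, 130)
)

# Matching without the leading space silently returns zero rows
n_unmatched <- toy %>% filter(grp == "20-29") %>% nrow()

# Matching the label exactly as stored keeps the intended rows;
# na.rm = TRUE drops the missing value before summarizing
toy_summary <- toy %>%
  filter(grp == " 20-29", sex == "female") %>%
  summarize(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE))
```

When a filter unexpectedly returns no rows, printing the stored labels, for example with unique(toy$grp) or levels() on a factor column, usually reveals this kind of mismatch.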
Using a pipe, assign the mean from the dataframe created in the previous exercise to a numeric variable called ref_mean. You can do this using the pull function.
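As a rough illustration of what pull does, the sketch below uses a hypothetical one-row dataframe, toy_summary, with made-up values standing in for a real summary. The point is only that pull returns a plain vector rather than a dataframe:

```r
library(dplyr)

# Hypothetical one-row summary with made-up values
toy_summary <- data.frame(mean = 105, sd = 7.07)

# pull() extracts one column as a plain vector
toy_ref <- toy_summary %>% pull(mean)

is.numeric(toy_ref)     # TRUE
is.data.frame(toy_ref)  # FALSE
```

By contrast, selecting the column with select(mean) would return a one-column dataframe, which is less convenient to use in later arithmetic or comparisons.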