For these exercises, we will use data from a survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and have completed the health examination component of the survey. Part of the data is made available via the NHANES package. Once you install the NHANES package, you can load the data like this:
library(dplyr)
library(NHANES)
data(NHANES)
The NHANES dataframe has many missing values. The mean and sd functions in R will return NA if any of the entries of the input vector is an NA. Here is an example:
library(dslabs)
data(na_example)
mean(na_example)
#> [1] NA
sd(na_example)
#> [1] NA
To ignore the NAs, most functions have an na.rm argument, which stands for "remove NA":
mean(na_example, na.rm = TRUE)
#> [1] 2.3
sd(na_example, na.rm = TRUE)
#> [1] 1.22
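To see what na.rm = TRUE does, it can help to check it by hand on a small vector. The sketch below uses a made-up vector x (not part of dslabs or NHANES) and shows that the result should match dropping the NA entries yourself with is.na:

```r
# Toy vector, made up for illustration; not taken from any dataset
x <- c(1, 2, NA, 4)

mean(x)                # NA, because one entry is missing
mean(x, na.rm = TRUE)  # mean of the remaining values 1, 2 and 4
mean(x[!is.na(x)])     # same computation, dropping the NA by hand
sum(is.na(x))          # number of missing entries in x
```

Counting the missing entries first, as in the last line, is a quick way to judge how much data a summary is actually based on.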
Let’s now explore the NHANES data.
We will provide some basic facts about blood pressure. First let’s select a group to set the standard. We will use 20-to-29-year-old females.
1. Make a summary dataframe called summary from the NHANES dataframe. The dataframe should contain a mean and a sd column, respectively holding the mean and standard deviation of the blood pressure (BPSysAve) of females (Gender) aged 20 to 29 years (AgeDecade).
Hint
Use the filter and summarize functions, and use the na.rm = TRUE argument when computing the average and standard deviation.
Pitfall
Note that the categories in the AgeDecade column are coded like " 20-29", with a space in front! Also note that in the Gender column, female and male are written without capital letters.
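The hint and the pitfall can be tried out on a small made-up table before turning to the real data. The sketch below is only an illustration: the toy dataframe and its values are hypothetical, and only the column names and label coding mimic NHANES. In the real NHANES data these two columns are factors rather than character vectors, but comparing them with == against a string should behave the same way.

```r
library(dplyr)

# Toy data, made up for illustration; only the column names mimic NHANES
toy <- data.frame(
  Gender    = c("female", "female", "female", "male", "female"),
  AgeDecade = c(" 20-29", " 20-29", " 20-29", " 20-29", " 30-39"),
  BPSysAve  = c(110, NA, 114, 120, 115)
)

# Without the leading space, no label matches and no rows are kept
n_no_space <- toy %>% filter(AgeDecade == "20-29") %>% nrow()
n_no_space

# Match the labels exactly, then summarize while ignoring the NA
toy_summary <- toy %>%
  filter(Gender == "female", AgeDecade == " 20-29") %>%
  summarize(mean = mean(BPSysAve, na.rm = TRUE),
            sd   = sd(BPSysAve, na.rm = TRUE))
toy_summary
```

If a filter unexpectedly returns zero rows, printing the distinct labels, for example with unique(toy$AgeDecade), usually reveals stray spaces or capitalization differences.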
2. Using a pipe, assign the mean extracted from the dataframe created in question 1 to a numeric variable ref_mean. You can do this using the pull function.
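As a minimal sketch of how pull differs from select, the example below uses a hypothetical one-row dataframe, toy_summary, standing in for the result of question 1 (its values are made up). pull returns the column as a plain vector, whereas select keeps it wrapped in a dataframe:

```r
library(dplyr)

# Hypothetical one-row summary with made-up values
toy_summary <- data.frame(mean = 112, sd = 2.83)

# pull() extracts the column as a plain numeric vector
toy_mean <- toy_summary %>% pull(mean)

class(toy_summary %>% select(mean))  # "data.frame"
class(toy_mean)                      # "numeric"
```

Having a plain number rather than a one-cell dataframe makes the value easier to reuse in later arithmetic.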