For these exercises, we will be using data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Since 1999, about 5,000 individuals of all ages have been interviewed every year, and they complete the health examination component of the survey. Part of the data is made available via the NHANES package. Once you install the NHANES package, you can load the data like this:
library(NHANES)
data(NHANES)
The NHANES data has many missing values. The mean and sd
functions in R will return NA if any of the entries of the input
vector is an NA. Here is an example:
library(dslabs)
data(na_example)
mean(na_example)
#> [1] NA
sd(na_example)
#> [1] NA
To ignore the NAs, we can use the na.rm argument:
mean(na_example, na.rm = TRUE)
#> [1] 2.3
sd(na_example, na.rm = TRUE)
#> [1] 1.22
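To see what na.rm = TRUE does, here is a minimal sketch on a small made-up vector (x is a hypothetical example, not part of na_example or NHANES). Removing the NAs by hand with is.na gives the same result as using the na.rm argument:

```r
# Hypothetical toy vector with one missing entry
x <- c(1, 2, NA, 4)

# Count the missing entries
n_missing <- sum(is.na(x))

# Two equivalent ways of ignoring the NA
mean_na_rm <- mean(x, na.rm = TRUE)
mean_manual <- mean(x[!is.na(x)])

# Both approaches agree
identical(mean_na_rm, mean_manual)
#> [1] TRUE
```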
Let’s now explore the NHANES data.
1. We will provide some basic facts about blood pressure. First, let’s
select a group to set the standard. We will use 20-to-29-year-old females.
AgeDecade is a categorical variable that gives each person’s age group by decade. Note
that the category is coded like “ 20-29”, with a space in front! What
are the average and standard deviation of systolic blood pressure (the blood pressure can be found in the BPSysAve column)? The result should be a data frame with a mean column and an sd column. Store this data frame as summary.
Hint: Use filter and summarize, and pass the na.rm = TRUE argument
when computing the average and standard deviation. Alternatively, you can
remove the NA values using filter.
2. Using a pipe, assign the average to a numeric variable ref_mean.
Hint: Use code similar to the above and then pull.
3. Now do the same thing, but instead of calculating the mean and
standard deviation, report the minimum and maximum values (again for
20-to-29-year-old females). This time store the result in a data frame
min_max with a min and a max column.
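The hints above rely on three dplyr functions: filter, summarize, and pull. As a rough sketch of how they fit together, the code below uses a small made-up data frame (toy, with hypothetical columns group and value) rather than the NHANES data, so it illustrates the pattern without solving the exercises:

```r
library(dplyr)

# Hypothetical toy data, not NHANES: a grouping column and a measurement with one NA
toy <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  value = c(10, 12, NA, 20, 22)
)

# filter keeps the rows of interest; summarize collapses them into a one-row data frame
toy_summary <- toy %>%
  filter(group == "a") %>%
  summarize(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE))

# Alternative: drop the NA rows inside filter instead of using na.rm
toy_summary_2 <- toy %>%
  filter(group == "a", !is.na(value)) %>%
  summarize(mean = mean(value), sd = sd(value))

# pull extracts one column of the result as a plain numeric vector
toy_mean <- toy %>%
  filter(group == "a") %>%
  summarize(mean = mean(value, na.rm = TRUE)) %>%
  pull(mean)
```

Here toy_summary and toy_summary_2 hold the same values, and toy_mean is a single number rather than a data frame, which is the difference that pull makes.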