For these exercises, we will use data from a survey conducted by the United States National Center for Health Statistics (NCHS). The center has run a series of health and nutrition surveys since the 1960s. Since 1999, about 5,000 individuals of all ages have been interviewed each year and have completed the health examination component of the survey. Part of the data is available via the NHANES package. Once you install the NHANES package, you can load the data like this:
library(NHANES)
data(NHANES)
The NHANES data has many missing values. The mean and sd functions in R will return NA if any of the entries of the input vector is an NA. Here is an example:
library(dslabs)
data(na_example)
mean(na_example)
#> [1] NA
sd(na_example)
#> [1] NA
To ignore the NAs we can use the na.rm argument:
mean(na_example, na.rm = TRUE)
#> [1] 2.3
sd(na_example, na.rm = TRUE)
#> [1] 1.22
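As a small illustration of what na.rm = TRUE does, the sketch below uses a made-up vector x (not part of dslabs or NHANES) and checks that dropping the missing entries by hand gives the same average. The variable names are chosen only for this example.

```r
# Hypothetical toy vector with one missing entry
x <- c(1, 2, NA, 4)

# Without na.rm, a single NA makes the result NA
mean(x)

# Count how many entries are missing
n_missing <- sum(is.na(x))

# Two equivalent ways of ignoring the missing entry
mean_narm   <- mean(x, na.rm = TRUE)
mean_manual <- mean(x[!is.na(x)])

# Both approaches give the same answer
all.equal(mean_narm, mean_manual)
```

Checking sum(is.na(x)) before summarizing is a quick way to see how much data na.rm = TRUE will silently drop.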
Let’s now explore the NHANES data.
1. We will provide some basic facts about blood pressure. First let’s
select a group to set the standard. We will use 20-to-29-year-old females.
AgeDecade is a categorical variable with these ages. Note
that the category is coded like “ 20-29”, with a space in front! What
are the average and standard deviation of systolic blood pressure (the blood pressure can be found in the BPSysAve
column)? The result should be a data frame with a mean and sd column. Store this data frame as summary.
Hint: Use filter and summarize and use the na.rm = TRUE argument
when computing the average and standard deviation. You can also filter
the NA values using filter.
2. Using a pipe, assign the average to a numeric variable ref_mean.
Hint: Use code similar to the above and then pull.
3. Now do the same thing, but instead of calculating the mean and
standard deviation, report the minimum and maximum values (again for
20-to-29-year-old females). This time store the result in a data frame
min_max with a min and a max column.
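The pattern these exercises call for (filter, then summarize, then pull) can be sketched on a made-up data frame rather than on NHANES itself. Everything below (toy, group, sex, value and the result names) is hypothetical and chosen only for illustration. The group labels carry a leading space to mimic the AgeDecade coding, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, NOT the NHANES table; note the leading space in group
toy <- data.frame(
  group = c(" a", " a", " a", " a", " b", " b"),
  sex   = c("female", "female", "female", "male", "female", "female"),
  value = c(10, 14, NA, 30, 20, 40)
)

# Forgetting the leading space matches no rows at all
n_wrong <- toy %>% filter(group == "a") %>% nrow()

# Filter to one group, then summarize while ignoring NAs
toy_summary <- toy %>%
  filter(group == " a" & sex == "female") %>%
  summarize(mean = mean(value, na.rm = TRUE),
            sd   = sd(value, na.rm = TRUE))

# pull extracts a single column as a plain numeric vector
toy_mean <- toy_summary %>% pull(mean)

# Same filter, different summary functions
toy_min_max <- toy %>%
  filter(group == " a" & sex == "female") %>%
  summarize(min = min(value, na.rm = TRUE),
            max = max(value, na.rm = TRUE))
```

Here toy_summary and toy_min_max are one-row data frames, while toy_mean is a numeric value, which is the distinction exercise 2 asks for. Adapting the sketch to the exercises means swapping in the NHANES column names given above.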