When working with a real-life dataset you will often be confronted with missing data. Data can be missing for many reasons: a technical failure in a machine, participants dropping out of a survey, improper data collection by a researcher, and many more.
Understanding why data are missing is essential for handling the remaining data correctly. Imagine you're running a statistical analysis to find a correlation between gender and income, and women with a high income tend to skip the question “how much do you earn?”. By simply removing the missing data you would throw away vital information, and your results would be biased.
For this example you can assume the data are missing completely at random and can therefore be removed before further analysis.
The na_example vector represents a series of counts. You can quickly examine the object using:
library(dslabs)
data("na_example")
str(na_example)
#> int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...
However, when we compute the average with the function mean, we obtain an NA:
mean(na_example)
#> [1] NA
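The result is NA because a missing value makes any sum that includes it unknown, so the mean is unknown as well. A minimal sketch of this behaviour, using a small toy vector x that is purely illustrative and not part of dslabs:

```r
# Hypothetical toy vector with one missing value (not from dslabs)
x <- c(2, NA, 4)

# Arithmetic that involves NA yields NA, because the true total is unknown
total <- sum(x)
average <- mean(x)

print(total)    # NA
print(average)  # NA
```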
The is.na function returns a logical vector that tells us which entries are NA. Use this function on na_example and assign the resulting logical vector to an object called ind. Determine how many NAs na_example has. Store your answer in na_count.
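To see how is.na behaves before applying it to na_example, here is a small sketch on a made-up vector. The names toy and toy_ind are assumptions chosen for illustration only. Because TRUE counts as 1 and FALSE as 0 in arithmetic, summing a logical vector counts its TRUE entries.

```r
# Hypothetical toy vector (not from dslabs)
toy <- c(5, NA, 7, NA, 9)

# is.na returns one logical value per entry
toy_ind <- is.na(toy)
print(toy_ind)  # FALSE TRUE FALSE TRUE FALSE

# Summing a logical vector counts the TRUE values
toy_na_count <- sum(toy_ind)
print(toy_na_count)  # 2
```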
Now compute the average again, but only for the entries that are not NA. Store the average in vector_mean.
Hint
You will probably need the ! logical operator. For example, !c(TRUE, FALSE, FALSE) returns c(FALSE, TRUE, TRUE).
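As a rough illustration of how negation combines with logical indexing, the sketch below keeps only the non-missing entries of a made-up vector. The names toy and keep are hypothetical and not part of the exercise.

```r
# Hypothetical toy vector (not from dslabs)
toy <- c(5, NA, 7, NA, 9)

# Negating is.na marks the entries that are present
keep <- !is.na(toy)
print(keep)  # TRUE FALSE TRUE FALSE TRUE

# Indexing with a logical vector keeps only the TRUE positions
present <- toy[keep]
print(present)  # 5 7 9
```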
Extra
Note that the mean function has a parameter that can be used to remove the NA values before calculating the mean. See the help page for more information.
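One way to explore this is sketched below on a made-up vector. The parameter in question is na.rm, which is documented on the help page for mean. The name toy is hypothetical.

```r
# Hypothetical toy vector (not from dslabs)
toy <- c(5, NA, 7, NA, 9)

# Running ?mean in an interactive session opens the help page

# na.rm = TRUE drops the NA entries before the mean is computed
toy_mean <- mean(toy, na.rm = TRUE)
print(toy_mean)  # 7
```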