When working with a real-world dataset, you will often be confronted with missing data. Data can be missing for many reasons: a technical failure in a machine, participants dropping out of a survey, improper data collection by a researcher, and many more.

Understanding why data are missing is essential for handling the remaining data correctly. Imagine you're doing a statistical analysis to find a correlation between gender and income, and women with a high income tend to skip the question “how much do you earn?”. By simply removing the missing data you would throw away vital information, and your results would be biased.

For this example you can pretend the data are missing completely at random and can therefore be removed for further data analysis.
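
To make the survey scenario concrete, here is a minimal sketch with made-up numbers. The income values and all variable names are assumptions for illustration only; they are not part of any dataset used in this exercise. It compares losing answers that are unrelated to income with losing only the answers of high earners.

```r
# Hypothetical incomes (in thousands) of ten survey participants
income <- c(20, 25, 30, 35, 40, 45, 50, 90, 100, 120)

# Case 1: three answers are lost for reasons unrelated to income
# (positions 3, 6 and 9 stand in for a random dropout)
observed_unrelated <- income[-c(3, 6, 9)]

# Case 2: the three highest earners skip the question
observed_high_skipped <- income[income < 90]

true_mean <- mean(income)                       # 55.5
mean_unrelated <- mean(observed_unrelated)      # about 54.3
mean_high_skipped <- mean(observed_high_skipped) # 35
```

In both cases three answers are missing, but only in the second case does the mean of the remaining answers move far away from the true mean.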

Example

The na_example vector represents a series of counts. You can quickly examine the object using:

library(dslabs)
data("na_example")  
str(na_example)
#>  int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...

However, when we compute the average with the function mean, we obtain an NA:

mean(na_example)
#> [1] NA
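
The NA is not an error. In R, most arithmetic that involves an NA yields NA, because the result depends on a value that is unknown. A minimal sketch with a hypothetical toy vector x (not part of dslabs):

```r
# Hypothetical toy vector with one missing entry
x <- c(4, NA, 6)

plus_one <- NA + 1    # NA: an unknown value plus one is still unknown
total <- sum(x)       # NA: the sum depends on the missing entry
average <- mean(x)    # NA: so does the mean
has_missing <- anyNA(x)  # TRUE: at least one entry is NA
```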

Exercise

  1. The is.na function returns a logical vector that tells us which entries are NA. Use this function on na_example and assign the resulting logical vector to an object called ind. Determine how many NAs na_example has. Store your answer in na_count.

  2. Now compute the average again, but only for the entries that are not NA. Store the average in vector_mean.

    Hint

    You will probably need the ! logical operator. For example: !c(TRUE, FALSE, FALSE) will return c(FALSE, TRUE, TRUE).

    Extra

    Note that the mean function has a parameter that can be used to remove the NA values before calculating the mean. See the help page for more information.
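
As a sketch of how the ! operator from the hint combines with logical indexing, the following uses a hypothetical vector scores and an unrelated condition. The names and values are illustrative assumptions, not part of the exercise data.

```r
# Hypothetical scores; is_low marks the entries below 10
scores <- c(12, 7, 15, 3)
is_low <- scores < 10      # FALSE TRUE FALSE TRUE

# Negating the logical vector flips every entry
not_low <- !is_low         # TRUE FALSE TRUE FALSE

# A logical vector inside [ ] keeps only the entries that are TRUE
kept <- scores[not_low]    # 12 15
```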