The heights we have been looking at are not the original heights reported by students. The original reported heights are also included in the dslabs package and can be loaded like this:
library(dslabs)
data("reported_heights")
Height is a character vector so we create a new column with the numeric version:
reported_heights <- reported_heights %>%
mutate(original_heights = height, height = as.numeric(height))
#> Warning: Problem with `mutate()` input `height`.
#> x NAs introduced by coercion
#> ℹ Input `height` is `as.numeric(height)`.
#> Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
Note that we get a warning about NAs. This is because some of the self reported heights were not numbers. We can see why we get these:
reported_heights %>% filter(is.na(height)) %>% head()
#> time_stamp sex height original_heights
#> 1 2014-09-02 15:16:28 Male NA 5' 4"
#> 2 2014-09-02 15:16:37 Female NA 165cm
#> 3 2014-09-02 15:16:52 Male NA 5'7
#> 4 2014-09-02 15:16:56 Male NA >9000
#> 5 2014-09-02 15:16:56 Male NA 5'7"
#> 6 2014-09-02 15:17:09 Female NA 5'3"
Some students self-reported their heights using feet and inches rather than just inches. Others used centimeters and others were just trolling. For now we will remove these entries:
reported_heights <- filter(reported_heights, !is.na(height))
If we compute the average and standard deviation, we notice that we obtain strange results. The average and standard deviation are different from the median and MAD:
reported_heights %>%
group_by(sex) %>%
summarize(average = mean(height), sd = sd(height),
median = median(height), MAD = mad(height))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 5
#> sex average sd median MAD
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Female 63.4 27.9 64.2 4.05
#> 2 Male 103. 530. 70 4.45
This suggests that we have outliers, which is confirmed by creating a boxplot:
We can see some rather extreme values. To see what these values are, we
can quickly look at the largest values using the arrange
function:
reported_heights %>% arrange(desc(height)) %>% top_n(10, height)
#> time_stamp sex height original_heights
#> 1 2014-09-03 23:55:37 Male 11111 11111
#> 2 2016-04-10 22:45:49 Male 10000 10000
#> 3 2015-08-10 03:10:01 Male 684 684
#> 4 2015-02-27 18:05:06 Male 612 612
#> 5 2014-09-02 15:16:41 Male 511 511
#> 6 2014-09-07 20:53:43 Male 300 300
#> 7 2014-11-28 12:18:40 Male 214 214
#> 8 2017-04-03 16:16:57 Male 210 210
#> 9 2015-11-24 10:39:45 Male 192 192
#> 10 2014-12-26 10:00:12 Male 190 190
#> 11 2016-11-06 10:21:02 Female 190 190
The first seven entries look like strange errors. However, the next few look like they were entered as centimeters instead of inches. Since 184 cm is equivalent to six feet tall, we suspect that 184 was actually meant to be 72 inches.
We can review all the nonsensical answers by looking at the data considered to be far out by Tukey:
whisker <- 3*IQR(reported_heights$height)
max_height <- quantile(reported_heights$height, .75) + whisker
min_height <- quantile(reported_heights$height, .25) - whisker
reported_heights %>%
filter(!between(height, min_height, max_height)) %>%
select(original_heights) %>%
head(n=10) %>% pull(original_heights)
#> [1] "6" "5.3" "511" "6" "2" "5.25" "5.5" "11111"
#> [9] "6" "6.5"
Examining these heights carefully, we see two common mistakes: entries
in centimeters, which turn out to be too large, and entries of the form
x.y
with x
and y
representing feet and inches, respectively, which
turn out to be too small. Some of the even smaller values, such as 1.6,
could be entries in meters.
In the Data Wrangling part of this book we will learn techniques for correcting these values and converting them into inches. Here we were able to detect this problem using careful data exploration to uncover issues with the data: the first step in the great majority of data science projects.