For data that is approximately normally distributed, it is convenient to think in terms of standard units. The standard unit of a value tells us how many standard deviations away from the average it is. Specifically, for a value \(x\) from a vector \(X\), we define the value of \(x\) in standard units as \(z = (x - m)/s\), with \(m\) and \(s\) the average and standard deviation of \(X\), respectively. Why is this convenient?
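For example, with made-up numbers (an average of 69 inches and an SD of 3 inches, assumed here only for illustration), a height of 75 inches corresponds to \(z = (75 - 69)/3 = 2\), that is, 2 standard deviations above the average.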
First look back at the formula for the normal distribution and note that what is being exponentiated is \(-z^2/2\) with \(z\) equivalent to \(x\) in standard units. Because the maximum of \(e^{-z^2/2}\) is when \(z=0\), this explains why the maximum of the distribution occurs at the average. It also explains the symmetry since \(- z^2/2\) is symmetric around 0. Second, note that if we convert the normally distributed data to standard units, we can quickly know if, for example, a person is about average (\(z=0\)), one of the largest (\(z \approx 2\)), one of the smallest (\(z \approx -2\)), or an extremely rare occurrence (\(z > 3\) or \(z < -3\)). Remember that it does not matter what the original units are: these rules apply to any data that is approximately normal.
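To see the first point concretely, here is a minimal sketch (independent of the heights data) that evaluates \(e^{-z^2/2}\) on a grid and checks where the peak is and that the curve is symmetric:

z_grid <- seq(-3, 3, 0.01)
f <- exp(-z_grid^2 / 2)
z_grid[which.max(f)]   # the maximum occurs at z = 0, the average in standard units
all.equal(f, rev(f))   # the curve is symmetric around 0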
In R, we can obtain standard units using the function scale:
z <- scale(x)
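As a sanity check, scale is equivalent to applying the formula above directly. This sketch assumes x still holds the vector of male heights from earlier; because scale returns a one-column matrix, we convert it back to a vector before comparing:

z_manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), z_manual)   # the two computations agree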
Now to see how many men are within 2 SDs from the average, we simply type:
mean(abs(z) < 2)
#> [1] 0.95
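We can compare this empirical proportion directly with what the normal approximation predicts for the interval within 2 SDs of the average:

pnorm(2) - pnorm(-2)   # about 0.95, matching the observed proportion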
The proportion is about 95%, which is what the normal distribution predicts! To further confirm that the approximation is in fact a good one, we can use quantile-quantile plots.