2.4 Measures of Dispersion

The measures of dispersion or variability are associated with scale parameters.

2.4.1 Range

The range is the simplest measure of dispersion to calculate, and provides quick information about the variability of the data set. \[\begin{equation} R = \max{X} - \min{X} \tag{2.20} \end{equation}\]

Example 2.34 (Range with positive values) The range of temperatures 6, 4, 9, 20, 7 and 12 is \[R = 20-4 = 16.\]

temp <- c(6,4,9,20,7,12)  # Data
max(temp)-min(temp)       # By Eq. (2.20)
## [1] 16
R <- range(temp)          # The 'range' function returns the minimum and maximum
diff(R)                   # The 'diff' function calculates the difference
## [1] 16

Example 2.35 (Range with negative values) The range of temperatures 6, -4, 9, 20, 7 and 12 is \[R = 20-(-4) = 24.\]

temp <- c(6,-4,9,20,7,12) # Data
diff(range(temp))         # Nested functions
## [1] 24

2.4.2 Variance

The variance is the main measure of dispersion in Statistics. It is the mean of the squared deviations from the mean, i.e., it measures, on average, the squared distance of the data from the mean. The universal (population) variance can be calculated by Equations (2.21) and (2.22); older texts also call it the absolute variance.

\[\begin{equation} \sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N} \tag{2.21} \end{equation}\]

\[\begin{equation} \sigma^2 = \frac{\sum_{i=1}^N x_{i}^2}{N} - \mu^2 \tag{2.22} \end{equation}\]

Example 2.36 The universal variance of the data set 186, 402, 191, 20, 7 and 124 is

Equation (2.21) \[\sigma^2 = \frac{\sum_{i=1}^6 (x_i - 155)^2}{6} = \frac{(186-155)^2+(402-155)^2+ \cdots + (124-155)^2}{6} = \frac{104356}{6} = 17392.\bar{6}\]

Equation (2.22) \[\sigma^2 = \frac{186^2+402^2+191^2+20^2+7^2+124^2}{6} - 155^2 = \frac{248506}{6} - 24025 = 17392.\bar{6}\]

(var.p <- var(c(186,402,191,20,7,124))*(5/6))   # (Sample variance)*(1/correction factor)
## [1] 17392.67
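
As a cross-check, Equations (2.21) and (2.22) can also be evaluated term by term. The sketch below treats the six values as the whole universe; the names `N`, `mu`, `v1` and `v2` are chosen here only for illustration.

```r
x  <- c(186, 402, 191, 20, 7, 124)   # Data from Example 2.36
N  <- length(x)                      # Universe size
mu <- mean(x)                        # Universal mean, 155
(v1 <- sum((x - mu)^2) / N)          # Eq. (2.21)
## [1] 17392.67
(v2 <- sum(x^2) / N - mu^2)          # Eq. (2.22)
## [1] 17392.67
all.equal(v1, v2)                    # Both forms agree up to rounding error
## [1] TRUE
```

Equation (2.22) needs only the sum of squares and the mean, but it subtracts two large numbers, so Equation (2.21) is usually the safer form in floating-point arithmetic.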

The sample variance can be calculated by Equations (2.23) and (2.24).

\[\begin{equation} \hat{\sigma}^2 = s_{n}^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \tag{2.23} \end{equation}\]

\[\begin{equation} \hat{\sigma}^2 = s_{n}^2 = \left( \frac{\sum_{i=1}^n x_{i}^2}{n} - \bar{x}^2 \right) \left( \frac{n}{n-1} \right) \tag{2.24} \end{equation}\]

Example 2.37 The sample variance of the data set 186, 402, 191, 20, 7 and 124 is

Equation (2.23) \[s_{6}^2 = \frac{\sum_{i=1}^6 (x_i - 155)^2}{6-1} = \frac{(186-155)^2+(402-155)^2+ \cdots + (124-155)^2}{6-1} = \frac{104356}{5} = 20871.2\]

Equation (2.24) \[s_{6}^2 = \left( \frac{186^2+402^2+191^2+20^2+7^2+124^2}{6} - 155^2 \right) \left( \frac{6}{5} \right) = 17392.\bar{6} \times 1.2 = 20871.2\]

(var.a <- var(c(186,402,191,20,7,124)))     # 'var' calculates the sample variance
## [1] 20871.2
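
The same result can be obtained by writing out Equations (2.23) and (2.24) directly, which makes the role of the divisor \(n-1\) explicit. This is only a sketch; the names `xbar`, `s2.a` and `s2.b` are illustrative.

```r
x    <- c(186, 402, 191, 20, 7, 124)               # Data from Example 2.37
n    <- length(x)                                  # Sample size
xbar <- mean(x)                                    # Sample mean, 155
(s2.a <- sum((x - xbar)^2) / (n - 1))              # Eq. (2.23)
## [1] 20871.2
(s2.b <- (sum(x^2) / n - xbar^2) * (n / (n - 1)))  # Eq. (2.24)
## [1] 20871.2
all.equal(s2.a, var(x))                            # Matches the built-in 'var'
## [1] TRUE
```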

Thus, if the data set in this example is a sample of 6 occasions on which the number of steps to the nearest trash can was counted, the sample variance is 20871.2 steps\(^2\). Tip: Do not try to interpret this value directly, since it is expressed in squared units.

Note from Equation (2.23) that the sample variance divides by \(n-1\) rather than by \(n\). This makes the sample variance greater than or equal to the universal variance computed from the same data. Intuitively, it can be thought of as a kind of penalty applied to this measure when only part of the universe (a sample) is observed. Likewise, the sample variance can be seen as the product of the universal variance \(\sigma^2\), computed from the same data, and the factor \(n/(n-1)\), as described by

\[\begin{equation} s_{n}^2 = \sigma^2 \left( \frac{n}{n-1} \right) \tag{2.25} \end{equation}\]
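
A quick numerical check of Equation (2.25), followed by a look at how the factor \(n/(n-1)\) behaves as \(n\) grows. The sample sizes in `sizes` are arbitrary values picked for illustration.

```r
x <- c(186, 402, 191, 20, 7, 124)        # Data from Examples 2.36 and 2.37
n <- length(x)
sigma2 <- mean((x - mean(x))^2)          # Universal variance, Eq. (2.21)
all.equal(var(x), sigma2 * n / (n - 1))  # Eq. (2.25)
## [1] TRUE
sizes <- c(2, 6, 30, 100, 1000)          # Illustrative sample sizes (assumed)
round(sizes / (sizes - 1), 3)            # The factor n/(n-1) approaches 1
## [1] 2.000 1.200 1.034 1.010 1.001
```

The "penalty" is therefore largest for very small samples and becomes negligible for large ones.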

2.4.3 Standard Deviation

The standard deviation is the square root of the variance. It is calculated because its interpretation is more intuitive than that of the variance, since the standard deviation is expressed in the same unit of measurement as the variable \(X\). The universal and sample standard deviation formulas are given respectively by Equations\(^{12}\) (2.26) and (2.27).

\[\begin{equation} \sigma = \sqrt{\sigma^2} \tag{2.26} \end{equation}\]

\[\begin{equation} s_{n} = \sqrt{s^{2}_{n}} \tag{2.27} \end{equation}\]

Example 2.38 (Universal standard deviation) From Example 2.36 it is known that the universal variance of the data set 186, 402, 191, 20, 7 and 124 is \(\sigma^2 = 17392.\bar{6}\). So the universal standard deviation is \[\sigma = \sqrt{17392.\bar{6}} \approx 131.88126.\]

dat <- c(186,402,191,20,7,124)    # Data
(dp.p <- sd(dat) * sqrt(5/6))     # s_n * sqrt(1/correction factor)
## [1] 131.8813
all.equal(dp.p, sqrt(var.p))      # 'dp.p' is equal to the square root of 'var.p'
## [1] TRUE
all.equal(dp.p^2, var.p)          # 'dp.p' squared equals 'var.p'
## [1] TRUE

Example 2.39 From Example 2.37 it is known that the sample variance of the data set 186, 402, 191, 20, 7 and 124 is \(s^{2}_{6} = 20871.2\). Thus, the sample standard deviation is \[s_{6} = \sqrt{20871.2} \approx 144.46868.\]

dat <- c(186,402,191,20,7,124)    # Data
(dp.a <- sd(dat))                 # 'sd' calculates the sample standard deviation
## [1] 144.4687
all.equal(dp.a, sqrt(var.a))      # 'dp.a' is equal to the square root of 'var.a'
## [1] TRUE
all.equal(dp.a^2, var.a)          # 'dp.a' squared equals 'var.a'
## [1] TRUE

Thus, if the data set in this example is a sample of 6 occasions on which the number of steps to the nearest trash can was counted, the sample standard deviation is approximately 144.5 steps. You can think of this value as an approximate average oscillation around the arithmetic mean.
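
The point about units can be illustrated by rescaling the data. The sketch below assumes a purely hypothetical conversion of 0.75 metres per step: the standard deviation scales with the unit, while the variance scales with its square.

```r
steps  <- c(186, 402, 191, 20, 7, 124)        # Data from Example 2.39, in steps
metres <- 0.75 * steps                        # Hypothetical conversion: 0.75 m per step
all.equal(sd(metres), 0.75 * sd(steps))       # sd is in the unit of the data
## [1] TRUE
all.equal(var(metres), 0.75^2 * var(steps))   # var is in the squared unit
## [1] TRUE
```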

2.4.4 Interquartile Range

\[\begin{equation} IQR = Q_3-Q_1 \tag{2.28} \end{equation}\]

x <- c(186,402,191,20,7,124)
IQR(x)
## [1] 143.75
quantile(x,3/4)-quantile(x,1/4)
##    75% 
## 143.75
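
In Equation (2.28), \(Q_1\) and \(Q_3\) are the first and third quartiles. The value 143.75 depends on how the quartiles are interpolated: by default, `quantile` uses its `type = 7` rule, which interpolates linearly between order statistics at position \((n-1)p + 1\). The sketch below reproduces that rule by hand; the names `xs`, `h1`, `h3`, `q1` and `q3` are illustrative, and other software or other `type` values may give different quartiles.

```r
x  <- c(186, 402, 191, 20, 7, 124)
xs <- sort(x)                          # Ordered data: 7 20 124 186 191 402
n  <- length(xs)
h1 <- (n - 1) * 1/4 + 1                # Fractional position of Q1: 2.25
h3 <- (n - 1) * 3/4 + 1                # Fractional position of Q3: 4.75
# Linear interpolation between the neighbouring order statistics
q1 <- xs[floor(h1)] + (h1 - floor(h1)) * (xs[ceiling(h1)] - xs[floor(h1)])
q3 <- xs[floor(h3)] + (h3 - floor(h3)) * (xs[ceiling(h3)] - xs[floor(h3)])
c(Q1 = q1, Q3 = q3, IQR = q3 - q1)     # Q1 = 46, Q3 = 189.75, IQR = 143.75
all.equal(c(q1, q3), unname(quantile(x, c(1/4, 3/4))))
## [1] TRUE
```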

2.4.5 Median Absolute Deviation

\[\begin{equation} MAD = 1.4826 |x - Md|_{\left( \frac{1}{2} (n+1) \right)} \tag{2.29} \end{equation}\]

x <- c(186,402,191,20,7,124)
mad(x)
## [1] 126.7623
1.4826*median(abs(x-median(x)))
## [1] 126.7623

According to the documentation for the \(\texttt{stats::mad}\) function, the default constant \(1.4826 \approx 1/\Phi^{-1}(3/4)\), or \(\texttt{1/qnorm(3/4)}\), ensures consistency, i.e., \(E[MAD(X_1,\ldots,X_n)] = \sigma\) for \(X_i\) distributed as \(\mathcal{N}(\mu, \sigma^2)\) and large \(n\).
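
The consistency statement can be explored with a small simulation. This is only a sketch: the seed, the sample size and the parameters \(\mu = 10\) and \(\sigma = 2\) are arbitrary choices made here, and the simulated values will differ with other choices.

```r
1 / qnorm(3/4)                      # Default constant used by 'mad'
## [1] 1.482602
set.seed(1)                         # Arbitrary seed, for reproducibility
z <- rnorm(1e5, mean = 10, sd = 2)  # Simulated normal data (assumed mu = 10, sigma = 2)
mad(z)                              # Expected to be close to sigma = 2
sd(z)                               # Also expected to be close to 2
```

With normal data and a large \(n\), both estimates should land near \(\sigma\); repeating the experiment with a small \(n\) or a non-normal distribution is one way to approach the exercise that follows.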

Exercise 2.11 Consider Eq. (2.29).

  1. What are the consequences if \(X_i\) is not distributed as \(\mathcal{N}(\mu, \sigma^2)\)?
  2. What would count as a large \(n\)?
  3. What are the consequences if \(n\) is not large?
  4. How do a violation of the normality of \(X_i\) and a small \(n\) interact when both occur at the same time?


2.4.6 Coefficient of Variation

The coefficient of variation is a measure for comparing variability, as it expresses the standard deviation relative to the mean. It is a dimensionless number, i.e., it has no unit of measurement, which makes data sets measured in different units comparable in terms of variability.

The universal and sample coefficient of variation formulas are given respectively by Equations (2.30) and (2.31).

\[\begin{equation} \gamma = \frac{\sigma}{\mu} \tag{2.30} \end{equation}\]

\[\begin{equation} \hat{\gamma} = g = \frac{s}{\bar{x}} \tag{2.31} \end{equation}\]

Example 2.40 (Coefficient of variation) Two variables are obtained in a certain chemical experiment. The variable X is measured in micrograms and has a mean of 0.0045 \(\mu\)g and a standard deviation of 0.0056 \(\mu\)g. The variable Y is measured in moles and has a mean of 3549 moles and a standard deviation of 419 moles. The coefficient of variation of X is given by \(g_X=\frac{0.0056}{0.0045} \approx 1.24\), and that of Y by \(g_Y=\frac{419}{3549} \approx 0.12\). Therefore, since \(1.24 > 0.12\), X varies more than Y relative to its mean.

mx <- 0.0045
dx <- 0.0056
round(gx <- dx/mx, 2)   # Coefficient of variation of X
## [1] 1.24
my <- 3549
dy <- 419
round(gy <- dy/my, 2)   # Coefficient of variation of Y
## [1] 0.12
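
When the raw data are available, Equation (2.31) can be wrapped in a small helper. The function name `cv` is chosen here for illustration and is not a base R function; the 0.75 metres-per-step conversion is likewise hypothetical and serves only to show that the coefficient does not depend on the unit.

```r
cv <- function(x) sd(x) / mean(x)         # Hypothetical helper for Eq. (2.31)
steps <- c(186, 402, 191, 20, 7, 124)     # Data from Example 2.39
round(cv(steps), 2)                       # Sample coefficient of variation
## [1] 0.93
all.equal(cv(steps), cv(0.75 * steps))    # Changing the unit leaves g unchanged
## [1] TRUE
```

Note that the ratio is only meaningful when the mean is not zero and the variable is measured on a scale with a true zero.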

  12. If you are confused by the notation, write \(\sigma^2=V\) and \(\sigma=D\) (as well as \(s^2=v\) and \(s=d\)) and rethink the problem.