2.4 Measures of Dispersion
Measures of dispersion, or variability, are associated with scale parameters.
2.4.1 Range
The range is the simplest measure of dispersion to calculate and provides quick information about the variability of the data set. \[\begin{equation} R = \max(X) - \min(X) \tag{2.29} \end{equation}\]
Example 2.70 (Range with positive values) The range of the temperatures 6, 4, 9, 20, 7 and 12 is \[R = 20-4 = 16.\]
temp <- c(6, 4, 9, 20, 7, 12) # Data
max(temp) - min(temp) # By Eq. (2.29)
## [1] 16
R <- range(temp) # The 'range' function returns the minimum and maximum
diff(R) # The 'diff' function calculates the difference
## [1] 16
Example 2.71 In Python.
import numpy as np
# Data
temp = np.array([6, 4, 9, 20, 7, 12])
# By Eq. (2.29)
amplitude = max(temp) - min(temp)
print(amplitude) # Output: 16
# Using np.ptp()
amplitude_ptp = np.ptp(temp)
print(amplitude_ptp) # Output: 16
Example 2.72 (Range with negative values) The range of the temperatures \(6,-4,9,20,7,12\) is \[R = 20-(-4) = 24.\]
temp <- c(6, -4, 9, 20, 7, 12) # Data
diff(range(temp)) # By Eq. (2.29)
## [1] 24
Example 2.73 In Python.
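A minimal sketch, mirroring Example 2.71:
import numpy as np
# Data (note the negative value)
temp = np.array([6, -4, 9, 20, 7, 12])
# np.ptp() ("peak to peak") returns max - min
print(np.ptp(temp)) # Output: 24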
2.4.2 Variance
The variance is the main measure of dispersion in Statistics. It is the mean of the squared deviations from the mean, i.e., it measures how much, on average, the data vary around the mean in squared units. The universal or population variance can be calculated by Equations (2.30) and (2.31), and in older texts it is also called the absolute variance. It is also known as the second moment about the mean. \[\begin{equation} \sigma^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N} \tag{2.30} \end{equation}\]
\[\begin{equation} \sigma^2 = \frac{\sum_{i=1}^N x_{i}^2}{N} - \mu^2 \tag{2.31} \end{equation}\]
Example 2.74 The universal variance of the data set \(186,402,191,20,7,124\) can be calculated, noting that \(\mu = \frac{930}{6} = 155\).
Equation (2.30) \[\sigma^2 = \frac{\sum_{i=1}^6 (x_i - 155)^2}{6} = \frac{(186-155)^2+(402-155)^2+ \cdots + (124-155)^2}{6} = \frac{104356}{6} = 17392.\bar{6}\]
Equation (2.31) \[\sigma^2 = \frac{186^2+402^2+191^2+20^2+7^2+124^2}{6} - 155^2 = \frac{248506}{6} - 24025 = 17392.\bar{6}\]
x <- c(186,402,191,20,7,124) # Data
sum((x - mean(x))^2) / length(x) # By Eq. (2.30)
## [1] 17392.67
Example 2.75 In Python.
import numpy as np
# Data
x = np.array([186, 402, 191, 20, 7, 124])
# Calculating the population variance (ddof=0 is the default)
var_p = np.var(x, ddof=0)
print(var_p) # Output: approx. 17392.67
The sample variance can be calculated by Equations (2.32) and (2.33). \[\begin{equation} \hat{\sigma}^2 = s_{n}^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \tag{2.32} \end{equation}\]
\[\begin{equation} \hat{\sigma}^2 = s_{n}^2 = \left( \frac{\sum_{i=1}^n x_{i}^2}{n} - \bar{x}^2 \right) \left( \frac{n}{n-1} \right) \tag{2.33} \end{equation}\]
Example 2.76 The sample variance of the data set \(186,402,191,20,7,124\) can be calculated, again using \(\bar{x} = 155\).
Equation (2.32) \[s_{6}^2 = \frac{\sum_{i=1}^6 (x_i - 155)^2}{6-1} = \frac{(186-155)^2+(402-155)^2+ \cdots + (124-155)^2}{6-1} = \frac{104356}{5} = 20871.2\]
Equation (2.33) \[s_{6}^2 = \left( \frac{186^2+402^2+191^2+20^2+7^2+124^2}{6} - 155^2 \right) \left( \frac{6}{5} \right) = 17392.\bar{6} \times 1.2 = 20871.2\]
x <- c(186,402,191,20,7,124) # Data
var(x) # 'var' calculates the sample variance
## [1] 20871.2
Example 2.77 In Python.
import numpy as np
# Data
x = np.array([186, 402, 191, 20, 7, 124])
# Calculating the sample variance
var_a = np.var(x, ddof=1)
print(var_a) # Output: 20871.2
Thus, if the data set in this example represents a sample of 6 occasions on which the number of steps to the nearest trash can was counted, it can be said that the sample variance is 20871.2 steps\(^2\). Tip: Do not try to interpret this value.
Note from Equation (2.32) that the sample variance is divided by \(n-1\) and not by \(n\). This causes the sample variance to be greater than or equal to the universal variance for the same data. Intuitively, it can be thought of as a kind of penalty applied to this measure when only part of the universe (sample) is observed. Likewise, one can think of the sample variance as the product of the universal variance \(\sigma^2\) and the factor \(n/(n-1)\), described by \[\begin{equation} s_{n}^2 = \sigma^2 \left( \frac{n}{n-1} \right) \tag{2.34} \end{equation}\]
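A quick numerical check of Eq. (2.34), sketched with the data from Example 2.74:
import numpy as np
x = np.array([186, 402, 191, 20, 7, 124])
n = len(x)
# Sample variance computed directly with ddof=1
print(np.var(x, ddof=1)) # Output: 20871.2
# Universal variance times n/(n-1), per Eq. (2.34)
print(np.var(x, ddof=0) * n / (n - 1)) # Output: 20871.2, up to floating-point rounding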
Exercise 2.12 Show that Eq. (2.33) can be written as \(s^2 = \frac{\sum_{i=1}^n x_{i}^2 - n \bar{x}^2}{n-1}\).
2.4.3 Standard deviation
The standard deviation is the square root of the variance. The reason for calculating the standard deviation is that its interpretation is more intuitive than that of the variance, since the standard deviation is expressed in the same unit of measurement as the variable \(X\). The universal and sample standard deviation formulas are given respectively by equations13 (2.35) and (2.36). \[\begin{equation} \sigma = \sqrt{\sigma^2} \tag{2.35} \end{equation}\]
\[\begin{equation} s_{n} = \sqrt{s^{2}_{n}} \tag{2.36} \end{equation}\]
Example 2.78 (Universal standard deviation) From Example 2.74 it is known that the universal variance of the data set \(186,402,191,20,7,124\) is \(\sigma^2 = 17392.\bar{6}\). So the universal standard deviation is \[\sigma = \sqrt{17392.\bar{6}} \approx 131.88126.\]
dat <- c(186,402,191,20,7,124) # Data
(dp.p <- sd(dat) * sqrt(5/6)) # s_n * sqrt((n-1)/n)
## [1] 131.8813
var.p <- sum((dat - mean(dat))^2) / length(dat) # Universal variance
all.equal(dp.p, sqrt(var.p)) # dp.p equals the square root of var.p?
## [1] TRUE
all.equal(dp.p^2, var.p) # dp.p squared equals var.p?
## [1] TRUE
Example 2.79 In Python.
import numpy as np
# Data
dat = np.array([186, 402, 191, 20, 7, 124])
# Calculating the population standard deviation
dp_p = np.std(dat, ddof=1) * np.sqrt((len(dat) - 1) / len(dat))
print(dp_p) # Output: approx. 131.88126
# Calculating the population variance (as in the previous example)
var_p = np.var(dat, ddof=0)
# Checking if dp_p is equal to the square root of var_p
print(np.allclose(dp_p, np.sqrt(var_p))) # Output: True
# Checking if dp_p squared is equal to var_p
print(np.allclose(dp_p**2, var_p)) # Output: True
Example 2.80 From Example 2.76 it is known that the sample variance of the data set \(186,402,191,20,7,124\) is \(s^{2}_{6} = 20871.2\). Thus, the sample standard deviation is \[s_{6} = \sqrt{20871.2} \approx 144.46868.\]
dat <- c(186,402,191,20,7,124) # Data
(dp.a <- sd(dat)) # 'sd' calculates the sample standard deviation
## [1] 144.4687
var.a <- var(dat) # Sample variance
all.equal(dp.a, sqrt(var.a)) # dp.a equals the square root of var.a?
## [1] TRUE
all.equal(dp.a^2, var.a) # dp.a squared equals var.a?
## [1] TRUE
Example 2.81 In Python.
import numpy as np
# Data
dat = np.array([186, 402, 191, 20, 7, 124])
# Calculating the sample standard deviation
dp_a = np.std(dat, ddof=1)
print(dp_a) # Output: approx. 144.46868
# Calculating the sample variance (as in the previous example)
var_a = np.var(dat, ddof=1)
# Checking if dp_a is equal to the square root of var_a
print(np.allclose(dp_a, np.sqrt(var_a))) # Output: True
# Checking if dp_a squared is equal to var_a
print(np.allclose(dp_a**2, var_a)) # Output: True
Thus, if the data set in this example represents a sample of 6 occasions on which the number of steps to the nearest trash can was counted, it can be said that the standard deviation (sample, of course) is approximately 144.5 steps. You can think of this value as an approximate average oscillation around the arithmetic mean.
2.4.4 Coefficient of variation
The coefficient of variation is a measure for comparing variability, since it scales the standard deviation by the mean. It is a dimensionless number, i.e., it has no unit of measurement, which makes data sets with different units or magnitudes comparable in terms of variability.
The universal and sample coefficient of variation formulas are given respectively by Equations (2.37) and (2.38).
\[\begin{equation} \gamma = \frac{\sigma}{\mu} \tag{2.37} \end{equation}\]
\[\begin{equation} \hat{\gamma} = g = \frac{s}{\bar{x}} \tag{2.38} \end{equation}\]
Example 2.82 (Coefficient of variation) Two variables are obtained in a certain chemical experiment. The variable \(X\) is measured in micrograms and has a mean of 0.0045 \(\mu\)g and a standard deviation of 0.0056 \(\mu\)g. The variable \(Y\) is measured in moles and has a mean of 3549 moles and a standard deviation of 419 moles. The coefficient of variation of \(X\) is given by \(g_X=\frac{0.0056}{0.0045} \approx 1.24\), and that of \(Y\) by \(g_Y=\frac{419}{3549} \approx 0.12\). Therefore, since \(1.24 > 0.12\), it follows that the data set \(X\) varies relatively more than \(Y\).
(g.x <- round(0.0056/0.0045, 2)) # Coefficient of variation of X
## [1] 1.24
(g.y <- round(419/3549, 2)) # Coefficient of variation of Y
## [1] 0.12
Example 2.83 In Python.
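A minimal sketch reproducing the calculation from Example 2.82:
# Coefficient of variation: standard deviation divided by the mean, Eq. (2.38)
g_x = 0.0056 / 0.0045 # X, in micrograms
g_y = 419 / 3549 # Y, in moles
print(round(g_x, 2)) # Output: 1.24
print(round(g_y, 2)) # Output: 0.12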
2.4.5 Interquartile Range
According to (DeGroot and Schervish 2012, 233), the interquartile range is the size of the interval that contains the central half of the distribution. \[\begin{equation} IQR = Q_3-Q_1 \tag{2.39} \end{equation}\]
Example 2.84 You can calculate the interquartile range of the data set \(186,402,191,20,7,124\). Note that the stats::IQR() function depends on the type argument, just like stats::quantile(). Both functions use the default algorithm \(\hat{Q}_7(p)\) as per (Hyndman and Fan 1996), i.e., type = 7.
x <- c(186,402,191,20,7,124) # Data
IQR(x) # By Eq. (2.39)
## [1] 143.75
quantile(x, 0.75) - quantile(x, 0.25) # Equivalent; the name '75%' is inherited
## 75%
## 143.75
Example 2.85 In Python.
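A minimal sketch; np.percentile uses the 'linear' method by default, which matches R's type = 7 algorithm:
import numpy as np
x = np.array([186, 402, 191, 20, 7, 124])
# Default 'linear' interpolation corresponds to Hyndman-Fan type 7
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1) # Output: 143.75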
2.4.6 Median Absolute Deviation
According to (DeGroot and Schervish 2012, 670), the median absolute deviation of a random variable \(X\) is the median of the distribution of \(|X - Md|\), where \(Md\) is the median of \(X\). \[\begin{equation} MAD = 1.4826 \, |x - Md|_{\left( \frac{1}{2} (n+1) \right)} \tag{2.40} \end{equation}\]
According to the documentation for the \(\texttt{stats::mad}\) function, the default constant \(1.4826 \approx \frac{1}{\Phi^{-1}(3/4)}\), or \(\texttt{1/qnorm(3/4)}\), guarantees consistency, i.e., \(E[MAD(X_1,\ldots,X_n)] = \sigma\) for \(X_i\) distributed as \(\mathcal{N}(\mu, \sigma^2)\) and \(n\) large.
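This consistency can be illustrated with a simulation sketch (the seed and sample size are arbitrary choices):
import numpy as np
rng = np.random.default_rng(42)
# Large sample from a normal distribution with sigma = 2
x = rng.normal(loc=10, scale=2, size=100_000)
mad = 1.4826 * np.median(np.abs(x - np.median(x)))
print(mad) # Output: approximately 2 for large n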
Example 2.86 You can calculate the median absolute deviation of the data set \(186, 402, 191, 20, 7, 124\).
Step 1: Calculate the median of the data.
\(Md = \frac{124+186}{2}=155\)
Step 2: Take the difference of the data from the median.
\(186-155=31\), \(402-155=247\), \(191-155=36\), \(20-155=-135\), \(7-155=-148\), \(124-155=-31\)
Step 3: Get the absolute value of the difference of the data from the median.
\(31, 247, 36, 135, 148, 31\)
Step 4: Obtain the median of the absolute value of the difference between the data and the median.
Sorting the absolute deviations gives \(31, 31, 36, 135, 148, 247\), so \(|x - Md|_{\left( \frac{1+n}{2} \right)} = \frac{36+135}{2} = 85.5\).
Step 5: Multiply the result from the previous step by the constant 1.4826.
\(MAD = 1.4826 \times 85.5 = 126.7623\)
x <- c(186,402,191,20,7,124) # Data
1.4826 * median(abs(x - median(x))) # Manual calculation, Eq. (2.40)
## [1] 126.7623
mad(x) # 'mad' uses the constant 1.4826 by default
## [1] 126.7623
Example 2.87 In Python.
import numpy as np
x = np.array([186, 402, 191, 20, 7, 124])
# Calculating the median absolute deviation (MAD)
mad = 1.4826 * np.median(np.abs(x - np.median(x)))
print(mad) # Output: approx. 126.7623
Exercise 2.13 Consider Eq. (2.40) and the questions below; a simulation sketch follows the list.
- What are the consequences if \(X_i\) is not distributed as \(\mathcal{N}(\mu, \sigma^2)\)?
- How large would \(n\) have to be to be considered large?
- What are the consequences if \(n\) is not large?
- What are the associations between the simultaneous violation of the normality of \(X_i\) and the size of \(n\)?
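As a starting point for the first question, a simulation sketch using exponential data as an assumed non-normal example:
import numpy as np
rng = np.random.default_rng(7)
# Exponential data with sigma = 1, but far from normal
x = rng.exponential(scale=1.0, size=100_000)
mad = 1.4826 * np.median(np.abs(x - np.median(x)))
print(mad) # Output: about 0.71, no longer close to sigma = 1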
References
DeGroot, Morris H., and Mark J. Schervish. 2012. Probability and Statistics. 4th ed. Boston: Addison-Wesley.
Hyndman, Rob J., and Yanan Fan. 1996. "Sample Quantiles in Statistical Packages." The American Statistician 50 (4): 361–65.
If you are confused by the notation, write \(\sigma^2=V\) and \(\sigma=D\) (as well as \(s^2=v\) and \(s=d\)) and rethink the problem.