2.3 Measures of Location
The measures of location or position are associated with location parameters.
2.3.1 Minimum and Maximum
The minimum of a distribution is its smallest observed value; analogously, the maximum is the largest. They are order statistics, more specifically the extremes of an ordered list. For a distribution of \(n\) elements they are denoted by \(\min X = x_{(1)}\) and \(\max X = x_{(n)}\).
Despite the simplicity of these measures, there are sophisticated theoretical considerations about them. For more details, see (S. Kotz and Nadarajah 2000).
Example 2.20 (Minimum and maximum) Assume again the \(n=100\) observations of the variable Y: ‘height of women assisted in a hospital’, presented in Example ??. The minimum and maximum are denoted, respectively, by \(\min Y = y_{(1)} = 1.51\) and \(\max Y = y_{(100)} = 1.74\).
## [1] 1.51
## [1] 1.74
## [1] 1.51 1.74
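As a minimal sketch of how such values can be obtained in R, the snippet below uses a small hypothetical vector of heights (an assumption for illustration, not the hospital data set) and the base functions min, max and range.

```r
# Hypothetical heights in metres (illustrative only, NOT the hospital data)
y <- c(1.62, 1.51, 1.74, 1.58, 1.66)

min(y)        # smallest observation, y_(1)
max(y)        # largest observation, y_(n)
range(y)      # both extremes at once

# The extremes are the first and last order statistics
y_sorted <- sort(y)
c(y_sorted[1], y_sorted[length(y_sorted)])
```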
2.3.2 (Arithmetic) Mean
The (arithmetic) mean or (arithmetic) average is one of the most important measures in Statistics due to its properties and relative ease of calculation. The mean of the variable \(X\) is generically symbolized by \(\mu\) when referring to the universe (population) mean, and by \(\bar{x}\) when referring to the sample mean. The notation \(\bar{x}_{n}\) can be used to make the sample size explicit. Their expressions in the universe and in the sample are given respectively by Equations (2.8) and (2.9). Because it spreads the sum of the values evenly over the number of observations, the mean indicates the center of mass of the distribution.
\[\begin{equation} \mu = \frac{\sum_{i=1}^N x_i}{N} \tag{2.8} \end{equation}\]
\[\begin{equation} \bar{x}_{n} = \frac{\sum_{i=1}^n x_i}{n} \tag{2.9} \end{equation}\]
Example 2.21 (Arithmetic mean) Assume again the data from Example 1.6. The average number of steps to the nearest trash can was \[\bar{x}_6 = \frac{\sum_{i=1}^6 x_i}{6} = \frac{186+402+191+20+7+124}{6} = \frac{930}{6} = 155.\]
## [1] 155
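The calculation in Example 2.21 can be sketched in R as follows, writing Equation (2.9) out explicitly and comparing it with the built-in mean function.

```r
x <- c(186, 402, 191, 20, 7, 124)  # steps, from Example 2.21

sum(x) / length(x)  # Equation (2.9) written out
mean(x)             # built-in equivalent
```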
Weighted (arithmetic) mean
The weighted (arithmetic) mean allows different weights \(w_i\) to be assigned to the observations \(x_i\).
\[\begin{equation} \bar{x}_{n} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} \tag{2.10} \end{equation}\]
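A minimal sketch of Equation (2.10) in R is shown below. The weights w are hypothetical, chosen only for illustration; weighted.mean is the base R (stats) function for this calculation.

```r
x <- c(186, 402, 191, 20, 7, 124)  # values from Example 2.21
w <- c(1, 1, 2, 2, 3, 3)           # hypothetical weights, illustration only

xw_manual  <- sum(w * x) / sum(w)  # Equation (2.10) written out
xw_builtin <- weighted.mean(x, w)  # built-in equivalent
c(manual = xw_manual, builtin = xw_builtin)

# With equal weights the weighted mean reduces to the ordinary mean
weighted.mean(x, rep(1, length(x)))
```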
Trimmed Mean
Trimming an array of data means removing a fraction (usually between 0 and 0.5) of the observations from each end of the sorted array. The trimmed mean is the simple arithmetic mean of the trimmed vector. A formal definition can be found in (Yuen 1974, 166).
## [1] 159.49
## [1] 60.68367347
## [1] 60.68367347
## [1] 51.5
## [1] 51.5
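The values printed above are consistent with the following sketch, which assumes the vector x is the one defined in the Winsorized mean example (98 regular values plus two extreme ones). It relies on the trim argument of the base mean function.

```r
x <- c(2:99, 1000, 10000)  # assumed vector, containing two extreme values

mean(x)               # untrimmed mean, pulled up by the extremes
mean(x, trim = 0.01)  # drops 1% of the observations from each end
mean(sort(x)[2:99])   # same trimming done by hand
mean(x, trim = 0.5)   # maximal trimming returns the median
median(x)
```

Note how removing a single observation from each end lowers the mean from about 159 to about 61.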
Winsorized Mean
Winsorizing an (ordered) array means replacing a certain proportion of extreme values with less extreme values. Thus, the surrogate values are the most extreme retained values. A formal definition can be found in (Yuen 1974, 166).
x <- c(2:99,1000,10000) # Original vector, containing extreme values
(xw <- DescTools::Winsorize(x, probs = c(0.01, 0.99)))
## [1] 2.99 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00
## [15] 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00
## [29] 30.00 31.00 32.00 33.00 34.00 35.00 36.00 37.00 38.00 39.00 40.00 41.00 42.00 43.00
## [43] 44.00 45.00 46.00 47.00 48.00 49.00 50.00 51.00 52.00 53.00 54.00 55.00 56.00 57.00
## [57] 58.00 59.00 60.00 61.00 62.00 63.00 64.00 65.00 66.00 67.00 68.00 69.00 70.00 71.00
## [71] 72.00 73.00 74.00 75.00 76.00 77.00 78.00 79.00 80.00 81.00 82.00 83.00 84.00 85.00
## [85] 86.00 87.00 88.00 89.00 90.00 91.00 92.00 93.00 94.00 95.00 96.00 97.00 98.00 99.00
## [99] 1000.00 1090.00
## [1] 159.49
## [1] 70.3999
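As a hedged alternative that does not depend on a particular version of DescTools (whose Winsorize signature may differ between releases), the same clamping can be sketched in base R. The approach below, clamping to the 1% and 99% quantiles with pmin and pmax, is an assumption about what the call above does, although it reproduces the printed results.

```r
x <- c(2:99, 1000, 10000)                # original vector, with extreme values

q  <- quantile(x, probs = c(0.01, 0.99)) # default (type 7) quantiles
xw <- pmin(pmax(x, q[[1]]), q[[2]])      # clamp every value to [q1%, q99%]

range(xw)  # extremes after Winsorizing
mean(x)    # mean of the original vector
mean(xw)   # Winsorized mean
```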
2.3.3 Total
The total is the sum of all values of a variable. The universe total is given by Equation (2.11) and its estimate from a sample by Equation (2.12).
\[\begin{equation} \tau = \sum_{i=1}^N x_i \tag{2.11} \end{equation}\]
\[\begin{equation} \hat{\tau} = N \bar{x}_{n}, \tag{2.12} \end{equation}\]
where \(\bar{x}_{n}\) is the sample mean, presented in Equation (2.9).
Example 2.22 (Total) Consider again the data from Example 2.21. If someone needs a trash can 60 times in the capital of Rio Grande do Sul, it is estimated that the total number of steps to be walked is \[\hat{\tau} = \frac{60}{6} \times 930 = 60 \times 155 = 9300\]
N <- 60 # Universe/population size
x <- c(186,402,191,20,7,124) # Raw data
N*mean(x) # Equation (2.12)
## [1] 9300
2.3.4 Mean Square
The mean square is the mean of the squared values, used in the calculation of variances. \[\begin{equation} MS = \frac{\sum_{i=1}^n x_{i}^{2}}{n}. \tag{2.13} \end{equation}\]
The root mean square (RMS) is the square root of the mean square. \[\begin{equation} RMS=\sqrt{MS}. \tag{2.14} \end{equation}\]
Example 2.23 (MS and RMS) The mean square of the values 186, 402, 191, 20, 7 and 124 is \[MS = \frac{\sum_{i=1}^6 x_{i}^{2}}{6} = \frac{186^2+402^2+191^2+20^2+7^2+124^2}{6} = \frac{248506}{6} = 41417.\bar{6}.\] The RMS (root mean square) is \[RMS = \sqrt{41417.\bar{6}} \approx 203.5133.\]
## [1] 41417.66667
## [1] 203.5133083
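Equations (2.13) and (2.14) translate directly into R. The following sketch reproduces the values of Example 2.23; the names ms and rms are chosen here for illustration.

```r
x <- c(186, 402, 191, 20, 7, 124)  # values from Example 2.23

ms  <- mean(x^2)  # Equation (2.13): mean of the squared values
rms <- sqrt(ms)   # Equation (2.14): root mean square
ms
rms
```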
2.3.5 Mode
Mode(s) is (are) the most frequent value(s) in a distribution. When there is only one mode, the distribution is known as unimodal. If there are two modes, the distribution is bimodal. Three modes configure a trimodal distribution, and four or more modes indicate a multimodal distribution. Distributions in which all values have the same frequency are said to be amodal. When data are grouped, the modal class must be indicated, i.e., the class with the highest frequency. Computationally, finding the mode amounts to counting frequencies.
In R there is the Mode function from the pracma package, but it only works well in the unimodal case. Therefore, the Modes function is presented below, adapted from the suggestion by digEmAll in this StackOverflow discussion. The following examples compare the two approaches.
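# Modes function: returns every value that attains the highest frequency
Modes <- function(x) {
  ux <- sort(unique(x))          # distinct values, in ascending order
  tab <- tabulate(match(x, ux))  # frequency of each distinct value
  ux[tab == max(tab)]            # value(s) with the maximum frequency
}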
Example 2.24 (Unimodal) The mode of the data set 4, 7, 1, 3, 3, 9 is \(Mo=3\), as it has a frequency of 2 while the other values have a frequency of 1. This is a unimodal distribution.
## [1] 3
## [1] 3
Example 2.25 (Bimodal) The modes of the data set 4, 7, 1, 3, 3, 9, 7 are \(Mo'=3\) and \(Mo''=7\), as both have frequency 2 while the other values have frequency 1. The order in which the modes are listed is irrelevant. This is a bimodal distribution.
## [1] 3 7
## [1] 3
Example 2.26 (Amodal) The data set 4, 7, 1, 3, 9 is said to be amodal because all values have frequency 1.
## [1] 1 3 4 7 9
## [1] 1
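For reference, the three examples above can be run with the Modes function as sketched below. The function is repeated so that the block is self-contained; the comparison with pracma::Mode is left out to avoid requiring that package. Note that in the amodal case Modes returns every distinct value, since all of them share the maximum frequency.

```r
# Modes function, as defined earlier in this section
Modes <- function(x) {
  ux <- sort(unique(x))
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

uni <- c(4, 7, 1, 3, 3, 9)     # Example 2.24
bi  <- c(4, 7, 1, 3, 3, 9, 7)  # Example 2.25
amo <- c(4, 7, 1, 3, 9)        # Example 2.26

Modes(uni)  # a single mode
Modes(bi)   # two modes
Modes(amo)  # all values tie, so all are returned
table(bi)   # the underlying frequency count
```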
Example 2.27 (Mode for grouped data) In Example 2.18 it is observed that \(f_{3}=41\) is the highest frequency. The modal class is therefore the third, spanning the values from 1.60 to 1.65.
2.3.6 Quantile
Quantiles are measures that divide an ordered set of data into \(k\) parts of equal size. The basic method consists of sorting the data and finding (albeit approximately) the values that divide the distribution according to the desired \(k\). The computational effort to calculate any separatrix is, therefore, that of sorting the data. In general, a separatrix \(S\) can be defined according to Eq. (2.15), where \(n\) indicates the number of observations and \(p\) the proportion of ordered observations below \(S\).
\[\begin{equation} S = x_{(p(n+1))} \tag{2.15} \end{equation}\]
The stats::quantile function has nine methods for obtaining quantiles, so consulting its documentation is recommended for more details. With it, the desired quantiles are easily obtained just by adjusting the probs argument, which plays the role of \(p\). Note that the function labels the returned quantiles as percentages, where \(0\%\) corresponds to the minimum and \(100\%\) to the maximum.
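To make Eq. (2.15) concrete, the sketch below implements the position \(p(n+1)\) with linear interpolation between neighbouring order statistics. The helper name separatrix is hypothetical, and clamping the position to the range 1 to \(n\) is an assumption made here for values of \(p\) near 0 or 1. This rule appears to correspond to type = 6 in stats::quantile, whereas the function's default is type = 7, so results may differ slightly from the default output.

```r
# Hypothetical helper: Eq. (2.15) with linear interpolation
separatrix <- function(x, p) {
  xs <- sort(x)
  n  <- length(xs)
  h  <- min(max(p * (n + 1), 1), n)  # position p(n+1), kept inside 1..n
  lo <- floor(h)
  hi <- ceiling(h)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

x <- c(10, -4, 11, 12, 1, 5, 15)    # the seven values of Example 2.29

separatrix(x, 0.5)                  # position 0.5 * 8 = 4
quantile(x, probs = 0.5, type = 6)  # same rule via stats::quantile
quantile(x, probs = 0.5)            # default type 7, for comparison
```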
Median (\(k=2\))
The median is the measure that leaves half of the sorted data (list) to its left and the other half to its right, i.e., it is the central measure in terms of sorting. Its position is the average of the first and last positions.
\[\begin{equation}
Pos = \frac{1+n}{2}
\tag{2.16}
\end{equation}\]
Example 2.28 The median can be defined by Eq. (2.17).
\[\begin{equation} Md = x_{\left( \frac{1}{2} (n+1) \right)} \tag{2.17} \end{equation}\]
## [1] 50
## 50%
## 50
Example 2.29 (Median for \(n\) odd) Let the data set be 10, -4, 11, 12, 1, 5, 15, formed by \(n=7\) values. When ordered, we obtain the list -4, 1, 5, 10, 11, 12, 15. Considering \(k=2\), we obtain the quantile \(Md=10\), as it divides the set into two parts of the same size (three values below the median 10 and three values above). Its position is given by \(Pos=\frac{1+7}{2}=4\).
## [1] 7
## [1] 4
## [1] -4 1 5 10 11 12 15
## [1] 10
Example 2.30 (Median for \(n\) even) When the number of observations is even, just take the average of the two central values of the ordered list. Let the data set be 15, -4, 11, 12, 1, 5, formed by \(n=6\) values. When ordered, we obtain the list -4, 1, 5, 11, 12, 15. Considering again \(k=2\), we obtain the quantile \(Md=\frac{5+11}{2}=8\), because it divides the set into two parts of the same size (three values below 8 and three values above). Its position is given by \(Pos=\frac{1+6}{2}=3.5\), i.e., the median is an intermediate value between the third and fourth positions.
## [1] 6
## [1] 3.5
## [1] -4 1 5 11 12 15
## [1] 8
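The two cases above can be sketched together in R. The helper names pos and median_by_position are hypothetical, introduced only to mirror Eq. (2.16); the built-in median function gives the same results.

```r
x_odd  <- c(10, -4, 11, 12, 1, 5, 15)  # Example 2.29, n = 7
x_even <- c(15, -4, 11, 12, 1, 5)      # Example 2.30, n = 6

# Hypothetical helper: position of the median, Eq. (2.16)
pos <- function(x) (1 + length(x)) / 2

# Hypothetical helper: average of the order statistics around that position
median_by_position <- function(x) {
  p <- pos(x)
  mean(sort(x)[c(floor(p), ceiling(p))])
}

pos(x_odd);  median_by_position(x_odd);  median(x_odd)
pos(x_even); median_by_position(x_even); median(x_even)
```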
Example 2.31 The first and third quartiles can be defined respectively by Eq. (2.18) and (2.19).
\[\begin{equation} Q_1 = x_{\left( \frac{1}{4} (n+1) \right)} \tag{2.18} \end{equation}\]
\[\begin{equation} Q_3 = x_{\left( \frac{3}{4} (n+1) \right)} \tag{2.19} \end{equation}\]
## 25% 75%
## 25 75
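The vector behind the output above is not shown. Assuming, purely for illustration, that it is 0:100, the default method of stats::quantile reproduces the printed quartiles, while the \(p(n+1)\) rule of Eqs. (2.18) and (2.19), available as type = 6, gives slightly different values.

```r
x <- 0:100  # assumed vector (n = 101), not confirmed by the text

quantile(x, probs = c(0.25, 0.75))            # default type 7
quantile(x, probs = c(0.25, 0.75), type = 6)  # positions 0.25(n+1), 0.75(n+1)
```

This illustrates why the choice of method matters when comparing hand calculations with software output.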
A dataset can be divided into \(k\) sectors, the main ones being shown in the following table.
\(k\) | \(p\) | Name | Symbol |
---|---|---|---|
2 | 1/2 | Median | Md |
3 | 1/3, 2/3 | Tertile | \(T_1\), \(T_2\) |
4 | 1/4, 2/4, 3/4 | Quartile | \(Q_1\), \(Q_2\), \(Q_3\) |
10 | 1/10, …, 9/10 | Decile | \(D_1\), \(D_2\), \(\ldots\), \(D_9\) |
100 | 1/100, …, 99/100 | Percentile | \(P_1\), \(P_2\), \(\ldots\), \(P_{99}\) |
Example 2.32 Some quantiles.
h <- read.csv('https://filipezabala.com/data/hospital.csv')
options(digits = 4) # To improve the presentation
quantile(h$height, probs = seq(0, 1, 1/2)) # Median
## 0% 50% 100%
## 1.510 1.625 1.740
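quantile(h$height, probs = seq(0, 1, 1/3)) # Tertiles
## 0% 33.33333% 66.66667% 100%
## 1.51 1.61 1.65 1.74
quantile(h$height, probs = seq(0, 1, 1/4)) # Quartiles
## 0% 25% 50% 75% 100%
## 1.510 1.598 1.625 1.650 1.740
quantile(h$height, probs = seq(0, 1, 1/10)) # Deciles
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.510 1.569 1.590 1.600 1.616 1.625 1.640 1.650 1.660 1.680 1.740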
Exercise 2.9 Consider the separatrices discussed in this section.
a. Check that the median (Md) and the second quartile (\(Q_2\)) are equivalent.
b. Are there other measures equivalent to those in item (a)? Justify.
c. Consider some \(k\) different from those presented and assign a name and symbology.
d. If there are \(k\) ‘slices’, how many quantiles are there?
Suggestion: Chapter ??
Exercise 2.10 Using the quantile function, calculate the separatrices discussed in this section with the children column data available in https://filipezabala.com/data/hospital.csv.
Suggestion: Chapter ??
2.3.7 5-number summary
The 5-number summary was suggested by (Tukey 1977). It comprises the minimum, the lower hinge, the median, the upper hinge and the maximum. The lower hinge is the median of the values between the minimum and the median of the entire set. The upper hinge is the median of the values between the median of the entire set and the maximum. Depending on the algorithm used to calculate the quartiles, the hinges may differ slightly from the first and third quartiles.
Example 2.33 Consider the dataset used by (Tukey 1977, 33).
## [1] -3.2 0.1 1.5 3.0 9.8
## 0% 25% 50% 75% 100%
## -3.2 0.1 1.5 3.0 9.8
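In this example the hinges and the quartiles coincide, but that is not always the case. As a small sketch with a hypothetical vector (not Tukey's data), the base functions fivenum, which returns Tukey's five numbers, and quantile, with its default method, can be compared directly.

```r
x <- 1:6  # hypothetical data, chosen so that hinges and quartiles differ

fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
quantile(x)  # 0%, 25%, 50%, 75%, 100% with the default type 7
```

Here the hinges are 2 and 5, while the default quartiles are 2.25 and 4.75.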