## 2.3 Measures of Location

The measures of *location* or *position* are associated with location parameters.

### 2.3.1 Minimum and Maximum

The *minimum* of a distribution is the smallest observed value of that distribution; analogously, the *maximum* is the largest value. They are order statistics, more specifically the extremes of an ordered list. For a distribution of \(n\) elements they are denoted by \(\min X = x_{(1)}\) and \(\max X = x_{(n)}\).

Despite the simplicity of these measures, there are sophisticated theoretical considerations about them. For more details, see (S. Kotz and Nadarajah 2000).

**Example 2.20** (Minimum and maximum) Assume again the \(n=100\) observations of the variable \(Y\): ‘height of women assisted in a hospital’, presented in Example **??**. The minimum and maximum are, respectively, \(\min Y = y_{(1)} = 1.51\) and \(\max Y = y_{(100)} = 1.74\).

`## [1] 1.51`

`## [1] 1.74`

`## [1] 1.51 1.74`
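The three outputs above have the shape of calls to `min`, `max` and `range`. A minimal sketch of those calls follows, using the six step counts from Example 2.21 as stand-in data (an assumption, so that the block runs without downloading the hospital data).

```r
# Stand-in data: step counts from Example 2.21 (not the hospital heights)
x <- c(186, 402, 191, 20, 7, 124)

x_min <- min(x)     # smallest value, x_(1)
x_max <- max(x)     # largest value, x_(n)
x_rng <- range(x)   # both extremes at once: c(min, max)

# The extremes are the first and last order statistics
x_sorted <- sort(x)
c(x_sorted[1], x_sorted[length(x)])
```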

### 2.3.2 (Arithmetic) Mean

The *(arithmetic) mean* or *(arithmetic) average* is one of the most important measures in Statistics due to its properties and relative ease of calculation. The mean of the variable \(X\) is generically symbolized by \(\mu\) when referring to the universal (population) mean, and by \(\bar{x}\) when referring to the sample mean. The notation \(\bar{x}_{n}\) can be used to indicate the sample size. Their expressions in the universe and in the sample are given respectively by Equations (2.8) and (2.9). Because it distributes the sum of the values evenly over the number of observations, the mean indicates the center of mass of the distribution.
\[\begin{equation}
\mu = \frac{\sum_{i=1}^N x_i}{N}
\tag{2.8}
\end{equation}\]

\[\begin{equation} \bar{x}_{n} = \frac{\sum_{i=1}^n x_i}{n} \tag{2.9} \end{equation}\]

**Example 2.21** (Arithmetic mean) Assume again the data from Example 1.6. The average number of steps to the nearest trash can was \[\bar{x}_6 = \frac{\sum_{i=1}^6 x_i}{6} = \frac{186+402+191+20+7+124}{6} = \frac{930}{6} = 155.\]

`## [1] 155`
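The value in Example 2.21 can be checked directly in R. The sketch below computes Equation (2.9) both from its definition and with the built-in `mean`.

```r
x <- c(186, 402, 191, 20, 7, 124)  # Raw data from Example 2.21
n <- length(x)

xbar_def <- sum(x) / n   # Equation (2.9) written out
xbar     <- mean(x)      # Built-in equivalent
xbar
```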

#### Weighted (arithmetic) mean

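The weighted (arithmetic) mean allows different weights \(w_i\) to be assigned to the observations, as in Equation (2.10). When all weights are equal, it reduces to the simple mean of Equation (2.9).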

\[\begin{equation} \bar{x}_{n} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} \tag{2.10} \end{equation}\]
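As an illustration of Equation (2.10), the sketch below uses hypothetical grades and weights (both invented for this example) and compares the formula written out with `stats::weighted.mean`.

```r
x <- c(6, 8, 10)   # Hypothetical grades
w <- c(1, 2, 2)    # Hypothetical weights

wm_def <- sum(w * x) / sum(w)          # Equation (2.10) written out
wm     <- stats::weighted.mean(x, w)   # Built-in equivalent
wm  # (6 + 16 + 20) / 5 = 8.4

# With equal weights the weighted mean equals the simple mean
wm_equal <- stats::weighted.mean(x, rep(1, length(x)))
```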

#### Trimmed Mean

*Trimming* an array of data means removing a fraction (usually between 0 and 0.5) of the observations from each end of the sorted array. The *trimmed mean* is the simple arithmetic mean of the trimmed vector. A formal definition can be found in (Yuen 1974, 166).

`## [1] 159.49`

`## [1] 60.68367347`

`## [1] 60.68367347`

`## [1] 51.5`

`## [1] 51.5`
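The calls that produced the outputs above are not shown in the text. The values are consistent with the vector `x` used in the Winsorized mean section, so the following is a reconstruction under that assumption rather than the original code.

```r
x <- c(2:99, 1000, 10000)        # 100 values, two of them extreme

m_raw  <- mean(x)                # untrimmed mean: 159.49
m_trim <- mean(x, trim = 0.01)   # drops 1% (one value) from each end
m_hand <- mean(sort(x)[2:99])    # same trimming done by hand
m_half <- mean(x, trim = 0.5)    # maximal trimming returns the median
m_med  <- median(x)              # 51.5
```

Trimming a single observation from each end moves the mean from 159.49 to about 60.68, which shows how strongly the two extreme values pull the untrimmed mean.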

#### Winsorized Mean

To *Winsorize* an (ordered) array means to replace a certain proportion of its extreme values with less extreme ones. Thus, the surrogate values are the most extreme values retained. A formal definition can be found in (Yuen 1974, 166).

```
x <- c(2:99,1000,10000) # Original vector, containing extreme values
(xw <- DescTools::Winsorize(x, probs = c(0.01, 0.99)))
```

```
## [1] 2.99 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00
## [15] 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00
## [29] 30.00 31.00 32.00 33.00 34.00 35.00 36.00 37.00 38.00 39.00 40.00 41.00 42.00 43.00
## [43] 44.00 45.00 46.00 47.00 48.00 49.00 50.00 51.00 52.00 53.00 54.00 55.00 56.00 57.00
## [57] 58.00 59.00 60.00 61.00 62.00 63.00 64.00 65.00 66.00 67.00 68.00 69.00 70.00 71.00
## [71] 72.00 73.00 74.00 75.00 76.00 77.00 78.00 79.00 80.00 81.00 82.00 83.00 84.00 85.00
## [85] 86.00 87.00 88.00 89.00 90.00 91.00 92.00 93.00 94.00 95.00 96.00 97.00 98.00 99.00
## [99] 1000.00 1090.00
```

`## [1] 159.49`

`## [1] 70.3999`
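The two values above are consistent with the means of `x` and `xw`. Since the argument names of `DescTools::Winsorize` may differ between package versions, the sketch below reproduces the computation in base R only. Clamping at the 1% and 99% quantiles with `pmin`/`pmax` is an assumption about what the call does, chosen because it matches the printed vector.

```r
x <- c(2:99, 1000, 10000)                        # Original vector
q <- unname(quantile(x, probs = c(0.01, 0.99)))  # Lower and upper cut-offs

# Clamp: values below q[1] become q[1], values above q[2] become q[2]
xw <- pmin(pmax(x, q[1]), q[2])

range(xw)   # 2.99 and 1090, as in the printed vector
mean(x)     # 159.49
mean(xw)    # 70.3999
```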

### 2.3.3 Total

The *total* is the sum of all values of a variable. The universal total is given by Equation (2.11) and its estimate from a sample by Equation (2.12).

\[\begin{equation} \tau = \sum_{i=1}^N x_i \tag{2.11} \end{equation}\]

\[\begin{equation} \hat{\tau} = N \bar{x}_{n}, \tag{2.12} \end{equation}\]

where \(\bar{x}_{n}\) is the *sample mean*, presented in Equation (2.9).

**Example 2.22** (Total) Consider again the data from Example 2.21. If someone needs a trash can 60 times in the capital of Rio Grande do Sul, the estimated total number of steps to be walked is \[\hat{\tau} = \frac{60}{6} \times 930 = 60 \times 155 = 9300.\]

```
N <- 60 # Universe/population size
x <- c(186,402,191,20,7,124) # Raw data
N*mean(x) # Estimated total, Equation (2.12)
```

`## [1] 9300`

### 2.3.4 Mean Square

The *mean square* is the mean of the squared values, used in the calculation of variances.
\[\begin{equation}
MS = \frac{\sum_{i=1}^n x_{i}^{2}}{n}.
\tag{2.13}
\end{equation}\]

The *root mean square* (RMS) is the square root of the mean square.
\[\begin{equation}
RMS=\sqrt{MS}.
\tag{2.14}
\end{equation}\]

**Example 2.23** (MS and RMS) The mean square of the values 186, 402, 191, 20, 7 and 124 is \[MS = \frac{\sum_{i=1}^6 x_{i}^{2}}{6} = \frac{186^2+402^2+191^2+20^2+7^2+124^2}{6} = \frac{248506}{6} = 41417.\bar{6}.\] The RMS (root mean square) is \[RMS = \sqrt{41417.\bar{6}} \approx 203.5133.\]

`## [1] 41417.66667`

`## [1] 203.5133083`
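The two outputs above can be obtained with a short sketch like the following, which applies Equations (2.13) and (2.14) to the values of Example 2.23.

```r
x <- c(186, 402, 191, 20, 7, 124)  # Values from Example 2.23

ms  <- mean(x^2)   # Equation (2.13): 248506 / 6
rms <- sqrt(ms)    # Equation (2.14)
ms
rms
```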

### 2.3.5 Mode

*Mode(s)* is (are) the most frequent value(s) in a distribution. When there is only one mode, the distribution is known as *unimodal*. If there are two modes, the distribution is *bimodal*. Three modes configure a *trimodal* distribution, and four or more modes indicate a *multimodal* distribution. Distributions with equal frequencies for all values are said to be *amodal*. When data are grouped, the *modal class* must be indicated, i.e., the class with the highest frequency. Computationally, finding the mode amounts to counting the frequency of each value.

In R there is the `Mode` function from the `pracma` package, but it only works well in the unimodal case. Therefore, the `Modes` function is presented below, adapted from the suggestion by digEmAll in this StackOverflow discussion. The following examples compare the two approaches.

```
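# Modes function: returns every value that attains the highest frequency
Modes <- function(x) {
  ux <- sort(unique(x))          # distinct values, sorted
  tab <- tabulate(match(x, ux))  # frequency of each distinct value
  ux[tab == max(tab)]            # value(s) with maximal frequency
}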
```

**Example 2.24** (Unimodal) The mode of the data set 4, 7, 1, 3, 3, 9 is \(Mo=3\), as it has a frequency of 2 while the other values have a frequency of 1. This is a unimodal distribution.

`## [1] 3`

`## [1] 3`

**Example 2.25** (Bimodal) The modes of the data set 4, 7, 1, 3, 3, 9, 7 are \(Mo'=3\) and \(Mo''=7\), as both have frequency 2 while the other values have frequency 1. The order of presentation is irrelevant. This is a bimodal distribution.

`## [1] 3 7`

`## [1] 3`

**Example 2.26** (Amodal) The data set 4, 7, 1, 3, 9 is said to be *amodal* because all values have frequency 1.

`## [1] 1 3 4 7 9`

`## [1] 1`
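The first output of each of Examples 2.24 to 2.26 can be reproduced with the `Modes` function, restated here only so that the block runs on its own. The `pracma::Mode` calls are left out to avoid depending on an extra package.

```r
Modes <- function(x) {
  ux <- sort(unique(x))
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

uni <- Modes(c(4, 7, 1, 3, 3, 9))      # unimodal: 3
bi  <- Modes(c(4, 7, 1, 3, 3, 9, 7))   # bimodal: 3 7
amo <- Modes(c(4, 7, 1, 3, 9))         # amodal: every value is returned

# The underlying counts can be inspected with table()
table(c(4, 7, 1, 3, 3, 9, 7))
```

Note that in the amodal case `Modes` returns all values, so the caller has to recognize that situation, for instance by comparing the length of the result with the number of distinct values.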

**Example 2.27** (Mode for grouped data) In Example 2.18, \(f_{3}=41\) is the highest frequency. The modal class is therefore the third, between the values 1.60 and 1.65.

### 2.3.6 Quantile

Quantiles are measures that divide an ordered set of data into \(k\) equal parts. The basic method consists of sorting the data and finding (albeit approximately) the values that divide the distribution according to the desired \(k\). The computational effort to calculate any separatrix (dividing value) is, therefore, that of sorting the data. In general, a separatrix \(S\) can be defined according to Eq. (2.15), where \(n\) indicates the number of observations and \(p\) the proportion of ordered observations below \(S\).

\[\begin{equation} S = x_{(p(n+1))} \tag{2.15} \end{equation}\]

The `stats::quantile` function has nine methods for obtaining quantiles, so its documentation is recommended for more details. With it, the desired quantiles can be obtained simply by adjusting the `probs` argument. Note that the function labels the returned quantiles as percentages, where \(0\%\) corresponds to the minimum and \(100\%\) to the maximum.
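When \(p(n+1)\) is not an integer, Eq. (2.15) needs an interpolation rule. The sketch below uses linear interpolation between neighbouring order statistics, which is an assumption added on top of Eq. (2.15); the helper name `separatrix` is hypothetical. Under that assumption the result agrees with `type = 6` of `stats::quantile`, whereas the function's default is `type = 7`.

```r
# Hypothetical helper: Eq. (2.15) with linear interpolation (an assumption)
separatrix <- function(x, p) {
  xs  <- sort(x)
  n   <- length(xs)
  pos <- min(max(p * (n + 1), 1), n)  # position p(n+1), kept inside 1..n
  lo  <- floor(pos)
  hi  <- ceiling(pos)
  xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
}

x <- c(15, -4, 11, 12, 1, 5)

s_q1 <- separatrix(x, 0.25)                          # position 1.75
s_md <- separatrix(x, 0.50)                          # position 3.5
q6   <- unname(quantile(x, probs = 0.25, type = 6))  # same rule
q7   <- unname(quantile(x, probs = 0.25))            # default, type 7
c(s_q1, s_md, q6, q7)
```

The first quartile differs between the two rules for this small data set, which is why the choice of method should be stated when quantiles are reported.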

#### Median (\(k=2\))

The *median* is the value that leaves half of the sorted data (list) to its left and the other half to its right, i.e., it is the central measure in terms of ordering. Its position is the average of the first and last positions, as in Eq. (2.16).

\[\begin{equation}
Pos = \frac{1+n}{2}
\tag{2.16}
\end{equation}\]

**Example 2.28** The median can be defined by Eq. (2.17).

\[\begin{equation} Md = x_{\left( \frac{1}{2} (n+1) \right)} \tag{2.17} \end{equation}\]

`## [1] 50`

```
## 50%
## 50
```

**Example 2.29** (Median for \(n\) odd) Let the data set be 10, -4, 11, 12, 1, 5, 15, formed by \(n=7\) values. When ordered, we obtain the list -4, 1, 5, 10, 11, 12, 15. Considering \(k=2\), we obtain the quantile \(Md=10\), as it divides the set into two parts of the same size (three values below the median 10 and three values above). Its position is given by \(Pos=\frac{1+7}{2}=4\).

`## [1] 7`

`## [1] 4`

`## [1] -4 1 5 10 11 12 15`

`## [1] 10`

**Example 2.30** (Median for \(n\) even) When the number of observations is even, just take the average of the two central values of the sorted list. Let the data set be 15, -4, 11, 12, 1, 5, formed by \(n=6\) values. When ordered, we obtain the list -4, 1, 5, 11, 12, 15. Considering again \(k=2\), we obtain the quantile \(Md=\frac{5+11}{2}=8\), because it divides the set into two parts of the same size (three values below 8 and three values above). Its position is given by \(Pos=\frac{1+6}{2}=3.5\), i.e., the median is an intermediate value between the third and fourth positions.

`## [1] 6`

`## [1] 3.5`

`## [1] -4 1 5 11 12 15`

`## [1] 8`
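The outputs of Examples 2.29 and 2.30 can be reproduced along the following lines. The sketch computes the position from Equation (2.16), picks or averages the central ordered values, and compares the result with the built-in `median`.

```r
x_odd  <- c(10, -4, 11, 12, 1, 5, 15)   # Example 2.29, n = 7
x_even <- c(15, -4, 11, 12, 1, 5)       # Example 2.30, n = 6

pos_odd  <- (1 + length(x_odd)) / 2     # Equation (2.16): 4
pos_even <- (1 + length(x_even)) / 2    # Equation (2.16): 3.5

md_odd  <- sort(x_odd)[pos_odd]                      # 4th ordered value
md_even <- mean(sort(x_even)[c(floor(pos_even),      # average of the 3rd
                               ceiling(pos_even))])  # and 4th ordered values

c(md_odd, median(x_odd))     # 10 10
c(md_even, median(x_even))   # 8 8
```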

**Example 2.31** The first and third quartiles can be defined respectively by Eq. (2.18) and (2.19).

\[\begin{equation} Q_1 = x_{\left( \frac{1}{4} (n+1) \right)} \tag{2.18} \end{equation}\]

\[\begin{equation} Q_3 = x_{\left( \frac{3}{4} (n+1) \right)} \tag{2.19} \end{equation}\]

```
## 25% 75%
## 25 75
```

A dataset can be divided into \(k\) sectors, the main ones being shown in the following table.

\(k\) | \(p\) | Name | Symbol |
---|---|---|---|
2 | 1/2 | Median | Md |
3 | 1/3, 2/3 | Tertile | \(T_1\), \(T_2\) |
4 | 1/4, 2/4, 3/4 | Quartile | \(Q_1\), \(Q_2\), \(Q_3\) |
10 | 1/10, …, 9/10 | Decile | \(D_1\), \(D_2\), \(\ldots\), \(D_9\) |
100 | 1/100, …, 99/100 | Percentile | \(P_1\), \(P_2\), \(\ldots\), \(P_{99}\) |

**Example 2.32** Some quantiles.

```
h <- read.csv('https://filipezabala.com/data/hospital.csv')
options(digits = 4) # To improve the presentation
quantile(h$height, probs = seq(0, 1, 1/2)) # Median
```

```
## 0% 50% 100%
## 1.510 1.625 1.740
```

```
## 0% 33.33333% 66.66667% 100%
## 1.51 1.61 1.65 1.74
```

```
## 0% 25% 50% 75% 100%
## 1.510 1.598 1.625 1.650 1.740
```

```
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.510 1.569 1.590 1.600 1.616 1.625 1.640 1.650 1.660 1.680 1.740
```

**Exercise 2.9** Consider the separatrices discussed in this section.

a. Check that the median (Md) and the second quartile (\(Q_2\)) are equivalent.

b. Are there other measures equivalent to those in item (a)? Justify.

c. Consider some \(k\) different from those presented and assign a name and symbology.

d. If there are \(k\) ‘slices’, how many quantiles are there?

Suggestion: Chapter **??**

**Exercise 2.10** Using the `quantile` function, calculate the separatrices discussed in this section with the data in the `children` column available at https://filipezabala.com/data/hospital.csv.

Suggestion: Chapter **??**

### 2.3.7 5-number summary

The *5-number summary* was suggested by (Tukey 1977). It comprises the minimum, the lower *hinge*, the median, the upper *hinge* and the maximum. The *lower hinge* is the median of the values between the minimum and the median of the entire set; the *upper hinge* is the median of the values between the median of the entire set and the maximum. Depending on the algorithm used to calculate the quartiles, the *hinges* may differ slightly from the first and third quartiles.
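To illustrate how hinges and quartiles can differ, the sketch below applies `stats::fivenum` and the default `stats::quantile` to the six values of Example 2.30. The choice of this data set is an assumption made for illustration.

```r
x <- c(15, -4, 11, 12, 1, 5)   # Data from Example 2.30

fn <- fivenum(x)               # min, lower hinge, median, upper hinge, max
qt <- unname(quantile(x))      # min, Q1, median, Q3, max (default type 7)

fn   # -4  1  8  12  15
qt   # -4  2  8  11.75  15
```

The extremes and the median coincide, while the hinges (1 and 12) differ from the default quartiles (2 and 11.75).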

**Example 2.33** Consider the dataset used by (Tukey 1977, 33).

`## [1] -3.2 0.1 1.5 3.0 9.8`

```
## 0% 25% 50% 75% 100%
## -3.2 0.1 1.5 3.0 9.8
```

### References

Kotz, S., and S. Nadarajah. 2000. *Extreme Value Distributions*. World Scientific. https://books.google.com.br/books/about/Extreme_Value_Distributions.html?id=ZPW3CgAAQBAJ&redir_esc=y.

Tukey, J. W. 1977. *Exploratory Data Analysis*. Addison-Wesley Publishing Company.

Yuen, K. K. 1974. “The Two-Sample Trimmed t for Unequal Population Variances.” *Biometrika* 61 (1): 165–70. https://www.jstor.org/stable/2334299.