7.1 Correlation

Correlation is a measure of the degree of linear association between two random variables. In everyday language the word has a wide range of meanings, but in statistics the term usually refers to the Pearson correlation coefficient, detailed in Section 7.1.2.

Example 7.1 The measurement called fundamental frequency, or pitch, indicates the frequency in Hz of the vibration of the vocal cords. It is widely reported in the specialized literature, including the algorithms of Schäfer-Vincent (1983) and Jackson (1995). Using the voice package (Zabala 2022) it is possible to extract these two phonetic features, compared below.

wavFiles <- list.files(system.file('extdata', package = 'wrassp'),
                       pattern = glob2rx('*.wav'), full.names = TRUE)
M <- voice::extract_features(wavFiles, features = c('f0','f0_mhs'))
M <- M[M$f0_mhs < 125 | is.na(M$f0_mhs),] # removing anomalous points
plot(M$f0, M$f0_mhs, las = 1)
txt <- paste('r =', round(cor(M$f0, M$f0_mhs, use = 'pair'), 3))
legend(55, 110, txt, bty = 'n')

Exercise 7.1

  1. Review the documentation for the anscombe object.
  2. Access the links datasauRus and https://marcusnunes.me/posts/correlacoes-e-graficos-de-dispersao/.

Exercise 7.2 Access the link about Bivariate Correlation Simulation and Bivariate Scatter Plotting and carry out simulations, varying the parameters and observing the resulting graphs.

7.1.1 \(\rho\) \(\cdot\) Universal correlation

Kotz et al. (2005, 1375) define the universal correlation of two r.v. \(X\) and \(Y\) by \[\begin{equation} \rho = Cor(X,Y) = \dfrac{Cov(X,Y)}{D(X) D(Y)}, \tag{7.1} \end{equation}\] where \[\begin{equation} Cov(X,Y) = E \left[ (X-E(X)) (Y-E(Y)) \right] = \frac{\sum_{i=1}^N (x_{i} - \mu_{X})(y_{i} - \mu_{Y})}{N} \tag{7.2} \end{equation}\] is the covariance between \(X\) and \(Y\), \(D(X)\) and \(D(Y)\) are respectively the universal standard deviations of \(X\) and \(Y\) according to Eq. (2.26), and \[\begin{equation} -1 \le \rho \le +1 \end{equation}\] \[ \Updownarrow \] \[\begin{equation} 0 \le |\rho| \le +1. \end{equation}\]
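For a small finite population, Equations (7.1) and (7.2) can be evaluated directly. A minimal sketch with hypothetical values follows; note that the result coincides with stats::cor, since the \(1/N\) versus \(1/(n-1)\) normalizing factors cancel in the ratio.

x <- c(1, 2, 4, 7)                                # hypothetical population values
y <- c(2, 3, 9, 8)
N <- length(x)
cov_u <- sum((x - mean(x)) * (y - mean(y))) / N   # Eq. (7.2), denominator N
Dx <- sqrt(sum((x - mean(x))^2) / N)              # universal standard deviation of X
Dy <- sqrt(sum((y - mean(y))^2) / N)              # universal standard deviation of Y
cov_u / (Dx * Dy)                                 # rho, Eq. (7.1)
cor(x, y)                                         # same value: the 1/N factors cancel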

If \(|\rho| = 1\), then there is an exact linear relationship of the form \(Y = \beta_0 + \beta_1 X\): if \(\rho = +1\), \(\beta_1 >0\); if \(\rho = -1\), \(\beta_1 <0\). If \(X\) is independent of \(Y\), then \(\rho = 0\), but the converse is not necessarily true.
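A quick numerical illustration of the last remark, using hypothetical values: below, \(Y\) is completely determined by \(X\), yet the correlation is (numerically) zero because the relationship is not linear.

x <- seq(-1, 1, length.out = 100)   # symmetric grid of values
y <- x^2                            # Y fully dependent on X, but not linearly
cor(x, y)                           # approximately 0: no linear association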

7.1.2 \(r\) \(\cdot\) Pearson Correlation Coefficient

The Pearson correlation coefficient was developed by Karl Pearson and William Fleetwood Sheppard “before the end of 1897”, as reported by Pearson (1920, 44). Denoted by \(r\), it can be obtained by any of the following equations.

\[\begin{equation} r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_{i} - \bar{x}}{s_{x}} \right) \left( \dfrac{y_{i} - \bar{y}}{s_{y}} \right) \tag{7.3} \end{equation}\]

\[\begin{equation} r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \tag{7.4} \end{equation}\]

\[\begin{equation} r = \frac{n \sum{x y} - \sum{x} \sum{y}}{\sqrt{[n \sum{x^2} - (\sum{x})^2] [n \sum{y^2} - (\sum{y})^2]}} \tag{7.5} \end{equation}\]

where \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}, \;\; s_{x}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i} - \bar{x}) ^2 \] \[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_{i}, \;\; s_{y}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_{i} - \bar{y}) ^2 \] \[ -1 \le r \le +1 \]

Note from Equation (7.3) that \(r\) is an average of the products of the standardized ordered pairs \((x_{i}, y_{i})\), with \(i \in \lbrace 1, 2 , \ldots, n \rbrace\). If pairs with positive products predominate, \(r\) will be positive; if pairs with negative products predominate, \(r\) will be negative. This structure is called a product moment. Equation (7.4) follows from definition (7.1), while Equation (7.5) is convenient for carrying out the calculations by hand.
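As a sanity check, the three equations can be applied to simulated (hypothetical) data and compared with stats::cor:

set.seed(42)                       # hypothetical simulated data
x <- rnorm(50)
y <- 2*x + rnorm(50)
n <- length(x)
r3 <- sum(scale(x) * scale(y)) / (n-1)                     # Eq. (7.3)
r4 <- sum((x-mean(x)) * (y-mean(y))) /
  sqrt(sum((x-mean(x))^2) * sum((y-mean(y))^2))            # Eq. (7.4)
r5 <- (n*sum(x*y) - sum(x)*sum(y)) /
  sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))  # Eq. (7.5)
c(r3, r4, r5, cor(x, y))                                   # four identical values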

Example 7.2 Consider estimating the number of beverage bottles to be chilled as a function of the maximum temperature of the day. Let \(X\): ‘maximum temperature of the day in \(^{\circ}{\rm C}\)’ and \(Y\): ‘number of bottles of drink consumed’, observed according to the following table.

\(i\)   \(x_{i}\)   \(y_{i}\)     \(i\)   \(x_{i}\)   \(y_{i}\)     \(i\)   \(x_{i}\)   \(y_{i}\)
 1    29.5    146     11    28.5    183     21    40.9    233
 2    31.3    170     12    28.0    158     22    28.6    169
 3    34.7    167     13    36.7    181     23    36.1    192
 4    40.4    244     14    31.5    123     24    27.1    106
 5    28.4    159     15    38.1    223     25    29.5    170
 6    40.3    195     16    33.5    176     26    31.6    167
 7    41.1    225     17    37.2    196     27    25.2    133
 8    36.2    206     18    41.9    238     28    31.5    138
 9    35.7    200     19    31.5    184     29    39.8    199
10    26.1    134     20    38.2    213     30    30.8    172

A scatter plot can help explore the behavior of temperature and bottles consumed.

dr <- read.table('https://filipezabala.com/data/drinks.txt', header = TRUE)
dim(dr)  # dataset dimension
## [1] 30  2
head(dr) # first part of dataset
##   temp gar
## 1 29.5 146
## 2 31.3 170
## 3 34.7 167
## 4 40.4 244
## 5 28.4 159
## 6 40.3 195
plot(dr$temp, dr$gar)

The degree of alignment of the variables can be estimated by Pearson’s correlation coefficient, first calculating \[ \sum{x} = 1010, \; \sum{x^2} = 34730, \] \[ \sum{y} = 5400, \; \sum{y^2} = 1006954, \] \[ \sum{xy} = 186117, \; n=30 \]

and then substituting into Equation (7.5), resulting in

\[\begin{split} r & = \frac{30 \times 186117 - 1010 \times 5400}{\sqrt{[30 \times 34730 - 1010^2] [30 \times 1006954 - 5400^2]}} \\ & = \frac{130056}{\sqrt{21988.49 \times 1048620}} \\ & \approx 0.8565. \end{split}\]
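Note that some of the sums displayed above are rounded (e.g. \(\sum x = 1009.9\) and \(\sum xy = 186117.2\)); the intermediate values 130056 and 21988.49 come from the exact sums, which can be checked in R:

x <- dr$temp; y <- dr$gar
c(sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y))              # exact sums
(30*sum(x*y) - sum(x)*sum(y)) /
  sqrt((30*sum(x^2) - sum(x)^2) * (30*sum(y^2) - sum(y)^2))  # Eq. (7.5)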

cor(dr$temp, dr$gar)
## [1] 0.8564926

It is possible to obtain the value of \(r\) from the stats::lm function. It calculates the coefficient of determination \(r^2\) according to Section 7.1.4, which in the SLR model (Section 7.2) is the square of \(r\). Therefore \(r=\sqrt{r^2}\), taking the sign of the estimated slope.

f <- lm(dr$gar ~ dr$temp)                   # SLR model
r2 <- summary(f)$r.squared                  # Extracting r2
sqrt(r2)                                    # r, Pearson correlation
## [1] 0.8564926
all.equal(cor(dr$temp, dr$gar), sqrt(r2))   # Comparing r via stats::cor and stats::lm
## [1] TRUE

Exercise 7.3 Calculate the value of \(r\) with the data from Example 7.2 from Equations (7.3) and (7.4). Compare with the values obtained via Eq. (7.5) and the stats::cor function.

Test for \(\rho\)

The basic test compares \(\rho\) with zero, zero indicating a complete absence of linear association between the variables. Thus, we test \(H_0: \rho = 0\) (there is no correlation) vs \(H_1: \rho \ne 0\) (there is correlation), denoted by

\[ \left\{ \begin{array}{l} H_0: \rho = 0 \\ H_1: \rho \ne 0\\ \end{array} \right. . \]

Under \(H_0\) considering the SLR model (Section 7.2), \[\begin{equation} T = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \sim t_{n-2}. \tag{7.6} \end{equation}\]

Exercise 7.4 From Eq. (7.6), show that \(T = r \sqrt{\frac{n-2}{1-r^2}}\).

Example 7.3 (Checking alignment in the SLR model) Consider again the information presented in Example 7.2. Given the model estimate \(\hat{y} = -19.11 + 5.91 x\), one can test \(H_0: \rho = 0\) vs \(H_1: \rho \ne 0\). In this case \(df=30-2=28\), \(T \sim t_{28}\) and under \(H_0\) \[ T = 0.8565 \sqrt{\frac{30-2}{1-0.8565^2}} \approx 8.78. \]

n <- 30                           # sample size
gl <- n-2                         # degrees of freedom
(r <- cor(dr$temp, dr$gar))       # Pearson correlation coefficient
## [1] 0.8564926
(Tt <- r*sqrt(gl/(1-r^2)))        # test statistic, Eq. (7.6)
## [1] 8.780493
(p_value <- 2*pt(-abs(Tt), gl))   # two-sided p-value
## [1] 1.565265e-09
cor.test(dr$temp, dr$gar)         # Function that performs the hypothesis test
## 
##  Pearson's product-moment correlation
## 
## data:  dr$temp and dr$gar
## t = 8.7805, df = 28, p-value = 1.565e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7176748 0.9298423
## sample estimates:
##       cor 
## 0.8564926
all.equal(r, sqrt(r2))            # Comparing r (calculated) and sqrt(r2) (via stats::lm)
## [1] TRUE

CI for \(\rho\)

Step 1: Obtain Fisher’s z transformation (Fisher 1921, 210).

\[\begin{equation} z_r = \frac{1}{2} \ln \left( \frac{1+r}{1-r} \right) = \tanh^{-1}(r) \tag{7.7} \end{equation}\]

Step 2: Get the transformed lower and upper bounds with the desired confidence \(1-\alpha\).

\[\begin{equation} L = z_r - \frac{z_{1-\alpha/2}}{\sqrt{n-3}} \tag{7.8} \end{equation}\]

\[\begin{equation} U = z_r + \frac{z_{1-\alpha/2}}{\sqrt{n-3}} \tag{7.9} \end{equation}\]

Step 3: Obtain the \(1-\alpha\) confidence interval for \(\rho\).

\[\begin{equation} CI[\rho, 1-\alpha] = \left[ \frac{e^{2L}-1}{e^{2L}+1}, \frac{e^{2U}-1}{e^{2U}+1} \right] = \left[ \tanh L, \tanh U \right] \tag{7.10} \end{equation}\]

Example 7.4 Considering the data from Example 7.3, the 95% CI for the correlation can be obtained.

# Step 1
(zr <- atanh(r))
## [1] 1.280029
# Step 2
(ep <- 1.96/sqrt(n-3))   # margin on the z scale; 1.96 rounds qnorm(.975)
## [1] 0.3772022
(L <- zr-ep); (U <- zr+ep)
## [1] 0.9028267
## [1] 1.657231
# Step 3
tanh(L); tanh(U)
## [1] 0.7176715
## [1] 0.9298433
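The bounds are close to, but not exactly, those reported by stats::cor.test in Example 7.3; the small discrepancy arises because 1.96 rounds the exact normal quantile. A quick check, reusing zr and n from above:

qnorm(.975)                                 # 1.959964, rounded above to 1.96
tanh(zr + c(-1, 1)*qnorm(.975)/sqrt(n-3))   # matches the CI from stats::cor.test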

Exercise 7.5

  1. Verify the identities in Equations (7.7) and (7.10). Apply the alternative forms to the data in Example 7.4 and compare with the output of the stats::cor.test function.
  2. Get \(CI[ \rho, 80\% ]\).
  3. Write a function that returns the CI for \(\rho\) given two vectors and an arbitrary confidence level.

7.1.3 \(\rho_{RTO}\) and \(r_{RTO}\) \(\cdot\) Correlation in RTO

There is a special case of correlation calculation called Regression Through the Origin (RTO), based on the model given by Eq. (7.17). In this case, \(r_{RTO}\) is calculated using the expression

\[\begin{equation} r_{RTO} = \sqrt{\dfrac{\sum \hat{y}_{i}^2}{\sum y_{i}^2}}, \tag{7.11} \end{equation}\]

such that \(\hat{y}_{i} = \hat{\beta}_1^{RTO} x_i\).
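Assuming the usual least-squares slope through the origin, \(\hat{\beta}_1^{RTO} = \sum x_i y_i / \sum x_i^2\) (cf. Eq. (7.17)), Eq. (7.11) can be evaluated from scratch; a minimal sketch using the data from Example 7.2:

x <- dr$temp; y <- dr$gar
(b1 <- sum(x*y)/sum(x^2))        # least-squares slope through the origin, approx. 5.36
yhat_rto <- b1*x                 # fitted values
sqrt(sum(yhat_rto^2)/sum(y^2))   # r_RTO by Eq. (7.11)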

Example 7.5 (Correlation in RTO) Consider the information from Example 7.2. One can calculate \[ r_{RTO} = \sqrt{\dfrac{997410}{1006954}} = \sqrt{0.9905} \approx 0.9952. \]

f <- lm(gar ~ temp - 1, data = dr)   # RTO model (no intercept)
yhat <- sum(f$fitted.values^2)       # sum of squared fitted values
y2 <- sum(dr$gar^2)                  # sum of squared observations
yhat/y2                              # r_RTO^2, cf. Eq. (7.11)
## [1] 0.9905222

It is possible to obtain the value of \(r_{RTO}\) from the stats::lm function. It calculates the coefficient of determination \(r_{RTO}^2\), which in the RTO model is the square of \(r_{RTO}\). Therefore \(r_{RTO}=\sqrt{r_{RTO}^2}\).

f_rto <- lm(dr$gar ~ dr$temp - 1)         # RTO model
r2_rto <- summary(f_rto)$r.squared        # Function to get r2_RTO
sqrt(r2_rto)                              # r_RTO
## [1] 0.9952498
all.equal(sqrt(yhat/y2), sqrt(r2_rto))    # Comparing r_RTO (calculated) and sqrt(r2_rto) (via stats::lm)
## [1] TRUE

Exercise 7.6 Consider again the information presented in Example 7.2. a. From Example 7.9 it is known that \(\hat{y}_{i} = 5.36 x_i\), with \(x_i \in \lbrace 29.5, 31.3, \ldots, 30.8 \rbrace\). Show that \(\sum \hat{y}_{i}^2 = 997410\).
b. Inspect the f$fitted.values object from Example 7.5.

Test for \(\rho_{RTO}\)

If we consider the RTO model according to Eq. (7.17), we can test the hypotheses

\[ \left\{ \begin{array}{l} H_0: \rho_{RTO} = 0 \\ H_1: \rho_{RTO} \ne 0\\ \end{array} \right. . \]

Under \(H_0\), the test statistic is \[\begin{equation} T_{RTO} = \frac{r_{RTO}}{\sqrt{\frac{1-r_{RTO}^2}{n-1}}} \sim t_{n-1}. \tag{7.12} \end{equation}\]

Example 7.6 (Checking alignment in the RTO model) Consider again the information presented in Example 7.2. Given the RTO model estimate \(\hat{y} = 5.36 x\), one can test

\[ \left\{ \begin{array}{l} H_0: \rho_{RTO} = 0 \\ H_1: \rho_{RTO} \ne 0\\ \end{array} \right. \]

In this case \(df=30-1=29\), \(T \sim t_{29}\) and under \(H_0\) \[ T = 0.9952 \sqrt{\frac{30-1}{1-0.9952^2}} \approx 55.05. \]

n <- 30                                     # sample size
gl_rto <- n-1                               # degrees of freedom in the RTO model
(r_rto <- sqrt(yhat/y2))                    # r_RTO, Eq. (7.11)
## [1] 0.9952498
(Tt_rto <- r_rto*sqrt(gl_rto/(1-r_rto^2)))  # test statistic, Eq. (7.12)
## [1] 55.05267
(p_value <- 2*pt(-abs(Tt_rto), gl_rto))     # two-sided p-value
## [1] 6.778938e-31

CI for \(\rho_{RTO}\)

7.1.4 \(r^2\) \(\cdot\) Coefficient of determination

References

Fisher, Ronald A. 1921. “On the ‘Probable Error’ of a Coefficient of Correlation Deduced from a Small Sample.” https://www.jstor.org/stable/2331802.
Jackson, Jeffrey C. 1995. “The Harmonic Sieve: A Novel Application of Fourier Analysis to Machine Learning Theory and Practice.” Carnegie Mellon University, Pittsburgh, PA, School of Computer Science. https://apps.dtic.mil/sti/pdfs/ADA303368.pdf.
Kotz, Samuel, Narayanaswamy Balakrishnan, Campbell B. Read, and Brani Vidakovic. 2005. Encyclopedia of Statistical Sciences, Volume 2. John Wiley & Sons.
Pearson, Karl. 1920. “Notes on the History of Correlation.” Biometrika 13 (1): 25–45. https://www.jstor.org/stable/2331722.
Schäfer-Vincent, Kurt. 1983. “Pitch Period Detection and Chaining: Method and Evaluation.” Phonetica 40 (3): 177–202. https://doi.org/10.1159/000261691.
Zabala, Filipe J. 2022. voice: Tools for Voice Analysis, Speaker Recognition and Mood Inference. https://CRAN.R-project.org/package=voice.