9.2 Canonical Correlation Analysis

[T]he problem of finding a linear function of the criterion variates which can most accurately be predicted from given observations, in the sense of least squares, admits a definite solution, which we shall set forth. (Hotelling 1935, 1)

(Hotelling 1935) laid the foundations of canonical correlation analysis, although (Jordan 1875) had already studied the concept of canonical or principal angles, i.e., angles between lines and planes in space.

If there are \(p\) variates which may be called “predicters”, such as scores in college entrance examinations of various kinds, and \(q\) others which will be called the “criteria”, such as college grades, of which some function to be chosen is to be predicted, let us denote these variates by \(x_i\) and \(x_{\alpha}\) respectively. (Hotelling 1935, 1)

For simplicity, let \(X \equiv x_i\) and \(Y \equiv x_{\alpha}\). We seek the linear combinations

\[\begin{equation} U = a^{T}X \tag{9.2} \end{equation}\]

\[\begin{equation} V = b^{T}Y \tag{9.3} \end{equation}\]

such that the correlation between \(U\) and \(V\) is maximal; this pair is called the first pair of canonical variables. Maximizing the correlation again, now subject to the new pair being uncorrelated with \(U\) and \(V\), yields the second pair of canonical variables.
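The maximization can be written out explicitly. The sketch below uses notation that is assumed here and does not come from Hotelling's text: \(\Sigma_{XX}\) and \(\Sigma_{YY}\) are the covariance matrices of \(X\) and \(Y\) (both assumed invertible), \(\Sigma_{XY}\) is the cross-covariance matrix, and \(\Sigma_{YX} = \Sigma_{XY}^{T}\). The first canonical correlation is

\[\rho_1 = \max_{a,\,b} \frac{a^{T}\Sigma_{XY}\,b}{\sqrt{a^{T}\Sigma_{XX}\,a}\,\sqrt{b^{T}\Sigma_{YY}\,b}}\]

The ratio does not change when \(a\) or \(b\) is rescaled, so one may impose \(a^{T}\Sigma_{XX}\,a = b^{T}\Sigma_{YY}\,b = 1\). Under these constraints the stationarity conditions reduce to an eigenvalue problem:

\[\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\,a = \rho^{2}\,a, \qquad b \propto \Sigma_{YY}^{-1}\Sigma_{YX}\,a\]

In this formulation the squared canonical correlations are the eigenvalues of \(\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\), and at most \(\min(p, q)\) of them are nonzero.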

9.2.1 `stats::cancor`

The canonical correlation analysis seeks linear combinations of the y variables which are well explained by linear combinations of the x variables. The relationship is symmetric as ‘well explained’ is measured by correlations. (Documentation of the function `stats::cancor`.)
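The symmetry mentioned in the documentation can be checked by swapping the roles of the two sets. The following is a minimal sketch, assuming the built-in iris data with the sepal and petal measurements as the two sets; the object names `fit_xy` and `fit_yx` are illustrative only.

```r
# Two sets of variables from the built-in iris data (illustrative choice)
X <- iris[, 1:2]  # sepal measurements
Y <- iris[, 3:4]  # petal measurements

# Fit in both directions
fit_xy <- stats::cancor(X, Y)
fit_yx <- stats::cancor(Y, X)

# The canonical correlations should agree regardless of the order
all.equal(fit_xy$cor, fit_yx$cor)
```

Only the canonical correlations are compared here; the coefficient matrices swap roles and may differ in sign.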

Example 9.3 Using the iris data, we split the variables into sepal and petal measurements.

## signs of results are random
X <- iris[, 1:2] # sepal
Y <- iris[, 3:4] # petal
(cc_fit <- stats::cancor(X, Y))

## $cor
## [1] 0.9409690 0.1239369
## 
## $xcoef
##                     [,1]       [,2]
## Sepal.Length -0.08757435 0.04749411
## Sepal.Width   0.07004363 0.17582970
## 
## $ycoef
##                     [,1]       [,2]
## Petal.Length -0.06956302 -0.1571867
## Petal.Width   0.05683849  0.3940121
## 
## $xcenter
## Sepal.Length  Sepal.Width 
##     5.843333     3.057333 
## 
## $ycenter
## Petal.Length  Petal.Width 
##     3.758000     1.199333

colMeans(iris[-5])

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
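The canonical correlations can also be obtained without `cancor`, through the eigenvalues of the matrix \(S_{XX}^{-1}S_{XY}S_{YY}^{-1}S_{YX}\) built from sample covariances. The sketch below is a cross-check under that standard formulation, not a description of how `stats::cancor` computes its result; the names `Sxx`, `Syy`, `Sxy`, `M`, `rho_manual` and `rho_cancor` are introduced only for illustration.

```r
X <- as.matrix(iris[, 1:2])  # sepal
Y <- as.matrix(iris[, 3:4])  # petal

# Sample covariance blocks
Sxx <- cov(X)
Syy <- cov(Y)
Sxy <- cov(X, Y)

# M = Sxx^{-1} Sxy Syy^{-1} Syx; its eigenvalues are the squared
# canonical correlations
M <- solve(Sxx, Sxy) %*% solve(Syy, t(Sxy))
rho_manual <- sqrt(Re(eigen(M)$values))

# Compare with the values reported by stats::cancor
rho_cancor <- stats::cancor(X, Y)$cor
all.equal(rho_manual, rho_cancor)
```

The matrix `M` is not symmetric, so `Re()` is applied as a precaution in case `eigen` returns a complex type.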

Computing the first and second pairs of canonical variables.

# first pair of canonical variables
# (X and Y are not centered here; correlations are unaffected by shifts)
CC1_X <- as.matrix(X) %*% cc_fit$xcoef[, 1]
CC1_Y <- as.matrix(Y) %*% cc_fit$ycoef[, 1]
# second pair of canonical variables
CC2_X <- as.matrix(X) %*% cc_fit$xcoef[, 2]
CC2_Y <- as.matrix(Y) %*% cc_fit$ycoef[, 2]
# checking the correlations
cor(CC1_X, CC1_Y)

##          [,1]
## [1,] 0.940969

cor(CC2_X, CC2_Y)

##           [,1]
## [1,] 0.1239369

# plots
plot(CC1_X, CC1_Y, col = iris$Species)

plot(CC2_X, CC2_Y, col = iris$Species)
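The requirement that the second pair be uncorrelated with the first can be checked numerically. The sketch below centers the data with `xcenter` and `ycenter` before applying the coefficients; the names `U` and `V` are illustrative. It also relies on the assumption, based on the QR-based computation in `cancor`, that the centered scores have unit sum of squares.

```r
X <- as.matrix(iris[, 1:2])  # sepal
Y <- as.matrix(iris[, 3:4])  # petal
cc_fit <- stats::cancor(X, Y)

# Centered canonical variates: one column per canonical variable
U <- scale(X, center = cc_fit$xcenter, scale = FALSE) %*% cc_fit$xcoef
V <- scale(Y, center = cc_fit$ycenter, scale = FALSE) %*% cc_fit$ycoef

# Within each set the variates should be mutually uncorrelated;
# crossprod(U) is expected to be close to the identity matrix
round(crossprod(U), 10)

# Across sets, only matching pairs should be correlated: the diagonal
# holds the canonical correlations, the off-diagonal is close to zero
round(cor(U, V), 7)
```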

Example 9.4 Using the LifeCycleSavings data discussed by (Belsley, Kuh, and Welsch 2004, 39) and in Example 10.12, we run a canonical correlation analysis between personal variables (sr: ‘personal savings’, dpi: ‘per-capita disposable income’, and ddpi: ‘growth rate of per-capita disposable income’) and demographic variables (pop15: ‘percentage of the population under 15’ and pop75: ‘percentage of the population over 75’).

## signs of results are random
X <- LifeCycleSavings[, 2:3] # demographic
Y <- LifeCycleSavings[, -(2:3)] # personal
(cc_fit <- stats::cancor(X, Y))

## $cor
## [1] 0.8247966 0.3652762
## 
## $xcoef
##               [,1]        [,2]
## pop15 -0.009110856 -0.03622206
## pop75  0.048647514 -0.26031158
## 
## $ycoef
##              [,1]          [,2]          [,3]
## sr   0.0084710221  3.337936e-02 -5.157130e-03
## dpi  0.0001307398 -7.588232e-05  4.543705e-06
## ddpi 0.0041706000 -1.226790e-02  5.188324e-02
## 
## $xcenter
##   pop15   pop75 
## 35.0896  2.2930 
## 
## $ycenter
##        sr       dpi      ddpi 
##    9.6710 1106.7584    3.7576

colMeans(LifeCycleSavings)

##        sr     pop15     pop75       dpi      ddpi 
##    9.6710   35.0896    2.2930 1106.7584    3.7576

Computing the first and second pairs of canonical variables.

# first pair of canonical variables
# (X and Y are not centered here; correlations are unaffected by shifts)
CC1_X <- as.matrix(X) %*% cc_fit$xcoef[, 1]
CC1_Y <- as.matrix(Y) %*% cc_fit$ycoef[, 1]
# second pair of canonical variables
CC2_X <- as.matrix(X) %*% cc_fit$xcoef[, 2]
CC2_Y <- as.matrix(Y) %*% cc_fit$ycoef[, 2]
# checking the correlations
cor(CC1_X, CC1_Y)

##           [,1]
## [1,] 0.8247966

cor(CC2_X, CC2_Y)

##           [,1]
## [1,] 0.3652762

# plots
plot(CC1_X, CC1_Y)

plot(CC2_X, CC2_Y)
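The coefficient for dpi in the output above is orders of magnitude smaller than the others, which mostly reflects its units (its mean is about 1107, against about 3.8 for ddpi). Canonical correlations do not depend on the scale of the variables, so standardizing each column changes the coefficients but not the correlations. A minimal sketch of this check follows; the names `fit_raw` and `fit_std` are illustrative.

```r
X <- LifeCycleSavings[, 2:3]     # demographic
Y <- LifeCycleSavings[, -(2:3)]  # personal

# Fit on the raw data and on standardized columns (mean 0, sd 1)
fit_raw <- stats::cancor(X, Y)
fit_std <- stats::cancor(scale(X), scale(Y))

# The canonical correlations should be the same in both fits
all.equal(fit_raw$cor, fit_std$cor)

# Coefficients on the standardized scale are easier to compare
fit_std$ycoef
```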

9.2.2 `CCA::cc`

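The CCA package (González and Déjean 2021) provides the function `cc`, which returns the canonical variates (scores) alongside the coefficients.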

library(CCA)

X <- iris[, 1:2] # sepal
Y <- iris[, 3:4] # petal
cc_fit <- CCA::cc(X, Y)
# checking the correlations
cor(cc_fit$scores$xscores[, 1],
    cc_fit$scores$yscores[, 1])

## [1] 0.940969

# plots
plot(cc_fit$scores$xscores[, 1],
     cc_fit$scores$yscores[, 1], col = iris$Species)

CCA::plt.cc(cc_fit)

9.2.3 `CCP`

(Menzel 2022) provides a set of functions for carrying out significance tests in canonical correlation analysis.

## Load the CCP package:
library(CCP)

## Simulate example data:
X <- matrix(rnorm(150), 50, 3)
Y <- matrix(rnorm(250), 50, 5)

## Calculate canonical correlations:
rho <- cancor(X,Y)$cor

## Define number of observations (N),
## and number of independent (p) and dependent (q) variables:
N <- dim(X)[1]
p <- dim(X)[2]
q <- dim(Y)[2]

## Calculate p-values using F-approximations of some test statistics:
p.asym(rho, N, p, q, tstat = "Wilks")

## Wilks' Lambda, using F-approximation (Rao's F):
##               stat    approx df1      df2   p.value
## 1 to 3:  0.6467238 1.3265272  15 116.3449 0.1974814
## 2 to 3:  0.9148378 0.4892261   8  86.0000 0.8607918
## 3 to 3:  0.9918161 0.1210210   3  44.0000 0.9472451

p.asym(rho, N, p, q, tstat = "Hotelling")

##  Hotelling-Lawley Trace, using F-approximation:
##                 stat    approx df1 df2   p.value
## 1 to 3:  0.506968370 1.3744476  15 122 0.1704208
## 2 to 3:  0.092395599 0.4927765   8 128 0.8595588
## 3 to 3:  0.008251433 0.1228547   3 134 0.9464832

p.asym(rho, N, p, q, tstat = "Pillai")

##  Pillai-Bartlett Trace, using F-approximation:
##                 stat    approx df1 df2   p.value
## 1 to 3:  0.378870129 1.2719923  15 132 0.2286221
## 2 to 3:  0.085797351 0.5078591   8 138 0.8488639
## 3 to 3:  0.008183905 0.1313007   3 144 0.9413299

p.asym(rho, N, p, q, tstat = "Roy")

##  Roy's Largest Root, using F-approximation:
##               stat  approx df1 df2     p.value
## 1 to 1:  0.2930728 3.64824   5  44 0.007566995
## 
##  F statistic for Roy's Greatest Root is an upper bound.
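The `stat` columns above are simple functions of the squared canonical correlations. The sketch below reproduces them in base R using the usual definitions: Wilks' Lambda as a product of \(1 - \rho_i^2\), the Hotelling-Lawley trace as a sum of \(\rho_i^2 / (1 - \rho_i^2)\), and the Pillai-Bartlett trace as a sum of \(\rho_i^2\). Because the simulated `rho` was not printed, the squared correlations are recovered here from the rounded output, which is an assumption of this sketch and limits agreement to roughly six or seven decimal places. The helpers `tail_sum` and `tail_prod` are hypothetical names.

```r
# Squared canonical correlations recovered from the printed output
# (assumption: the original rho was not shown, so these are rounded)
rho2 <- c(0.2930728,                  # Roy's statistic = rho_1^2
          0.085797351 - 0.008183905,  # Pillai "2 to 3" minus "3 to 3"
          0.008183905)                # Pillai "3 to 3" = rho_3^2

# Row "k to 3" uses the correlations rho_k, ..., rho_3
tail_sum  <- function(x) rev(cumsum(rev(x)))
tail_prod <- function(x) rev(cumprod(rev(x)))

wilks     <- tail_prod(1 - rho2)          # Wilks' Lambda
hotelling <- tail_sum(rho2 / (1 - rho2))  # Hotelling-Lawley trace
pillai    <- tail_sum(rho2)               # Pillai-Bartlett trace

# One row per test ("1 to 3", "2 to 3", "3 to 3")
round(cbind(wilks, hotelling, pillai), 6)
```

Only the test statistics are reproduced; the F-approximations and p-values are left to `p.asym`.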

## Plot the F-approximation for Wilks' Lambda, 
## considering 3, 2, or 1 canonical correlation(s):
res1 <- p.asym(rho, N, p, q)

## Wilks' Lambda, using F-approximation (Rao's F):
##               stat    approx df1      df2   p.value
## 1 to 3:  0.6467238 1.3265272  15 116.3449 0.1974814
## 2 to 3:  0.9148378 0.4892261   8  86.0000 0.8607918
## 3 to 3:  0.9918161 0.1210210   3  44.0000 0.9472451

plt.asym(res1,rhostart=1)

plt.asym(res1,rhostart=2)

plt.asym(res1,rhostart=3)

References

Belsley, David A, Edwin Kuh, and Roy E Welsch. 2004. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons.

González, Ignacio, and Sébastien Déjean. 2021. CCA: Canonical Correlation Analysis. https://CRAN.R-project.org/package=CCA.

Hotelling, Harold. 1935. “The Most Predictable Criterion.” Journal of Educational Psychology 26 (2): 139. https://psycnet.apa.org/doi/10.1037/h0058165.

Jordan, Camille. 1875. “Essai Sur La géométrie à \(n\) Dimensions.” Bulletin de La Société Mathématique de France 3: 103–74. https://doi.org/10.24033/bsmf.90.

Menzel, Uwe. 2022. CCP: Significance Tests for Canonical Correlation Analysis (CCA). https://CRAN.R-project.org/package=CCP.