9.5 Modelos Lineares
A classe de Modelos Lineares atende a uma ampla gama de problemas aplicados, apresentada em profundidade por (Neter et al. 2005). Este modelos são casos particulares das classes de Modelos Lineares Generalizados (MLG/GLM) (McCullagh and Nelder 1989), Modelos Aditivos Generalizados (MAG/GAM) (Hastie and Tibshirani 1986), dentre outras. (J. H. Friedman and Stuetzle 1981) sugerem a Regressão com Perseguição da Projeção, procedimento que modela a superfície de regressão como uma soma de funções de suavização de combinações lineares das variáveis preditoras de forma iterativa. (O’Hagan and Forster 1994, 244–76) e (Paulino, Turkman, and Murteira 2003, 194–208) fornecem uma abordagem bayesiana.
9.5.1 Clássico
Exemplo 9.8 (Belsley, Kuh, and Welsch 2004, 39) analisam a hipótese desenvolvida por Franco Modigliani (1975), de que a a razão de poupança ao longo da vida (poupança pessoal agregada dividido pelo rendimento disponível) é explicada pelo rendimento disponível per capita, a taxa de variação percentual do rendimento disponível per capita e duas variáveis demográficas: a proporção da população com menos de 15 anos e a proporção da população com mais de 75 anos. Os dados são calculados ao longo da década de 1960-1970 para remover o ciclo de negócios ou outras flutuações de curto prazo.
##
## Shapiro-Wilk normality test
##
## data: LifeCycleSavings$sr
## W = 0.98074, p-value = 0.5836
##
## Call:
## lm(formula = sr ~ ., data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2422 -2.6857 -0.2488 2.4280 9.7509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5660865 7.3545161 3.884 0.000334 ***
## pop15 -0.4611931 0.1446422 -3.189 0.002603 **
## pop75 -1.6914977 1.0835989 -1.561 0.125530
## dpi -0.0003369 0.0009311 -0.362 0.719173
## ddpi 0.4096949 0.1961971 2.088 0.042471 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797
## F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2539 -2.6159 -0.3913 2.3344 9.7070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.1247 7.1838 3.915 0.000297 ***
## pop15 -0.4518 0.1409 -3.206 0.002452 **
## pop75 -1.8354 0.9984 -1.838 0.072473 .
## ddpi 0.4278 0.1879 2.277 0.027478 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.767 on 46 degrees of freedom
## Multiple R-squared: 0.3365, Adjusted R-squared: 0.2933
## F-statistic: 7.778 on 3 and 46 DF, p-value: 0.0002646
##
## Call:
## omcdiag(mod = mod, Inter = TRUE, detr = detr, red = red, conf = conf,
## theil = theil, cn = cn)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.1739 0
## Farrar Chi-Square: 82.4971 1
## Red Indicator: 0.5254 1
## Sum of Lambda Inverse: 12.4857 0
## Theil's Method: 0.9827 1
## Condition Number: 31.7169 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
## pop15 pop75 ddpi
## 5.745478 5.736014 1.004186
Exercício 9.1 Considere o Exemplo 9.8.
a. A partir da saída de car::vif
, você recomenda retirar alguma variável do modelo? Em caso afirmativo, execute sua recomendação.
b. Teste jtools::summ(fm1)
.
Exemplo 9.9 Considere o banco de dados disponível em https://archive.ics.uci.edu/dataset/162/forest+fires baseado em (Cortez and Morais 2007).
X
- x-axis spatial coordinate within the Montesinho park map: 1 to 9Y
- y-axis spatial coordinate within the Montesinho park map: 2 to 9month
- month of the year: ‘jan’ to ‘dec’day
- day of the week: ‘mon’ to ‘sun’FFMC
- FFMC index from the FWI system: 18.7 to 96.20DMC
- DMC index from the FWI system: 1.1 to 291.3DC
- DC index from the FWI system: 7.9 to 860.6ISI
- ISI index from the FWI system: 0.0 to 56.10temp
- temperature in Celsius degrees: 2.2 to 33.30RH
- relative humidity in %: 15.0 to 100wind
- wind speed in km/h: 0.40 to 9.40rain
- outside rain in mm/m2 : 0.0 to 6.4area
- the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).
## X Y month day FFMC DMC DC ISI temp RH wind rain area
## 1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0
## 2 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0
## 3 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0
## 4 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0
## 5 8 6 mar sun 89.3 51.3 102.2 9.6 11.4 99 1.8 0.0 0
## 6 8 6 aug sun 92.3 85.3 488.0 14.7 22.2 29 5.4 0.0 0
##
## 1 2 3 4 5 6 7 8 9
## 48 73 55 91 30 86 60 61 13
## 'data.frame': 517 obs. of 13 variables:
## $ X : Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 7 7 7 8 8 8 8 8 8 7 ...
## $ Y : Ord.factor w/ 7 levels "2"<"3"<"4"<"5"<..: 4 3 3 5 5 5 5 5 5 4 ...
## $ month: chr "mar" "oct" "oct" "mar" ...
## $ day : chr "fri" "tue" "sat" "fri" ...
## $ FFMC : num 86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
## $ DMC : num 26.2 35.4 43.7 33.3 51.3 ...
## $ DC : num 94.3 669.1 686.9 77.5 102.2 ...
## $ ISI : num 5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
## $ temp : num 8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
## $ RH : int 51 33 33 97 99 29 27 86 63 40 ...
## $ wind : num 6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
## $ rain : num 0 0 0 0.2 0 0 0 0 0 0 ...
## $ area : num 0 0 0 0 0 0 0 0 0 0 ...
##
## Call:
## lm(formula = area ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.15 -16.76 -5.70 4.57 1030.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.331145 77.828817 -0.004 0.99661
## X.L 23.833501 16.537260 1.441 0.15018
## X.Q 10.356880 16.359638 0.633 0.52699
## X.C 2.497455 13.957082 0.179 0.85806
## X^4 5.218877 12.455178 0.419 0.67540
## X^5 3.402785 10.728082 0.317 0.75124
## X^6 1.491091 10.510319 0.142 0.88724
## X^7 -4.756939 8.452755 -0.563 0.57386
## X^8 -15.109426 10.502914 -1.439 0.15092
## Y.L 36.793833 34.065026 1.080 0.28064
## Y.Q -13.635847 21.961665 -0.621 0.53497
## Y.C -85.495641 32.367580 -2.641 0.00853 **
## Y^4 -88.552976 38.881540 -2.278 0.02320 *
## Y^5 -74.363445 29.330619 -2.535 0.01155 *
## Y^6 -28.773038 14.662296 -1.962 0.05030 .
## monthaug 45.291708 39.079683 1.159 0.24705
## monthdec 47.250120 37.919912 1.246 0.21336
## monthfeb 10.241386 26.388330 0.388 0.69811
## monthjan 19.196597 56.733553 0.338 0.73524
## monthjul 32.034695 33.938181 0.944 0.34569
## monthjun 4.107370 31.059969 0.132 0.89485
## monthmar -2.043844 23.713037 -0.086 0.93135
## monthmay 12.867123 51.161618 0.251 0.80154
## monthnov 0.142191 68.832738 0.002 0.99835
## monthoct 73.076570 46.567582 1.569 0.11725
## monthsep 74.178671 43.456313 1.707 0.08848 .
## daymon 5.777937 10.629021 0.544 0.58697
## daysat 18.616813 10.195896 1.826 0.06849 .
## daysun 6.508460 9.884849 0.658 0.51058
## daythu 8.690794 11.201853 0.776 0.43823
## daytue 5.888997 10.973460 0.537 0.59175
## daywed 1.528321 11.609636 0.132 0.89532
## FFMC -0.009119 0.774893 -0.012 0.99062
## DMC 0.191413 0.088011 2.175 0.03013 *
## DC -0.125135 0.059617 -2.099 0.03634 *
## ISI -0.340831 0.845469 -0.403 0.68704
## temp 1.344681 1.073510 1.253 0.21096
## RH -0.092294 0.295425 -0.312 0.75487
## wind 2.123814 1.792910 1.185 0.23678
## rain -1.245041 10.094595 -0.123 0.90189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.92 on 477 degrees of freedom
## Multiple R-squared: 0.06782, Adjusted R-squared: -0.008398
## F-statistic: 0.8898 on 39 and 477 DF, p-value: 0.6627
##
## Call:
## lm(formula = area ~ temp, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.34 -14.68 -10.39 -3.42 1071.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.4138 9.4996 -0.780 0.4355
## temp 1.0726 0.4808 2.231 0.0261 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.41 on 515 degrees of freedom
## Multiple R-squared: 0.009573, Adjusted R-squared: 0.00765
## F-statistic: 4.978 on 1 and 515 DF, p-value: 0.0261
# modelando log area
fit2 <- lm(larea ~ ., data = dat[,-13]) # tirando a coluna 13, area
summary(fit2)
##
## Call:
## lm(formula = larea ~ ., data = dat[, -13])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1061 -1.0096 -0.4232 0.7837 5.2105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.219707 1.648546 0.133 0.894034
## X.L 0.979328 0.350287 2.796 0.005386 **
## X.Q 1.232741 0.346525 3.557 0.000412 ***
## X.C 0.457943 0.295635 1.549 0.122041
## X^4 0.426364 0.263822 1.616 0.106733
## X^5 0.379344 0.227239 1.669 0.095702 .
## X^6 -0.143288 0.222626 -0.644 0.520128
## X^7 0.239858 0.179044 1.340 0.180993
## X^8 -0.565096 0.222469 -2.540 0.011398 *
## Y.L 0.273759 0.721555 0.379 0.704559
## Y.Q -1.529702 0.465185 -3.288 0.001082 **
## Y.C -2.297495 0.685600 -3.351 0.000869 ***
## Y^4 -2.833179 0.823577 -3.440 0.000633 ***
## Y^5 -1.922948 0.621272 -3.095 0.002083 **
## Y^6 -0.763359 0.310572 -2.458 0.014329 *
## monthaug 0.615221 0.827774 0.743 0.457712
## monthdec 2.087655 0.803208 2.599 0.009635 **
## monthfeb 0.356269 0.558949 0.637 0.524177
## monthjan -0.479557 1.201713 -0.399 0.690026
## monthjul 0.416163 0.718868 0.579 0.562920
## monthjun -0.314739 0.657903 -0.478 0.632587
## monthmar -0.261508 0.502282 -0.521 0.602859
## monthmay 0.782153 1.083690 0.722 0.470802
## monthnov -1.158103 1.457994 -0.794 0.427408
## monthoct 1.210681 0.986380 1.227 0.220278
## monthsep 1.319435 0.920478 1.433 0.152392
## daymon 0.058899 0.225141 0.262 0.793734
## daysat 0.237040 0.215966 1.098 0.272942
## daysun 0.209163 0.209378 0.999 0.318314
## daythu 0.035668 0.237274 0.150 0.880572
## daytue 0.274123 0.232436 1.179 0.238849
## daywed 0.037429 0.245912 0.152 0.879088
## FFMC 0.008562 0.016414 0.522 0.602157
## DMC 0.004477 0.001864 2.401 0.016714 *
## DC -0.002374 0.001263 -1.880 0.060758 .
## ISI -0.006720 0.017908 -0.375 0.707642
## temp 0.014326 0.022739 0.630 0.528983
## RH -0.002568 0.006258 -0.410 0.681746
## wind 0.060259 0.037977 1.587 0.113239
## rain 0.121982 0.213821 0.570 0.568615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.354 on 477 degrees of freedom
## Multiple R-squared: 0.1334, Adjusted R-squared: 0.06256
## F-statistic: 1.883 on 39 and 477 DF, p-value: 0.00134
##
## Call:
## lm(formula = larea ~ X + Y + month + DMC + DC + temp + wind,
## data = dat[, -13])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1879 -0.9764 -0.4079 0.7768 5.3173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.780632 0.575046 1.358 0.175247
## X.L 0.984402 0.345491 2.849 0.004567 **
## X.Q 1.229688 0.339002 3.627 0.000317 ***
## X.C 0.453981 0.291630 1.557 0.120191
## X^4 0.406301 0.259677 1.565 0.118317
## X^5 0.377971 0.224246 1.686 0.092529 .
## X^6 -0.116953 0.219999 -0.532 0.595241
## X^7 0.233594 0.174057 1.342 0.180203
## X^8 -0.564175 0.218151 -2.586 0.009994 **
## Y.L 0.188704 0.713436 0.264 0.791507
## Y.Q -1.551066 0.453561 -3.420 0.000679 ***
## Y.C -2.270578 0.674401 -3.367 0.000821 ***
## Y^4 -2.771889 0.808894 -3.427 0.000663 ***
## Y^5 -1.888967 0.609695 -3.098 0.002059 **
## Y^6 -0.753255 0.304775 -2.472 0.013795 *
## monthaug 0.643724 0.788255 0.817 0.414531
## monthdec 2.213019 0.773107 2.863 0.004384 **
## monthfeb 0.365346 0.551503 0.662 0.507993
## monthjan -0.682928 1.076240 -0.635 0.526020
## monthjul 0.436781 0.683625 0.639 0.523176
## monthjun -0.394812 0.630963 -0.626 0.531786
## monthmar -0.230436 0.493123 -0.467 0.640495
## monthmay 0.736262 1.056583 0.697 0.486240
## monthnov -0.967716 1.428329 -0.678 0.498400
## monthoct 1.315924 0.957744 1.374 0.170079
## monthsep 1.360462 0.887515 1.533 0.125952
## DMC 0.004546 0.001774 2.563 0.010675 *
## DC -0.002498 0.001231 -2.029 0.042954 *
## temp 0.023922 0.014498 1.650 0.099583 .
## wind 0.058852 0.036372 1.618 0.106294
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.345 on 487 degrees of freedom
## Multiple R-squared: 0.1268, Adjusted R-squared: 0.07477
## F-statistic: 2.438 on 29 and 487 DF, p-value: 6e-05
# retirar coordenadas X e Y
fit4 <- lm(larea ~ ., data = dat[,-c(1:3,13)]) # tirando a colunas 1,2, 13
summary(fit4)
##
## Call:
## lm(formula = larea ~ ., data = dat[, -c(1:3, 13)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6311 -1.0992 -0.6231 0.9036 5.5827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0056581 1.3793960 0.004 0.9967
## daymon 0.1524205 0.2266247 0.673 0.5015
## daysat 0.2844762 0.2183396 1.303 0.1932
## daysun 0.1898639 0.2116043 0.897 0.3700
## daythu 0.0160972 0.2399185 0.067 0.9465
## daytue 0.2529683 0.2337933 1.082 0.2798
## daywed 0.1588203 0.2463140 0.645 0.5194
## FFMC 0.0089740 0.0146007 0.615 0.5391
## DMC 0.0012857 0.0014810 0.868 0.3857
## DC 0.0002961 0.0003599 0.823 0.4111
## ISI -0.0240272 0.0171792 -1.399 0.1625
## temp -0.0005126 0.0176219 -0.029 0.9768
## RH -0.0057340 0.0053279 -1.076 0.2823
## wind 0.0778966 0.0371443 2.097 0.0365 *
## rain 0.0853377 0.2149473 0.397 0.6915
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.4 on 502 degrees of freedom
## Multiple R-squared: 0.02508, Adjusted R-squared: -0.002109
## F-statistic: 0.9224 on 14 and 502 DF, p-value: 0.5342
##
## Call:
## lm(formula = larea ~ DMC + RH + wind, data = dat[, -c(1:3, 13)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5093 -1.1071 -0.6537 0.9216 5.7712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9129363 0.2433065 3.752 0.000195 ***
## DMC 0.0017552 0.0009657 1.817 0.069731 .
## RH -0.0055830 0.0037785 -1.478 0.140138
## wind 0.0624134 0.0345109 1.809 0.071112 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.392 on 513 degrees of freedom
## Multiple R-squared: 0.01425, Adjusted R-squared: 0.008485
## F-statistic: 2.472 on 3 and 513 DF, p-value: 0.06101
Exercício 9.2 Nos dados do Exemplo 9.9, considere utilizar o modelo beta inflated implementado em (Stasinopoulos and Rigby 2007). Dica: use a função gamlss::gamlss
com argumento family = BEINF()
. Veja https://filipezabala.com/voice/ para mais detalhes da implementação.
Exemplo 9.10 No Capítulo 12, (Dalgaard 2008) utiliza a base de dados cystfibr
, um banco de dados sobre capacidade respiratória discutido por (Altman 1990). Abaixo estão as descrições das variáveis observadas, com volumes indicados em decilitros.
Idade
: idade em anos
Sexo
: 0 = masculino, 1 = feminino
IMC
: Índice de Massa Corporal (Peso/Altura\(^2\)) como um percentual da mediana de indivíduos normais por idade
VEF1
: Volume de Expiração Forçado em 1 segundo
VR
: Volume Residual, o volume restante de ar nos pulmões após uma expiração forçada
CRF
: Capacidade Residual Funcional, o volume nos pulmões ao final da posição normal de expiração
CPT
: Capacidade Pulmonar Total
PEmax
: Pressão expiratória estática máxima, variável dependente que indica a saúde do sistema respiratório (maior, melhor)
cystfibr <- read.table('https://filipezabala.com/data/cystfibr.dat', head = T)
pairs(cystfibr, gap=0, cex.labels=0.9) # matriz de correlação
fit <- lm(Pemax ~ ., data = cystfibr) # ajustando modelo saturado (com todas as candidatas)
summary(fit)
##
## Call:
## lm(formula = Pemax ~ ., data = cystfibr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.776 -17.540 3.971 14.584 36.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.1121 185.0664 -0.055 0.957
## Idade -0.4131 4.6791 -0.088 0.931
## Sexo -4.3718 16.4014 -0.267 0.793
## Altura 0.1883 0.8511 0.221 0.828
## Peso 1.2040 1.4592 0.825 0.422
## IMC -0.3796 0.5801 -0.654 0.523
## VEF1 0.8295 1.1885 0.698 0.496
## VR 0.2593 0.2040 1.271 0.223
## CRF -0.4793 0.5165 -0.928 0.368
## TLC 0.5349 0.4802 1.114 0.283
##
## Residual standard error: 26.94 on 15 degrees of freedom
## Multiple R-squared: 0.599, Adjusted R-squared: 0.3584
## F-statistic: 2.49 on 9 and 15 DF, p-value: 0.05706
##
## Call:
## lm(formula = Pemax ~ Peso + VEF1 + VR + CRF + TLC, data = cystfibr)
##
## Coefficients:
## (Intercept) Peso VEF1 VR CRF TLC
## -32.0892 1.2631 0.9593 0.3113 -0.5624 0.5943
fit2 <- lm(Pemax ~ Peso + VEF1 + VR + CRF + TLC, data = cystfibr) # modelo recomendado por step
summary(fit2)
##
## Call:
## lm(formula = Pemax ~ Peso + VEF1 + VR + CRF + TLC, data = cystfibr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.310 -16.724 0.722 19.260 32.688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.0892 57.1098 -0.562 0.58076
## Peso 1.2631 0.3626 3.484 0.00248 **
## VEF1 0.9593 0.6121 1.567 0.13358
## VR 0.3113 0.1483 2.099 0.04940 *
## CRF -0.5624 0.3289 -1.710 0.10351
## TLC 0.5943 0.4234 1.404 0.17653
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.58 on 19 degrees of freedom
## Multiple R-squared: 0.5772, Adjusted R-squared: 0.4659
## F-statistic: 5.187 on 5 and 19 DF, p-value: 0.003615
fit3 <- lm(formula = Pemax ~ Peso + VEF1 + VR - 1, data = cystfibr) # modelo ajustado manualmente
summary(fit3)
##
## Call:
## lm(formula = Pemax ~ Peso + VEF1 + VR - 1, data = cystfibr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.71 -18.10 -0.36 19.39 41.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Peso 1.25825 0.30688 4.100 0.000473 ***
## VEF1 0.94792 0.41104 2.306 0.030903 *
## VR 0.11159 0.03432 3.251 0.003660 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.84 on 22 degrees of freedom
## Multiple R-squared: 0.9585, Adjusted R-squared: 0.9528
## F-statistic: 169.4 on 3 and 22 DF, p-value: 2.383e-15
Exercício 9.3 Acesse o endereço http://archive.ics.uci.edu/datasets e escolha um banco de dados com mais de 100 atributos (variáveis/colunas). Faça análises considerando modelos lineares e, se pertinente, utilizando componentes principais. \(\\\)
9.5.2 Bayesiano
https://alpopkes.com/posts/machine_learning/bayesian_linear_regression/
Exemplo 9.11 Ajustando os modelos lineares clássico e bayesiano para o banco de dados datasets::cars
.
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
# Regressão linear bayesiana via `rstanarm::stan_glm`
library(rstan)
library(rstanarm)
options(mc.cores = parallel::detectCores())
fit_stan <- rstanarm::stan_glm(dist ~ speed, data = cars, family = gaussian)
rstan::stan_trace(fit_stan, pars = c('(Intercept)', 'speed', 'sigma'))
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: dist ~ speed
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 50
## predictors: 2
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -17.5 6.9 -26.5 -17.4 -8.6
## speed 3.9 0.4 3.4 3.9 4.5
## sigma 15.7 1.6 13.7 15.6 17.8
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 43.0 3.2 38.9 43.1 47.1
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.1 1.0 3698
## speed 0.0 1.0 3611
## sigma 0.0 1.0 3245
## mean_PPD 0.1 1.0 3725
## log-posterior 0.0 1.0 1879
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# rstanarm::posterior_vs_prior(fit_stan, group_by_parameter = TRUE,
# pars = c('(Intercept)', 'speed', 'sigma'))
Exercício 9.4 Veja as seguintes documentações:
a. ?rstanarm::print.stanreg
.
b. ?rstanarm::prior_summary
.
c. ?rstanarm::summary.stanreg
.
d. ?rstanarm::pp_check
.
e. ?rstanarm::launch_shinystan
.
f. ?rstanarm::posterior_vs_prior
.
9.5.3 Multicolinearidade
Segundo (Everitt and Skrondal 2006), multicolinearidade é um termo utilizado na análise de regressão para indicar situações em que as variáveis explicativas estão relacionadas por uma função linear, impossibilitando a estimação dos coeficientes de regressão.
Exemplo 9.12 Dalpiaz (2022) - Applied Statistics with R traz alguns exemplos sobre colinearidade.
gen_exact_collin_data <- function(num_samples = 100) {
x1 = rnorm(n = num_samples, mean = 80, sd = 10)
x2 = rnorm(n = num_samples, mean = 70, sd = 5)
x3 = 2 * x1 + 4 * x2 + 3
y = 3 + x1 + x2 + rnorm(n = num_samples, mean = 0, sd = 1)
data.frame(y, x1, x2, x3)
}
set.seed(42)
col_data <- gen_exact_collin_data()
head(col_data)
## y x1 x2 x3
## 1 170.7135 93.70958 76.00483 494.4385
## 2 152.9106 74.35302 75.22376 452.6011
## 3 152.7866 83.63128 64.98396 430.1984
## 4 170.6306 86.32863 79.24241 492.6269
## 5 152.3320 84.04268 66.66613 437.7499
## 6 151.3155 78.93875 70.52757 442.9878
## y x1 x2 x3
## y 1.0000000 0.91124848 0.43068603 0.9557550
## x1 0.9112485 1.00000000 0.03127984 0.7638609
## x2 0.4306860 0.03127984 1.00000000 0.6689586
## x3 0.9557550 0.76386089 0.66895856 1.0000000
## y x1 x2 x3
## y 1
## x1 * 1
## x2 . 1
## x3 B , , 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = col_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.57662 -0.66188 -0.08253 0.63706 2.52057
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.957336 1.735165 1.704 0.0915 .
## x1 0.985629 0.009788 100.702 <2e-16 ***
## x2 1.017059 0.022545 45.112 <2e-16 ***
## x3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 97 degrees of freedom
## Multiple R-squared: 0.9923, Adjusted R-squared: 0.9921
## F-statistic: 6236 on 2 and 97 DF, p-value: < 2.2e-16
X <- cbind(1, as.matrix(col_data[,-1]))
fit1 <- lm(y ~ x1 + x2, data = col_data)
fit2 <- lm(y ~ x1 + x3, data = col_data)
fit3 <- lm(y ~ x2 + x3, data = col_data)
all.equal(fitted(fit1), fitted(fit2))
## [1] TRUE
## [1] TRUE
9.5.3.1 Pacote mctest
O pacote mctest
(Imdad et al. 2019) pode ser utilizado para calcular medidas de diagnóstico de multicolinearidade populares e amplamente utilizadas.
Exemplo 9.13 (Woods, Steinour, and Starke 1932) estudam a associação do calor evoluído durante a fabricação de 13 misturas de cimento de quatro ingredientes básicos. Cada porcentagem de ingrediente parece ser arredondada para um número inteiro. A soma das quatro porcentagens de mistura varia 95% a 99%. Se todas as quatro variáveis \(X\) do regressor sempre somassem 100%, a matriz \(X\) centrada seria então de posto 3. A primeira coluna é a resposta e as quatro colunas restantes são os preditores.
Y
: Calor (cals/gm) evoluído na composição.
X1
: Porcentagem inteira de \(3Ca O Al_2 O_3\) na mistura.
X2
: Porcentagem inteira de \(3Ca O Si O_2\) na mistura.
X3
: Porcentagem inteira de \(4Ca O Al_2 O_3 Fe_2 O_3\) na mistura.
X4
: Porcentagem inteira de \(2Ca O Si O_2\) na mistura.
library(mctest)
data(Hald) ## Hald Cement data
model <- lm(y ~ ., data = as.data.frame(Hald))
## Medidas de diagnóstico geral e autovalores com intercepto
mctest(model)
##
## Call:
## omcdiag(mod = mod, Inter = TRUE, detr = detr, red = red, conf = conf,
## theil = theil, cn = cn)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.0011 1
## Farrar Chi-Square: 67.2825 1
## Red Indicator: 0.5414 1
## Sum of Lambda Inverse: 622.3006 1
## Theil's Method: 0.9981 1
## Condition Number: 249.5783 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## Call:
## imcdiag(mod = mod, method = method, corr = FALSE, vif = vif,
## tol = tol, conf = conf, cvif = cvif, ind1 = ind1, ind2 = ind2,
## leamer = leamer, all = all)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein IND1 IND2
## X1 38.4962 0.0260 112.4886 187.4811 0.1612 -0.5846 0 0.0087 0.9875
## X2 254.4232 0.0039 760.2695 1267.1158 0.0627 -3.8635 1 0.0013 1.0099
## X3 46.8684 0.0213 137.6052 229.3419 0.1461 -0.7117 0 0.0071 0.9923
## X4 282.5129 0.0035 844.5386 1407.5643 0.0595 -4.2900 1 0.0012 1.0103
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## X1 , X2 , X3 , X4 , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.9824
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
9.5.3.2 Fator de Inflação da Variância (VIF)
(Chatterjee and Hadi 2012, 248–51) definem o Fator de Inflação da Variância (VIF - Variance Inflation Factor) para a variável \(X_j\) tal que
\[\begin{equation} VIF_j = \frac{1}{1-r_{j}^2}, \; j=1,\ldots,p \tag{9.2} \end{equation}\]
onde \(r_{j}^2\) é o coeficiente de determinação que resulta quando a variável preditora \(X_j\) é regredida contra todas as outras variáveis preditoras. Se \(X_j\) tiver um forte relacionamento linear com as outras variáveis preditoras, \(r_{j}^2\) deve ficar próximo de 1 e \(VIF_j\) seria grande. Como regra de bolso, valores de \(VIF\) maiores que 10 geralmente são tomados como um sinal de que os dados têm problema de colinearidade.
9.5.3.4 Diagnóstico de Belsley, Kuh e Welsch
(Belsley, Kuh, and Welsch 2004, 112–52) sugerem um procedimento de diagnóstico para detectar colinearidade em matrizes \(X\).