10.5 Regressão Linear

A classe de Modelos Lineares atende a uma ampla gama de problemas aplicados, apresentada em profundidade por (Neter et al. 2005). Este modelos são casos particulares das classes de Modelos Lineares Generalizados (MLG/GLM) (McCullagh and Nelder 1989), Modelos Aditivos Generalizados (MAG/GAM) (Hastie and Tibshirani 1986), dentre outras. (Friedman and Stuetzle 1981) sugerem a Regressão com Perseguição da Projeção, procedimento que modela a superfície de regressão como uma soma de funções de suavização de combinações lineares das variáveis preditoras de forma iterativa. (O’Hagan and Forster 1994, 244–76) e (Paulino, Turkman, and Murteira 2003, 194–208) fornecem uma abordagem bayesiana.

10.5.1 Clássico

Quando trabalha-se com dados multivariados (i.e., quase sempre), recomenda-se a exploração criteoriosa dos dados. Não raro ocorrem valores faltantes (missings), rotulados por NA na linguagem R conforme Seção ??. Tais valores faltantes podem dificultar os procedimentos de modelagem, e sua ocorrência se soma a outras questões como a multicolinearidade.

10.5.1.1 Modelo

O modelo de Regressão Linear Múltipla (RLM) populacional/universal é construído, na abordagem clássica, com todos as \(N\)-uplas da população ou universo, i.e., \((x_{1i},x_{2i},\ldots,x_{pi},y_i), \; i \in \{1,\ldots,N\}\). Pode ser descrito pela Eq. (10.14).

\[\begin{equation} Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \tag{10.14} \end{equation}\] onde \(\varepsilon \sim \mathcal{N}(0,\sigma_{\varepsilon})\).

Por ter dimensionalidade \(p\) usualmente utiliza-se notação matricial na forma

\[\begin{equation} Y = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \tag{10.15} \end{equation}\]

onde \(Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}\), \(X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}\), \(\boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}\) e \(\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}\).

Estimativa dos parâmetros

Para a obtenção das estimativas dos parâmetros utiliza-se a Eq. (10.16). Para ajustar um modelo RPO, basta eliminar a coluna unitária de \(X\). \[\begin{equation} \boldsymbol{\hat{\beta}} = (X'X)^{-1} X'Y \tag{10.16} \end{equation}\]

Exemplo 10.8 Em http://archive.ics.uci.edu/ml/datasets/Energy+efficiency está disponível uma análise de energia feita por (Tsanas and Xifara 2012) usando 12 formas diferentes de construção simuladas no Ecotect. Os edifícios diferem em relação à área envidraçada, à distribuição da área envidraçada e à orientação, entre outros parâmetros. Foram simuladas várias configurações como funções das características acima mencionadas para obter 768 formas de construção. O conjunto de dados detalhado a seguir compreende 768 amostras e 8 características (X1 a X8), com o objetivo de prever duas respostas reais (Y1 e Y2).

X1: Compactação Relativa X2: Superfície X3: Área da parede X4: Área do telhado X5: Altura total X6: Orientação X7: Área de Envidraçamento X8: Distribuição da Área de Envidraçamento Y1: Carga de aquecimento Y2: Carga de resfriamento

# libs
library(readxl)

# arquivo
url1 = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx'
download.file(url1, 'temp.xlsx', mode = 'wb')
energy = read_excel('temp.xlsx')

# dando uma olhada nas variáveis
str(energy)

## tibble [768 × 10] (S3: tbl_df/tbl/data.frame)
##  $ X1: num [1:768] 0.98 0.98 0.98 0.98 0.9 0.9 0.9 0.9 0.86 0.86 ...
##  $ X2: num [1:768] 514 514 514 514 564 ...
##  $ X3: num [1:768] 294 294 294 294 318 ...
##  $ X4: num [1:768] 110 110 110 110 122 ...
##  $ X5: num [1:768] 7 7 7 7 7 7 7 7 7 7 ...
##  $ X6: num [1:768] 2 3 4 5 2 3 4 5 2 3 ...
##  $ X7: num [1:768] 0 0 0 0 0 0 0 0 0 0 ...
##  $ X8: num [1:768] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Y1: num [1:768] 15.6 15.6 15.6 15.6 20.8 ...
##  $ Y2: num [1:768] 21.3 21.3 21.3 21.3 28.3 ...

# exploratória
summary(energy)

##        X1               X2              X3              X4              X5             X6      
##  Min.   :0.6200   Min.   :514.5   Min.   :245.0   Min.   :110.2   Min.   :3.50   Min.   :2.00  
##  1st Qu.:0.6825   1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9   1st Qu.:3.50   1st Qu.:2.75  
##  Median :0.7500   Median :673.8   Median :318.5   Median :183.8   Median :5.25   Median :3.50  
##  Mean   :0.7642   Mean   :671.7   Mean   :318.5   Mean   :176.6   Mean   :5.25   Mean   :3.50  
##  3rd Qu.:0.8300   3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5   3rd Qu.:7.00   3rd Qu.:4.25  
##  Max.   :0.9800   Max.   :808.5   Max.   :416.5   Max.   :220.5   Max.   :7.00   Max.   :5.00  
##        X7               X8              Y1              Y2       
##  Min.   :0.0000   Min.   :0.000   Min.   : 6.01   Min.   :10.90  
##  1st Qu.:0.1000   1st Qu.:1.750   1st Qu.:12.99   1st Qu.:15.62  
##  Median :0.2500   Median :3.000   Median :18.95   Median :22.08  
##  Mean   :0.2344   Mean   :2.812   Mean   :22.31   Mean   :24.59  
##  3rd Qu.:0.4000   3rd Qu.:4.000   3rd Qu.:31.67   3rd Qu.:33.13  
##  Max.   :0.4000   Max.   :5.000   Max.   :43.10   Max.   :48.03

sapply(energy, sd)

##         X1         X2         X3         X4         X5         X6         X7         X8         Y1         Y2 
##  0.1057775 88.0861161 43.6264814 45.1659502  1.7511404  1.1187626  0.1332206  1.5509597 10.0902040  9.5133056

# matriz de dispersão
pairs(energy, panel = panel.smooth, main = "Energy")

# histograma de Y1
hist(energy$Y1, 30)

# teste de normalidade
shapiro.test(energy$Y1)

## 
##  Shapiro-Wilk normality test
## 
## data:  energy$Y1
## W = 0.9121, p-value < 2.2e-16

# modelo saturado
fit0 <- lm(Y1 ~ . -Y2, data = energy)
summary(fit0)

## 
## Call:
## lm(formula = Y1 ~ . - Y2, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8965 -1.3196 -0.0252  1.3532  7.7052 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.013418  19.033613   4.414 1.16e-05 ***
## X1          -64.773432  10.289448  -6.295 5.19e-10 ***
## X2           -0.087289   0.017075  -5.112 4.04e-07 ***
## X3            0.060813   0.006648   9.148  < 2e-16 ***
## X4                  NA         NA      NA       NA    
## X5            4.169954   0.337990  12.338  < 2e-16 ***
## X6           -0.023330   0.094705  -0.246  0.80548    
## X7           19.932736   0.813986  24.488  < 2e-16 ***
## X8            0.203777   0.069918   2.915  0.00367 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.934 on 760 degrees of freedom
## Multiple R-squared:  0.9162, Adjusted R-squared:  0.9154 
## F-statistic:  1187 on 7 and 760 DF,  p-value: < 2.2e-16

# filtrando as variáveis com stepwise
fit1 <- step(fit0, trace = 0)
summary(fit1)

## 
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X5 + X7 + X8, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9315 -1.3189 -0.0262  1.3587  7.7169 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.931762  19.018978   4.413 1.17e-05 ***
## X1          -64.773432  10.283096  -6.299 5.06e-10 ***
## X2           -0.087289   0.017065  -5.115 3.97e-07 ***
## X3            0.060813   0.006644   9.153  < 2e-16 ***
## X5            4.169954   0.337781  12.345  < 2e-16 ***
## X7           19.932736   0.813484  24.503  < 2e-16 ***
## X8            0.203777   0.069875   2.916  0.00365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared:  0.9162, Adjusted R-squared:  0.9155 
## F-statistic:  1387 on 6 and 761 DF,  p-value: < 2.2e-16

# previsão usando fit1
X1 <- min(energy$X1)
X2 <- min(energy$X2)
X3 <- max(energy$X3)
X5 <- max(energy$X5)
X7 <- max(energy$X7)
X8 <- max(energy$X8)
c1 <- coef(fit1)
c1[1] + c1[2]*X1 + c1[3]*X2 + c1[4]*X3 + c1[5]*X5 + c1[6]*X7 + c1[7]*X8

## (Intercept) 
##    62.37222

X1 <- max(energy$X1)
X2 <- max(energy$X2)
X3 <- min(energy$X3)
X5 <- min(energy$X5)
X7 <- min(energy$X7)
X8 <- min(energy$X8)
c1[1] + c1[2]*X1 + c1[3]*X2 + c1[4]*X3 + c1[5]*X5 + c1[6]*X7 + c1[7]*X8

## (Intercept) 
##   -20.62558

summary(energy)

##        X1               X2              X3              X4              X5             X6      
##  Min.   :0.6200   Min.   :514.5   Min.   :245.0   Min.   :110.2   Min.   :3.50   Min.   :2.00  
##  1st Qu.:0.6825   1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9   1st Qu.:3.50   1st Qu.:2.75  
##  Median :0.7500   Median :673.8   Median :318.5   Median :183.8   Median :5.25   Median :3.50  
##  Mean   :0.7642   Mean   :671.7   Mean   :318.5   Mean   :176.6   Mean   :5.25   Mean   :3.50  
##  3rd Qu.:0.8300   3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5   3rd Qu.:7.00   3rd Qu.:4.25  
##  Max.   :0.9800   Max.   :808.5   Max.   :416.5   Max.   :220.5   Max.   :7.00   Max.   :5.00  
##        X7               X8              Y1              Y2       
##  Min.   :0.0000   Min.   :0.000   Min.   : 6.01   Min.   :10.90  
##  1st Qu.:0.1000   1st Qu.:1.750   1st Qu.:12.99   1st Qu.:15.62  
##  Median :0.2500   Median :3.000   Median :18.95   Median :22.08  
##  Mean   :0.2344   Mean   :2.812   Mean   :22.31   Mean   :24.59  
##  3rd Qu.:0.4000   3rd Qu.:4.000   3rd Qu.:31.67   3rd Qu.:33.13  
##  Max.   :0.4000   Max.   :5.000   Max.   :43.10   Max.   :48.03

c1[1] + c1[2]*0.7 + c1[3]*808.5 + c1[4]*318.5 + c1[5]*4 + c1[6]*0.4 + c1[7]*0

## (Intercept) 
##    12.03883

newx <- data.frame(X1 = min(energy$X1), X2 = min(energy$X2), X3 = max(energy$X3),
                  X5 = max(energy$X5), X7 = max(energy$X7), X8 = max(energy$X8))
par(mfrow=c(1,1))
plot(predict(fit1))

Após o primeiro ajuste atribuído a fit0 é possível notar que o coeficiente da variável X4 não é possível de ser calculado devido a singularidades, i.e, impossibilidade de inversão das matrizes do modelo. Sendo assim, ao modelo saturado (contendo todas as variáveis candidatas) aplicou-se o método de stepwise, proposto por (Efroymson 1960) e utilizado para selecionar variáveis. Este método busca automaticamente o melhor conjunto de variáveis de maneira a minimizar alguma medida, usualmente o Critério de Informação de Akaike (AIC, na sigla em inglês), sugerido por (Akaike 1974). De acordo com a métrica do stepwise, quanto menor o valor de AIC, melhor a combinação das variáveis.

Pelos resultados obtidos pode-se verificar que o modelo ajustado em fit1 possui todas as variávies significativas para um \(\alpha\) inferior a 0.01, estatística F de 1387 para 6 e 761 graus de liberdade, levando a um p-value geral do modelo menor que \(2.2 \times 10^{-16}\), o que indica boa aderência aos dados. O valor do Multiple R-squared é de 0.9162, indicando que o modelo explica em torno de 92% da variação de Y1. Desta forma este é um modelo aceitável, que possui coeficientes de X1 e X2 negativos, indicando que um aumento destas variáveis (respectivamente compactação relativa e superfície) deve reduzir a carga de aquecimento. Mais especificamente, um aumento de uma unidade na compactação relativa (X1) gera uma redução esperada de 64.77 unidades na carga de aquecimento, mantidas constantes as demais variáveis. As variáveis X3, X5, X7 e X8 possuem coeficientes positivos, levando a um impacto esperado positivo em Y1. Como exemplo, para cada aumento de uma unidade na altura total (X5) espera-se um aumento aproximando de 4.17 unidades na carga de aquecimento. As outras variáveis possuem interpretação análoga, devendo-se sempre observar o sinal dos coeficientes.

Exemplo 10.9 Considere os dados do Exemplo 10.15. O ajuste do modelo é dado da mesma forma que na RLS, bastando separar as variáveis pelo símbolo +.

(fit <- lm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = energy))

## 
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = energy)
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4           X5           X6           X7  
##    84.01342    -64.77343     -0.08729      0.06081           NA      4.16995     -0.02333     19.93274  
##          X8  
##     0.20378

Exercício 10.1 Considere a a Eq. (10.16).
a. Obtenha as estimativas dos modelos utilzados nos Exemplos ?? e ??.
b. Obtenha as estimativas do modelo RLM utilzado no Exemplo 10.9. Dica: retire a variável X4.
c. Obtenha as estimativas dos coeficientes do modelo RPO do Exemplo 10.9.

10.5.1.2 Diagnóstico inferencial

Stepwise

O método de stepwise foi proposto por (Efroymson 1960) com o intuito de selecionar variáveis em regressões múltiplas conforme Eq. (10.14). De acordo com o autor¹², no procedimento de stepwise os resultados intermediários dos métodos para a obtenção das estimativas de regressões múltiplas “são usados para fornecer informações estatísticas valiosas em cada etapa do cálculo”. O autor também ressalta¹³ que o método “é baseado nos fatos que (a) uma variável pode ser indicada como significativa em qualquer estágio inicial e, assim, entrar na equação e (b) após várias outras variáveis serem adicionadas à equação de regressão, o variável inicial pode ser indicada como insignificante. A variável insignificante será removida da equação de regressão antes de adicionar uma nova variável. Portanto, apenas variáveis significativas são incluídas na regressão final”.

Este método busca automaticamente o melhor conjunto de variáveis de maneira a minimizar alguma medida, tal como AIC (Akaike 1974), BIC (Schwarz 1978), dentre outras. Na implementação de R documentada em stats::step, o modelo selecionado será aquele com o menor valor de AIC. Note que ‘seleção de modelos’ e ‘seleção de variáveis’ são sinônimos.

Exemplo 10.10 Considere os dados do Exemplo 10.9. Pode-se selecionar o melhor conjunto de variáveis de acordo com o método de stepwise.

(fit_step <- step(fit, trace = 0))

## 
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X5 + X7 + X8, data = energy)
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X5           X7           X8  
##    83.93176    -64.77343     -0.08729      0.06081      4.16995     19.93274      0.20378

Teste para \(\boldsymbol{\beta}\)

Exemplo 10.11 Considere os dados do Exemplo 10.10. Pode-se gerar um relatório do modelo desejado com base::summary.

summary(fit_step)

## 
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X5 + X7 + X8, data = energy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9315 -1.3189 -0.0262  1.3587  7.7169 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  83.931762  19.018978   4.413 1.17e-05 ***
## X1          -64.773432  10.283096  -6.299 5.06e-10 ***
## X2           -0.087289   0.017065  -5.115 3.97e-07 ***
## X3            0.060813   0.006644   9.153  < 2e-16 ***
## X5            4.169954   0.337781  12.345  < 2e-16 ***
## X7           19.932736   0.813484  24.503  < 2e-16 ***
## X8            0.203777   0.069875   2.916  0.00365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared:  0.9162, Adjusted R-squared:  0.9155 
## F-statistic:  1387 on 6 and 761 DF,  p-value: < 2.2e-16

10.5.1.3 Diagnóstico preditivo

Exercício 10.2 Acesse o endereço http://archive.ics.uci.edu/ml/datasets.php e escolha um banco de dados com mais de 100 atributos (variáveis/colunas). Faça análises considerando modelos lineares e, se pertinente, utilizando componentes principais. \(\\\)

Exercício 10.3 Refaça o Exemplo 10.8 utilizando a variável Y2 como resposta, ajustando o modelo saturado, filtrando as variáveis com a função step e interpretando os resultados.

Exemplo 10.12 (Belsley, Kuh, and Welsch 2004, 39) analisam a hipótese desenvolvida por (Modigliani and Papademos 1975), de que a a razão de poupança ao longo da vida (poupança pessoal agregada dividido pelo rendimento disponível) é explicada pelo rendimento disponível per capita, a taxa de variação percentual do rendimento disponível per capita e duas variáveis demográficas: a proporção da população com menos de 15 anos e a proporção da população com mais de 75 anos. Os dados são calculados ao longo da década de 1960-1970 para remover o ciclo de negócios ou outras flutuações de curto prazo.

pairs(LifeCycleSavings, panel = panel.smooth, main = "LifeCycleSavings")

hist(LifeCycleSavings$sr, 30)

plot(ecdf(LifeCycleSavings$sr))

qqnorm(LifeCycleSavings$sr)
qqline(LifeCycleSavings$sr, col = 'red')

shapiro.test(LifeCycleSavings$sr)

## 
##  Shapiro-Wilk normality test
## 
## data:  LifeCycleSavings$sr
## W = 0.98074, p-value = 0.5836

fm0 <- lm(sr ~ ., data = LifeCycleSavings)
summary(fm0)

## 
## Call:
## lm(formula = sr ~ ., data = LifeCycleSavings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2422 -2.6857 -0.2488  2.4280  9.7509 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
## pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
## pop75       -1.6914977  1.0835989  -1.561 0.125530    
## dpi         -0.0003369  0.0009311  -0.362 0.719173    
## ddpi         0.4096949  0.1961971   2.088 0.042471 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 45 degrees of freedom
## Multiple R-squared:  0.3385, Adjusted R-squared:  0.2797 
## F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904

fm1 <- step(fm0, trace = 0)
summary(fm1)

## 
## Call:
## lm(formula = sr ~ pop15 + pop75 + ddpi, data = LifeCycleSavings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2539 -2.6159 -0.3913  2.3344  9.7070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.1247     7.1838   3.915 0.000297 ***
## pop15        -0.4518     0.1409  -3.206 0.002452 ** 
## pop75        -1.8354     0.9984  -1.838 0.072473 .  
## ddpi          0.4278     0.1879   2.277 0.027478 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.767 on 46 degrees of freedom
## Multiple R-squared:  0.3365, Adjusted R-squared:  0.2933 
## F-statistic: 7.778 on 3 and 46 DF,  p-value: 0.0002646

mctest::mctest(fm1)

## 
## Call:
## omcdiag(mod = mod, Inter = TRUE, detr = detr, red = red, conf = conf, 
##     theil = theil, cn = cn)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.1739         0
## Farrar Chi-Square:        82.4971         1
## Red Indicator:             0.5254         1
## Sum of Lambda Inverse:    12.4857         0
## Theil's Method:            0.9827         1
## Condition Number:         31.7169         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

sort(car::vif(fm1), decreasing = TRUE)

##    pop15    pop75     ddpi 
## 5.745478 5.736014 1.004186

Exercício 10.4 Considere o Exemplo 10.12.
a. A partir da saída de car::vif, você recomenda retirar alguma variável do modelo? Em caso afirmativo, execute sua recomendação.
b. Teste jtools::summ(fm1).

Exemplo 10.13 Considere o banco de dados disponível em https://archive.ics.uci.edu/dataset/162/forest+fires baseado em (Cortez and Morais 2007).

X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
month - month of the year: ‘jan’ to ‘dec’
day - day of the week: ‘mon’ to ‘sun’
FFMC - FFMC index from the FWI system: 18.7 to 96.20
DMC - DMC index from the FWI system: 1.1 to 291.3
DC - DC index from the FWI system: 7.9 to 860.6
ISI - ISI index from the FWI system: 0.0 to 56.10
temp - temperature in Celsius degrees: 2.2 to 33.30
RH - relative humidity in %: 15.0 to 100
wind - wind speed in km/h: 0.40 to 9.40
rain - outside rain in mm/m2 : 0.0 to 6.4
area - the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).

dat <- read.table('https://filipezabala.com/data/forestfires.csv',
                 sep = ',', head = T)
head(dat)

##   X Y month day FFMC  DMC    DC  ISI temp RH wind rain area
## 1 7 5   mar fri 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0    0
## 2 7 4   oct tue 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0    0
## 3 7 4   oct sat 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0    0
## 4 8 6   mar fri 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2    0
## 5 8 6   mar sun 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0    0
## 6 8 6   aug sun 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0    0

table(dat$X)

## 
##  1  2  3  4  5  6  7  8  9 
## 48 73 55 91 30 86 60 61 13

dat$X <- factor(dat$X, ordered = T)
dat$Y <- factor(dat$Y, ordered = T)
str(dat)

## 'data.frame':    517 obs. of  13 variables:
##  $ X    : Ord.factor w/ 9 levels "1"<"2"<"3"<"4"<..: 7 7 7 8 8 8 8 8 8 7 ...
##  $ Y    : Ord.factor w/ 7 levels "2"<"3"<"4"<"5"<..: 4 3 3 5 5 5 5 5 5 4 ...
##  $ month: chr  "mar" "oct" "oct" "mar" ...
##  $ day  : chr  "fri" "tue" "sat" "fri" ...
##  $ FFMC : num  86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
##  $ DMC  : num  26.2 35.4 43.7 33.3 51.3 ...
##  $ DC   : num  94.3 669.1 686.9 77.5 102.2 ...
##  $ ISI  : num  5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
##  $ temp : num  8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
##  $ RH   : int  51 33 33 97 99 29 27 86 63 40 ...
##  $ wind : num  6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
##  $ rain : num  0 0 0 0.2 0 0 0 0 0 0 ...
##  $ area : num  0 0 0 0 0 0 0 0 0 0 ...

# descritivas
hist(dat$area, breaks = 30)

larea <- log(dat$area+1)
hist(larea, breaks = 30)

# modelo saturado
fit0 = lm(area ~ ., data = dat)
summary(fit0)

## 
## Call:
## lm(formula = area ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -56.15  -16.76   -5.70    4.57 1030.29 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.331145  77.828817  -0.004  0.99661   
## X.L          23.833501  16.537260   1.441  0.15018   
## X.Q          10.356880  16.359638   0.633  0.52699   
## X.C           2.497455  13.957082   0.179  0.85806   
## X^4           5.218877  12.455178   0.419  0.67540   
## X^5           3.402785  10.728082   0.317  0.75124   
## X^6           1.491091  10.510319   0.142  0.88724   
## X^7          -4.756939   8.452755  -0.563  0.57386   
## X^8         -15.109426  10.502914  -1.439  0.15092   
## Y.L          36.793833  34.065026   1.080  0.28064   
## Y.Q         -13.635847  21.961665  -0.621  0.53497   
## Y.C         -85.495641  32.367580  -2.641  0.00853 **
## Y^4         -88.552976  38.881540  -2.278  0.02320 * 
## Y^5         -74.363445  29.330619  -2.535  0.01155 * 
## Y^6         -28.773038  14.662296  -1.962  0.05030 . 
## monthaug     45.291708  39.079683   1.159  0.24705   
## monthdec     47.250120  37.919912   1.246  0.21336   
## monthfeb     10.241386  26.388330   0.388  0.69811   
## monthjan     19.196597  56.733553   0.338  0.73524   
## monthjul     32.034695  33.938181   0.944  0.34569   
## monthjun      4.107370  31.059969   0.132  0.89485   
## monthmar     -2.043844  23.713037  -0.086  0.93135   
## monthmay     12.867123  51.161618   0.251  0.80154   
## monthnov      0.142191  68.832738   0.002  0.99835   
## monthoct     73.076570  46.567582   1.569  0.11725   
## monthsep     74.178671  43.456313   1.707  0.08848 . 
## daymon        5.777937  10.629021   0.544  0.58697   
## daysat       18.616813  10.195896   1.826  0.06849 . 
## daysun        6.508460   9.884849   0.658  0.51058   
## daythu        8.690794  11.201853   0.776  0.43823   
## daytue        5.888997  10.973460   0.537  0.59175   
## daywed        1.528321  11.609636   0.132  0.89532   
## FFMC         -0.009119   0.774893  -0.012  0.99062   
## DMC           0.191413   0.088011   2.175  0.03013 * 
## DC           -0.125135   0.059617  -2.099  0.03634 * 
## ISI          -0.340831   0.845469  -0.403  0.68704   
## temp          1.344681   1.073510   1.253  0.21096   
## RH           -0.092294   0.295425  -0.312  0.75487   
## wind          2.123814   1.792910   1.185  0.23678   
## rain         -1.245041  10.094595  -0.123  0.90189   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.92 on 477 degrees of freedom
## Multiple R-squared:  0.06782,    Adjusted R-squared:  -0.008398 
## F-statistic: 0.8898 on 39 and 477 DF,  p-value: 0.6627

# aplicando stepwise
fit1 <- step(fit0, trace = 0)
summary(fit1)

## 
## Call:
## lm(formula = area ~ temp, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -27.34  -14.68  -10.39   -3.42 1071.33 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -7.4138     9.4996  -0.780   0.4355  
## temp          1.0726     0.4808   2.231   0.0261 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.41 on 515 degrees of freedom
## Multiple R-squared:  0.009573,   Adjusted R-squared:  0.00765 
## F-statistic: 4.978 on 1 and 515 DF,  p-value: 0.0261

par(mfrow=c(2,2))
plot(fit1, which=1:4)

# modelando log area
fit2 <- lm(larea ~ ., data = dat[,-13]) # tirando a coluna 13, area
summary(fit2)

## 
## Call:
## lm(formula = larea ~ ., data = dat[, -13])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1061 -1.0096 -0.4232  0.7837  5.2105 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.219707   1.648546   0.133 0.894034    
## X.L          0.979328   0.350287   2.796 0.005386 ** 
## X.Q          1.232741   0.346525   3.557 0.000412 ***
## X.C          0.457943   0.295635   1.549 0.122041    
## X^4          0.426364   0.263822   1.616 0.106733    
## X^5          0.379344   0.227239   1.669 0.095702 .  
## X^6         -0.143288   0.222626  -0.644 0.520128    
## X^7          0.239858   0.179044   1.340 0.180993    
## X^8         -0.565096   0.222469  -2.540 0.011398 *  
## Y.L          0.273759   0.721555   0.379 0.704559    
## Y.Q         -1.529702   0.465185  -3.288 0.001082 ** 
## Y.C         -2.297495   0.685600  -3.351 0.000869 ***
## Y^4         -2.833179   0.823577  -3.440 0.000633 ***
## Y^5         -1.922948   0.621272  -3.095 0.002083 ** 
## Y^6         -0.763359   0.310572  -2.458 0.014329 *  
## monthaug     0.615221   0.827774   0.743 0.457712    
## monthdec     2.087655   0.803208   2.599 0.009635 ** 
## monthfeb     0.356269   0.558949   0.637 0.524177    
## monthjan    -0.479557   1.201713  -0.399 0.690026    
## monthjul     0.416163   0.718868   0.579 0.562920    
## monthjun    -0.314739   0.657903  -0.478 0.632587    
## monthmar    -0.261508   0.502282  -0.521 0.602859    
## monthmay     0.782153   1.083690   0.722 0.470802    
## monthnov    -1.158103   1.457994  -0.794 0.427408    
## monthoct     1.210681   0.986380   1.227 0.220278    
## monthsep     1.319435   0.920478   1.433 0.152392    
## daymon       0.058899   0.225141   0.262 0.793734    
## daysat       0.237040   0.215966   1.098 0.272942    
## daysun       0.209163   0.209378   0.999 0.318314    
## daythu       0.035668   0.237274   0.150 0.880572    
## daytue       0.274123   0.232436   1.179 0.238849    
## daywed       0.037429   0.245912   0.152 0.879088    
## FFMC         0.008562   0.016414   0.522 0.602157    
## DMC          0.004477   0.001864   2.401 0.016714 *  
## DC          -0.002374   0.001263  -1.880 0.060758 .  
## ISI         -0.006720   0.017908  -0.375 0.707642    
## temp         0.014326   0.022739   0.630 0.528983    
## RH          -0.002568   0.006258  -0.410 0.681746    
## wind         0.060259   0.037977   1.587 0.113239    
## rain         0.121982   0.213821   0.570 0.568615    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.354 on 477 degrees of freedom
## Multiple R-squared:  0.1334, Adjusted R-squared:  0.06256 
## F-statistic: 1.883 on 39 and 477 DF,  p-value: 0.00134

# aplicando stepwise
fit3 <- step(fit2, trace = 0)
summary(fit3)

## 
## Call:
## lm(formula = larea ~ X + Y + month + DMC + DC + temp + wind, 
##     data = dat[, -13])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1879 -0.9764 -0.4079  0.7768  5.3173 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.780632   0.575046   1.358 0.175247    
## X.L          0.984402   0.345491   2.849 0.004567 ** 
## X.Q          1.229688   0.339002   3.627 0.000317 ***
## X.C          0.453981   0.291630   1.557 0.120191    
## X^4          0.406301   0.259677   1.565 0.118317    
## X^5          0.377971   0.224246   1.686 0.092529 .  
## X^6         -0.116953   0.219999  -0.532 0.595241    
## X^7          0.233594   0.174057   1.342 0.180203    
## X^8         -0.564175   0.218151  -2.586 0.009994 ** 
## Y.L          0.188704   0.713436   0.264 0.791507    
## Y.Q         -1.551066   0.453561  -3.420 0.000679 ***
## Y.C         -2.270578   0.674401  -3.367 0.000821 ***
## Y^4         -2.771889   0.808894  -3.427 0.000663 ***
## Y^5         -1.888967   0.609695  -3.098 0.002059 ** 
## Y^6         -0.753255   0.304775  -2.472 0.013795 *  
## monthaug     0.643724   0.788255   0.817 0.414531    
## monthdec     2.213019   0.773107   2.863 0.004384 ** 
## monthfeb     0.365346   0.551503   0.662 0.507993    
## monthjan    -0.682928   1.076240  -0.635 0.526020    
## monthjul     0.436781   0.683625   0.639 0.523176    
## monthjun    -0.394812   0.630963  -0.626 0.531786    
## monthmar    -0.230436   0.493123  -0.467 0.640495    
## monthmay     0.736262   1.056583   0.697 0.486240    
## monthnov    -0.967716   1.428329  -0.678 0.498400    
## monthoct     1.315924   0.957744   1.374 0.170079    
## monthsep     1.360462   0.887515   1.533 0.125952    
## DMC          0.004546   0.001774   2.563 0.010675 *  
## DC          -0.002498   0.001231  -2.029 0.042954 *  
## temp         0.023922   0.014498   1.650 0.099583 .  
## wind         0.058852   0.036372   1.618 0.106294    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.345 on 487 degrees of freedom
## Multiple R-squared:  0.1268, Adjusted R-squared:  0.07477 
## F-statistic: 2.438 on 29 and 487 DF,  p-value: 6e-05

# retirar coordenadas X e Y
fit4 <- lm(larea ~ ., data = dat[,-c(1:3,13)]) # tirando a colunas 1,2, 13
summary(fit4)

## 
## Call:
## lm(formula = larea ~ ., data = dat[, -c(1:3, 13)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6311 -1.0992 -0.6231  0.9036  5.5827 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.0056581  1.3793960   0.004   0.9967  
## daymon       0.1524205  0.2266247   0.673   0.5015  
## daysat       0.2844762  0.2183396   1.303   0.1932  
## daysun       0.1898639  0.2116043   0.897   0.3700  
## daythu       0.0160972  0.2399185   0.067   0.9465  
## daytue       0.2529683  0.2337933   1.082   0.2798  
## daywed       0.1588203  0.2463140   0.645   0.5194  
## FFMC         0.0089740  0.0146007   0.615   0.5391  
## DMC          0.0012857  0.0014810   0.868   0.3857  
## DC           0.0002961  0.0003599   0.823   0.4111  
## ISI         -0.0240272  0.0171792  -1.399   0.1625  
## temp        -0.0005126  0.0176219  -0.029   0.9768  
## RH          -0.0057340  0.0053279  -1.076   0.2823  
## wind         0.0778966  0.0371443   2.097   0.0365 *
## rain         0.0853377  0.2149473   0.397   0.6915  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.4 on 502 degrees of freedom
## Multiple R-squared:  0.02508,    Adjusted R-squared:  -0.002109 
## F-statistic: 0.9224 on 14 and 502 DF,  p-value: 0.5342

# aplicando stepwise
fit5 <- step(fit4, trace = 0)
summary(fit5)

## 
## Call:
## lm(formula = larea ~ DMC + RH + wind, data = dat[, -c(1:3, 13)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5093 -1.1071 -0.6537  0.9216  5.7712 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.9129363  0.2433065   3.752 0.000195 ***
## DMC          0.0017552  0.0009657   1.817 0.069731 .  
## RH          -0.0055830  0.0037785  -1.478 0.140138    
## wind         0.0624134  0.0345109   1.809 0.071112 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.392 on 513 degrees of freedom
## Multiple R-squared:  0.01425,    Adjusted R-squared:  0.008485 
## F-statistic: 2.472 on 3 and 513 DF,  p-value: 0.06101

Exercício 10.5 Nos dados do Exemplo 10.13, considere utilizar o modelo beta inflated implementado em (Stasinopoulos and Rigby 2007). Dica: use a função gamlss::gamlss com argumento family = BEINF(). Veja https://filipezabala.com/voice/ para mais detalhes da implementação.

Exemplo 10.14 No Capítulo 12, (Dalgaard 2008) utiliza a base de dados cystfibr, um banco de dados sobre capacidade respiratória discutido por (Altman 1990). Abaixo estão as descrições das variáveis observadas, com volumes indicados em decilitros.

Idade: idade em anos
Sexo: 0 = masculino, 1 = feminino
IMC: Índice de Massa Corporal (Peso/Altura\(^2\)) como um percentual da mediana de indivíduos normais por idade
VEF1: Volume de Expiração Forçado em 1 segundo
VR: Volume Residual, o volume restante de ar nos pulmões após uma expiração forçada
CRF: Capacidade Residual Funcional, o volume nos pulmões ao final da posição normal de expiração
CPT: Capacidade Pulmonar Total
PEmax: Pressão expiratória estática máxima, variável dependente que indica a saúde do sistema respiratório (maior, melhor)

cystfibr <- read.table('https://filipezabala.com/data/cystfibr.dat', head = T)
pairs(cystfibr, gap=0, cex.labels=0.9)  # matriz de correlação

fit <- lm(Pemax ~ ., data = cystfibr)   # ajustando modelo saturado (com todas as candidatas)
summary(fit)

## 
## Call:
## lm(formula = Pemax ~ ., data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.776 -17.540   3.971  14.584  36.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.1121   185.0664  -0.055    0.957
## Idade        -0.4131     4.6791  -0.088    0.931
## Sexo         -4.3718    16.4014  -0.267    0.793
## Altura        0.1883     0.8511   0.221    0.828
## Peso          1.2040     1.4592   0.825    0.422
## IMC          -0.3796     0.5801  -0.654    0.523
## VEF1          0.8295     1.1885   0.698    0.496
## VR            0.2593     0.2040   1.271    0.223
## CRF          -0.4793     0.5165  -0.928    0.368
## TLC           0.5349     0.4802   1.114    0.283
## 
## Residual standard error: 26.94 on 15 degrees of freedom
## Multiple R-squared:  0.599,  Adjusted R-squared:  0.3584 
## F-statistic:  2.49 on 9 and 15 DF,  p-value: 0.05706

step(fit, trace=0)  # stepwise para seleção de variáveis

## 
## Call:
## lm(formula = Pemax ~ Peso + VEF1 + VR + CRF + TLC, data = cystfibr)
## 
## Coefficients:
## (Intercept)         Peso         VEF1           VR          CRF          TLC  
##    -32.0892       1.2631       0.9593       0.3113      -0.5624       0.5943

fit2 <- lm(Pemax ~ Peso + VEF1 + VR + CRF + TLC, data = cystfibr) # modelo recomendado por step
summary(fit2)

## 
## Call:
## lm(formula = Pemax ~ Peso + VEF1 + VR + CRF + TLC, data = cystfibr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.310 -16.724   0.722  19.260  32.688 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -32.0892    57.1098  -0.562  0.58076   
## Peso          1.2631     0.3626   3.484  0.00248 **
## VEF1          0.9593     0.6121   1.567  0.13358   
## VR            0.3113     0.1483   2.099  0.04940 * 
## CRF          -0.5624     0.3289  -1.710  0.10351   
## TLC           0.5943     0.4234   1.404  0.17653   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.58 on 19 degrees of freedom
## Multiple R-squared:  0.5772, Adjusted R-squared:  0.4659 
## F-statistic: 5.187 on 5 and 19 DF,  p-value: 0.003615

fit3 <- lm(formula = Pemax ~ Peso + VEF1 + VR - 1, data = cystfibr) # modelo ajustado manualmente
summary(fit3)

## 
## Call:
## lm(formula = Pemax ~ Peso + VEF1 + VR - 1, data = cystfibr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44.71 -18.10  -0.36  19.39  41.44 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## Peso  1.25825    0.30688   4.100 0.000473 ***
## VEF1  0.94792    0.41104   2.306 0.030903 *  
## VR    0.11159    0.03432   3.251 0.003660 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.84 on 22 degrees of freedom
## Multiple R-squared:  0.9585, Adjusted R-squared:  0.9528 
## F-statistic: 169.4 on 3 and 22 DF,  p-value: 2.383e-15

par(mfrow = c(2,2)) # dividindo a janela gráfica
plot(fit)

plot(fit2)

plot(fit3)

Exemplo 10.15 No site https://archive.ics.uci.edu/ml/datasets/Energy+efficiency está disponível uma análise de energia feita por (Tsanas and Xifara 2012) usando 12 formas diferentes de construção simuladas no Ecotect. Os edifícios diferem em relação à área envidraçada, à distribuição da área envidraçada e à orientação, entre outros parâmetros. Foram simuladas várias configurações como funções das características acima mencionadas para obter 768 formas de construção. O conjunto de dados detalhado a seguir compreende 768 amostras e 8 características (X1 a X8), com o objetivo de prever duas respostas reais (Y1 e Y2).

X1: Compactação Relativa
X2: Superfície
X3: Área da parede
X4: Área do telhado
X5: Altura total
X6: Orientação
X7: Área de Envidraçamento
X8: Distribuição da Área de Envidraçamento
Y1: Carga de aquecimento
Y2: Carga de resfriamento

Primeiramente é feita a leitura dos dados.

# libs
library(readxl)

# arquivo
url1 = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx'
download.file(url1, 'temp.xlsx', mode = 'wb')
energy = read_excel('temp.xlsx')

# dando uma olhada nas variáveis
str(energy)

## tibble [768 × 10] (S3: tbl_df/tbl/data.frame)
##  $ X1: num [1:768] 0.98 0.98 0.98 0.98 0.9 0.9 0.9 0.9 0.86 0.86 ...
##  $ X2: num [1:768] 514 514 514 514 564 ...
##  $ X3: num [1:768] 294 294 294 294 318 ...
##  $ X4: num [1:768] 110 110 110 110 122 ...
##  $ X5: num [1:768] 7 7 7 7 7 7 7 7 7 7 ...
##  $ X6: num [1:768] 2 3 4 5 2 3 4 5 2 3 ...
##  $ X7: num [1:768] 0 0 0 0 0 0 0 0 0 0 ...
##  $ X8: num [1:768] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Y1: num [1:768] 15.6 15.6 15.6 15.6 20.8 ...
##  $ Y2: num [1:768] 21.3 21.3 21.3 21.3 28.3 ...

As funções base::summary e base::sapply podem auxiliar.

# descritivas
summary(energy)     # Medidas de posição por coluna

##        X1               X2              X3              X4              X5             X6      
##  Min.   :0.6200   Min.   :514.5   Min.   :245.0   Min.   :110.2   Min.   :3.50   Min.   :2.00  
##  1st Qu.:0.6825   1st Qu.:606.4   1st Qu.:294.0   1st Qu.:140.9   1st Qu.:3.50   1st Qu.:2.75  
##  Median :0.7500   Median :673.8   Median :318.5   Median :183.8   Median :5.25   Median :3.50  
##  Mean   :0.7642   Mean   :671.7   Mean   :318.5   Mean   :176.6   Mean   :5.25   Mean   :3.50  
##  3rd Qu.:0.8300   3rd Qu.:741.1   3rd Qu.:343.0   3rd Qu.:220.5   3rd Qu.:7.00   3rd Qu.:4.25  
##  Max.   :0.9800   Max.   :808.5   Max.   :416.5   Max.   :220.5   Max.   :7.00   Max.   :5.00  
##        X7               X8              Y1              Y2       
##  Min.   :0.0000   Min.   :0.000   Min.   : 6.01   Min.   :10.90  
##  1st Qu.:0.1000   1st Qu.:1.750   1st Qu.:12.99   1st Qu.:15.62  
##  Median :0.2500   Median :3.000   Median :18.95   Median :22.08  
##  Mean   :0.2344   Mean   :2.812   Mean   :22.31   Mean   :24.59  
##  3rd Qu.:0.4000   3rd Qu.:4.000   3rd Qu.:31.67   3rd Qu.:33.13  
##  Max.   :0.4000   Max.   :5.000   Max.   :43.10   Max.   :48.03

sapply(energy, sd)  # Desvios padrão (amostrais) por coluna

##         X1         X2         X3         X4         X5         X6         X7         X8         Y1         Y2 
##  0.1057775 88.0861161 43.6264814 45.1659502  1.7511404  1.1187626  0.1332206  1.5509597 10.0902040  9.5133056

As funções base::is.na e VIM::aggr (Kowarik and Templ 2016) permitem analisar os dados faltantes.

sum(is.na(energy))

## [1] 0

VIM::aggr(energy)

Para dados multivariados é usual a inspeção visual através da matriz de dispersão, facilmente obtida via graphics::pairs.

pairs(energy)

10.5.2 Bayesiano

https://alpopkes.com/posts/machine_learning/bayesian_linear_regression/

Exemplo 10.16 Ajustando os modelos lineares clássico e bayesiano para o banco de dados datasets::cars.

ggplot2::qplot(speed, dist, data = cars)

# Regressão linear clássica por MQO
fit_lm <- lm(dist ~ speed, data = cars)
summary(fit_lm)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

par(mfrow=c(2,2))
plot(fit_lm, which = 1:4)

# Regressão linear bayesiana via `rstanarm::stan_glm`
library(rstan)
library(rstanarm)
options(mc.cores = parallel::detectCores())

fit_stan <- rstanarm::stan_glm(dist ~ speed, data = cars, family = gaussian)
rstan::stan_trace(fit_stan, pars = c('(Intercept)', 'speed', 'sigma'))

summary(fit_stan)

## 
## Model Info:
##  function:     stan_glm
##  family:       gaussian [identity]
##  formula:      dist ~ speed
##  algorithm:    sampling
##  sample:       4000 (posterior sample size)
##  priors:       see help('prior_summary')
##  observations: 50
##  predictors:   2
## 
## Estimates:
##               mean   sd    10%   50%   90%
## (Intercept) -17.5    6.9 -26.5 -17.4  -8.6
## speed         3.9    0.4   3.4   3.9   4.5
## sigma        15.7    1.6  13.7  15.6  17.8
## 
## Fit Diagnostics:
##            mean   sd   10%   50%   90%
## mean_PPD 43.0    3.2 38.9  43.1  47.1 
## 
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
## 
## MCMC diagnostics
##               mcse Rhat n_eff
## (Intercept)   0.1  1.0  3698 
## speed         0.0  1.0  3611 
## sigma         0.0  1.0  3245 
## mean_PPD      0.1  1.0  3725 
## log-posterior 0.0  1.0  1879 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

rstanarm::pp_check(fit_stan)

# rstanarm::posterior_vs_prior(fit_stan, group_by_parameter = TRUE, 
                             # pars = c('(Intercept)', 'speed', 'sigma'))

Exercício 10.6 Veja as seguintes documentações:
a. ?rstanarm::print.stanreg.
b. ?rstanarm::prior_summary.
c. ?rstanarm::summary.stanreg.
d. ?rstanarm::pp_check.
e. ?rstanarm::launch_shinystan.
f. ?rstanarm::posterior_vs_prior.

10.5.3 Multicolinearidade

Segundo (B. S. Everitt and Skrondal 2006), multicolinearidade é um termo utilizado na análise de regressão para indicar situações em que as variáveis explicativas estão relacionadas por uma função linear, impossibilitando a estimação dos coeficientes de regressão.

Exemplo 10.17 Dalpiaz (2022) - Applied Statistics with R traz alguns exemplos sobre colinearidade.

gen_exact_collin_data <- function(num_samples = 100) {
  x1 = rnorm(n = num_samples, mean = 80, sd = 10)
  x2 = rnorm(n = num_samples, mean = 70, sd = 5)
  x3 = 2 * x1 + 4 * x2 + 3
  y = 3 + x1 + x2 + rnorm(n = num_samples, mean = 0, sd = 1)
  data.frame(y, x1, x2, x3)
}

set.seed(42)
col_data <- gen_exact_collin_data()
head(col_data)

##          y       x1       x2       x3
## 1 170.7135 93.70958 76.00483 494.4385
## 2 152.9106 74.35302 75.22376 452.6011
## 3 152.7866 83.63128 64.98396 430.1984
## 4 170.6306 86.32863 79.24241 492.6269
## 5 152.3320 84.04268 66.66613 437.7499
## 6 151.3155 78.93875 70.52757 442.9878

pairs(col_data, col = "dodgerblue")

cor(col_data)

##            y         x1         x2        x3
## y  1.0000000 0.91124848 0.43068603 0.9557550
## x1 0.9112485 1.00000000 0.03127984 0.7638609
## x2 0.4306860 0.03127984 1.00000000 0.6689586
## x3 0.9557550 0.76386089 0.66895856 1.0000000

symnum(cor(col_data))

##    y x1 x2 x3
## y  1         
## x1 * 1       
## x2 .    1    
## x3 B ,  ,  1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

col_data_fit <- lm(y ~ x1 + x2 + x3, data = col_data)
summary(col_data_fit)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = col_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.57662 -0.66188 -0.08253  0.63706  2.52057 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.957336   1.735165   1.704   0.0915 .  
## x1          0.985629   0.009788 100.702   <2e-16 ***
## x2          1.017059   0.022545  45.112   <2e-16 ***
## x3                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 97 degrees of freedom
## Multiple R-squared:  0.9923, Adjusted R-squared:  0.9921 
## F-statistic:  6236 on 2 and 97 DF,  p-value: < 2.2e-16

X <- cbind(1, as.matrix(col_data[,-1]))

fit1 <- lm(y ~ x1 + x2, data = col_data)
fit2 <- lm(y ~ x1 + x3, data = col_data)
fit3 <- lm(y ~ x2 + x3, data = col_data)
all.equal(fitted(fit1), fitted(fit2))

## [1] TRUE

all.equal(fitted(fit2), fitted(fit3))

## [1] TRUE

10.5.3.1 Pacote `mctest`

O pacote mctest (Imdad et al. 2019) pode ser utilizado para calcular medidas de diagnóstico de multicolinearidade populares e amplamente utilizadas.

Exemplo 10.18 (Woods, Steinour, and Starke 1932) estudam a associação do calor evoluído durante a fabricação de 13 misturas de cimento de quatro ingredientes básicos. Cada porcentagem de ingrediente parece ser arredondada para um número inteiro. A soma das quatro porcentagens de mistura varia 95% a 99%. Se todas as quatro variáveis \(X\) do regressor sempre somassem 100%, a matriz \(X\) centrada seria então de posto 3. A primeira coluna é a resposta e as quatro colunas restantes são os preditores.

Y: Calor (cals/gm) evoluído na composição.
X1: Porcentagem inteira de \(3Ca O Al_2 O_3\) na mistura.
X2: Porcentagem inteira de \(3Ca O Si O_2\) na mistura.
X3: Porcentagem inteira de \(4Ca O Al_2 O_3 Fe_2 O_3\) na mistura.
X4: Porcentagem inteira de \(2Ca O Si O_2\) na mistura.

library(mctest)
data(Hald) ## Hald Cement data
model <- lm(y ~ ., data = as.data.frame(Hald))
## Medidas de diagnóstico geral e autovalores com intercepto
mctest(model)

## 
## Call:
## omcdiag(mod = mod, Inter = TRUE, detr = detr, red = red, conf = conf, 
##     theil = theil, cn = cn)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.0011         1
## Farrar Chi-Square:        67.2825         1
## Red Indicator:             0.5414         1
## Sum of Lambda Inverse:   622.3006         1
## Theil's Method:            0.9981         1
## Condition Number:        249.5783         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

## Medidas de diagnóstico individuais
mctest(model, type = 'i')

## 
## Call:
## imcdiag(mod = mod, method = method, corr = FALSE, vif = vif, 
##     tol = tol, conf = conf, cvif = cvif, ind1 = ind1, ind2 = ind2, 
##     leamer = leamer, all = all)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##         VIF    TOL       Wi        Fi Leamer    CVIF Klein   IND1   IND2
## X1  38.4962 0.0260 112.4886  187.4811 0.1612 -0.5846     0 0.0087 0.9875
## X2 254.4232 0.0039 760.2695 1267.1158 0.0627 -3.8635     1 0.0013 1.0099
## X3  46.8684 0.0213 137.6052  229.3419 0.1461 -0.7117     0 0.0071 0.9923
## X4 282.5129 0.0035 844.5386 1407.5643 0.0595 -4.2900     1 0.0012 1.0103
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## X1 , X2 , X3 , X4 , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.9824 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

10.5.3.2 Fator de Inflação da Variância (VIF)

(Chatterjee and Hadi 2012, 248–51) definem o Fator de Inflação da Variância (VIF - Variance Inflation Factor) para a variável \(X_j\) tal que

\[\begin{equation} VIF_j = \frac{1}{1-r_{j}^2}, \; j=1,\ldots,p \tag{10.17} \end{equation}\]

onde \(r_{j}^2\) é o coeficiente de determinação que resulta quando a variável preditora \(X_j\) é regredida contra todas as outras variáveis preditoras. Se \(X_j\) tiver um forte relacionamento linear com as outras variáveis preditoras, \(r_{j}^2\) deve ficar próximo de 1 e \(VIF_j\) seria grande. Como regra de bolso, valores de \(VIF\) maiores que 10 geralmente são tomados como um sinal de que os dados têm problema de colinearidade.

10.5.3.3 Técnica de Farrar and Glauber

(Farrar and Glauber 1967)

10.5.3.4 Diagnóstico de Belsley, Kuh e Welsch

(Belsley, Kuh, and Welsch 2004, 112–52) sugerem um procedimento de diagnóstico para detectar colinearidade em matrizes \(X\).

Referências

Akaike, Hirotugu. 1974. “A New Look at the Statistical Model Identification.” In Selected Papers of Hirotugu Akaike, 215–22. Springer.

Altman, Douglas G. 1990. Practical Statistics for Medical Research. CRC press.

Belsley, David A, Edwin Kuh, and Roy E Welsch. 2004. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons.

Chatterjee, Samprit, and Ali S Hadi. 2012. Regression Analysis by Example. John Wiley & Sons.

Cortez, Paulo, and Anı́bal de Jesus Raimundo Morais. 2007. “A Data Mining Approach to Predict Forest Fires Using Meteorological Data.” http://www3.dsi.uminho.pt/pcortez/fires.pdf.

Dalgaard, Peter. 2008. Introductory Statistics with r. Springer New York. http://www.academia.dk/BiologiskAntropologi/Epidemiologi/PDF/Introductory_Statistics_with_R__2nd_ed.pdf.

Efroymson, MA. 1960. “Multiple Regression Analysis.” Mathematical Methods for Digital Computers, 191–203.

Everitt, Brian S, and Anders Skrondal. 2006. “The Cambridge Dictionary of Statistics.” https://www.cambridge.org/9780521766999.

Farrar, Donald E, and Robert R Glauber. 1967. “Multicollinearity in Regression Analysis: The Problem Revisited.” The Review of Economic and Statistics, 92–107. https://www.jstor.org/stable/pdf/1937887.pdf?casa_token=JXwdzUDd87wAAAAA:tRc7XNasgpfWvCiiLJ-cu19tCEU31uqBau0uvIKg1QByBg5aeNxHiwCUj-uprWrlddYVQNL7oj4Q0Lumm6bbddU6FV-DxWvdnYEnsZxzUWWk-Rt6bq4.

Friedman, Jerome H, and Werner Stuetzle. 1981. “Projection Pursuit Regression.” Journal of the American Statistical Association 76 (376): 817–23. https://doi.org/10.2307/2287576.

Hastie, Trevor, and Robert Tibshirani. 1986. “[Generalized Additive Models]: Comment.” Statistical Science 1 (3): 297–318.

Imdad, M. U., M. Aslam, S. Altaf, and A. Munir. 2019. “Some New Diagnostics of Multicollinearity in Linear Regression Model.” Sains Malaysiana 48(9): 2051–60. http://dx.doi.org/10.17576/jsm-2019-4809-26.

Kowarik, Alexander, and Matthias Templ. 2016. “Imputation with the R Package VIM.” Journal of Statistical Software 74 (7): 1–16. https://doi.org/10.18637/jss.v074.i07.

McCullagh, Peter, and John Ashworth Nelder. 1989. Generalized Linear Models. Chapman Hall, London. 2nd ed. http://www.utstat.toronto.edu/~brunner/oldclass/2201s11/readings/glmbook.pdf.

Modigliani, Franco, and Lucas Papademos. 1975. “Targets for Monetary Policy in the Coming Year.” Brookings Papers on Economic Activity 1975 (1): 141–65. http://pombo.free.fr/modipapa.pdf.

Neter, John, Michael H Kutner, Christopher J Nachtsheim, and William Wasserman. 2005. Applied Linear Statistical Models. 5th ed. McGraw Hill/Irwin New York. https://www.ime.unicamp.br/~dias/John%20Neter%20Applied%20linear%20regression%20models.pdf.

O’Hagan, Anthony, and Jonathan Forster. 1994. “Kendall’s Advanced Theory of Statistics: Volume 2B.” Bayesian Inference.

Paulino, Carlos Daniel Mimoso, Maria Antónia Amaral Turkman, and Bento Murteira. 2003. Estatı́stica Bayesiana. Fundação Calouste Gulbenkian, Lisboa. http://primo-pmtna01.hosted.exlibrisgroup.com/PUC01:PUC01:puc01000334509.

Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics, 461–64. https://doi.org/10.1214/aos/1176344136.

Stasinopoulos, D. Mikis, and Robert A. Rigby. 2007. “Generalized Additive Models for Location Scale and Shape (GAMLSS) in R.” Journal of Statistical Software 23 (7): 1–46. https://doi.org/10.18637/jss.v023.i07.

Tsanas, Athanasios, and Angeliki Xifara. 2012. “Accurate Quantitative Estimation of Energy Performance of Residential Buildings Using Statistical Machine Learning Tools.” Energy and Buildings 49: 560–67. http://people.maths.ox.ac.uk/tsanas/Preprints/ENB2012.pdf.

Woods, Hubert, Harold H Steinour, and Howard R Starke. 1932. “Effect of Composition of Portland Cement on Heat Evolved During Hardening.” Industrial & Engineering Chemistry 24 (11): 1207–14. https://pubs.acs.org/doi/pdf/10.1021/ie50275a002?casa_token=MxKTBQS6EngAAAAA:RHckY1tRKzRoRQBbw27GR6nQR819Jt7crvXsG2C9spIZBRdFENcdriskM4ba9QPo2qndXj_95kdpB9k.

“In the stepwise procedure, to be described below, intermediate results, which are not even recorded by normal calculation methods, are used to give valuable statistical information at each step in the calculation. These intermediate answers are also used to control the method of calculation.”↩︎
“An important property of the stepwise procedure is based on the facts that (a) a variable may be indicated to be significant in any early stage and thus enter the equation, and (b) after several other variables are added to the regression equation, the initial variable may be indicated to be insignificant. The insignificant variable will be removed from the regression equation before adding an additional variable. Therefore, only significant variables are included in the final regression.”↩︎