用回归分析代替方差分析

Substitute regression for ANOVA

发布于

2024年11月13日

要考察自变量X对连续变量Y的影响，一般而言，当自变量X是连续变量时，我们可以使用回归分析，用X预测Y；当自变量X是分类变量时，我们可以使用方差分析。事实上方差分析可以用回归分析代替。

本文将使用Keng程序包中的depress数据对此进行演示。本文将检验四个班级（class）抑郁（dm1）水平的差异，或者说班级对抑郁的影响。班级有四个：3，5，9，12。

1 方差分析

在SPSS中，我们可以使用Analyze: General Linear Model: Univariate进行被试间方差分析，检验dm1在class上的差异。为便于读者的理解，本文将SPSS的方差分析表重新整理如下：

Tests of Between-Subjects Effects
Source		Type III Sum of Squares	df	Mean Square	F	Sig.
Model	Intercept	335.863	1	335.863	2344.471	0.000
	class	0.722	3	0.241	1.679	0.177
Error		12.893	90	0.143
Total		355.620	94
Corrected Model	class	.722	3	0.241	1.679	0.177
Corrected Error		12.893	90	0.143
Corrected Total		13.615	93

你有没有感到奇怪：方差分析表中为何有截距？

2 基于虚拟编码的回归分析

要通过回归分析检验class这一多水平分类自变量对dm1的影响，我们不能将class直接当做连续变量进行回归分析，因为class的得分3、5、9、12之间的差值没有意义，我们可以将class进行虚拟编码，将其转化为有意义的predictor。对于分类水平为k的分类变量，我们需要将其编码为k-1个虚拟变量，这里我们将class编码为3个虚拟变量class5、class9、class12，编码方案如下：

虚拟编码
class	class5	class9	class12
3	0	0	0
5	1	0	0
9	0	1	0
12	0	0	1

注：class5 = 该班是5班吗；class9 = 该班是9班吗；class12 = 该班是12班吗。

我们使用ifelse()对class进行虚拟编码：

library(car)
library(Keng)
data(depress)
depress$class5 <- ifelse(depress$class == 5, 1, 0)
depress$class9 <- ifelse(depress$class == 9, 1, 0)
depress$class12 <- ifelse(depress$class == 12, 1, 0)

我们将3个虚拟变量class5、class9、class12同时纳入回归方程，它们共同反映了class对dm1的影响。与SPSS一致，lm()函数默认回归模型包含截距，下面两个回归分析所建立的模型是相同的。

fitDummy <- lm(dm1 ~ class5 + class9 + class12, depress)
summary(fitDummy)

## 
## Call:
## lm(formula = dm1 ~ class5 + class9 + class12, data = depress)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85263 -0.25345 -0.00304  0.24423  1.14655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.80345    0.07028  25.659   <2e-16 ***
## class5       0.11963    0.10222   1.170   0.2450    
## class9       0.24918    0.11171   2.231   0.0282 *  
## class12      0.09655    0.11001   0.878   0.3825    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3785 on 90 degrees of freedom
## Multiple R-squared:  0.053,  Adjusted R-squared:  0.02144 
## F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

fitDummy <- lm(dm1 ~ 1 + class5 + class9 + class12, depress)
summary(fitDummy)

## 
## Call:
## lm(formula = dm1 ~ 1 + class5 + class9 + class12, data = depress)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85263 -0.25345 -0.00304  0.24423  1.14655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.80345    0.07028  25.659   <2e-16 ***
## class5       0.11963    0.10222   1.170   0.2450    
## class9       0.24918    0.11171   2.231   0.0282 *  
## class12      0.09655    0.11001   0.878   0.3825    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3785 on 90 degrees of freedom
## Multiple R-squared:  0.053,  Adjusted R-squared:  0.02144 
## F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

事实上，我们可以将截距当做一个自变量1（所有被试在这个自变量上得分都为1）的回归系数。当回归模型不包含任何自变量时，回归模型的残差的平方和就是dm1的总平方和。我们用anova()计算回归模型的残差平方和。

fit0 <- lm(dm1 ~ - 1, depress)
anova(fit0)

## Analysis of Variance Table
## 
## Response: dm1
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 94 355.62  3.7832

可得，dm1的总平方和为355.62（df = 94），此即前文SPSS方差分析表中Total所对应的平方和。与class5、class9、class12一样，截距（自变量1）也能解释dm1的变异、分摊dm1的平方和。dm1的平方和可以分解为三部分：截距解释的平方和，class解释的平方和（class5、class9、class12共同解释的平方和），以及残差平方和。下面我们将这三个平方和分别计算出来。残差平方和的计算较为简单，我们只需要计算包含所有自变量的回归模型的残差的平方和，即fit2的残差平方和。

fit1 <- lm(dm1 ~ 1, depress)
anova_fit1 <- anova(fit1)
anova_fit1

## Analysis of Variance Table
## 
## Response: dm1
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 93 13.615  0.1464

可得，在加入截距后，残差的平方和减少为13.615（df = 93），此即前文SPSS方差分析表中Corrected Total所对应的平方和。

fitDummy <- lm(dm1 ~ 1 + class5 + class9 + class12, depress)
anova_fitDummy <- anova(fitDummy)
anova_fitDummy

## Analysis of Variance Table
## 
## Response: dm1
##           Df  Sum Sq Mean Sq F value Pr(>F)  
## class5     1  0.0088 0.00878  0.0613 0.8050  
## class9     1  0.6025 0.60248  4.2056 0.0432 *
## class12    1  0.1103 0.11034  0.7703 0.3825  
## Residuals 90 12.8932 0.14326                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

在加入class5 、class9、class12后，我们得到了完整的回归模型，此时残差的平方和减少为12.8932，此即前文SPSS方差分析表中Error所对应的平方和。

我们使用模型比较的方法，通过残差的变化量计算class的平方和：

anova(fit1, fitDummy)

## Analysis of Variance Table
## 
## Model 1: dm1 ~ 1
## Model 2: dm1 ~ 1 + class5 + class9 + class12
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     93 13.615                           
## 2     90 12.893  3   0.72161 1.6791 0.1771

可得，class5、class9、class12总共解释/分摊/减少的平方和为anova_fit1[,2] - anova_fitDummy[4, 2] = 0.7216098（df = 3），此即前文SPSS方差分析表中class所对应的平方和，亦即Corrected Model所对应的平方和。

使用同样的模型比较的方法，我们可以得到截距所解释/分摊/减少的的平方和：

fit2 <- lm(dm1 ~ -1 + class5 + class9 + class12, depress)
anova(fit2)

## Analysis of Variance Table
## 
## Response: dm1
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## class5     1  96.154  96.154  81.613 2.682e-14 ***
## class9     1  80.053  80.053  67.947 1.200e-12 ***
## class12    1  72.200  72.200  61.281 8.673e-12 ***
## Residuals 91 107.214   1.178                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit2, fitDummy)

## Analysis of Variance Table
## 
## Model 1: dm1 ~ -1 + class5 + class9 + class12
## Model 2: dm1 ~ 1 + class5 + class9 + class12
##   Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
## 1     91 107.214                                 
## 2     90  12.893  1     94.32 658.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

可得，截距所解释/分摊/减少的的平方和为94.32。奇怪，为何截距的平方和与前文SPSS方差分析表中Intercept对应的平方和335.863不一致？这是因为虚拟编码不是正交编码，因此截距、所有自变量的平方和的和不等于总平方和。

我们可以使用car程序包中的Anova()函数直接计算截距与各自变量所解释的平方和：

Anova(fitDummy, type = 3)

## Anova Table (Type III tests)
## 
## Response: dm1
##             Sum Sq Df  F value Pr(>F)    
## (Intercept) 94.320  1 658.3971 <2e-16 ***
## class5       0.196  1   1.3695 0.2450    
## class9       0.713  1   4.9754 0.0282 *  
## class12      0.110  1   0.7703 0.3825    
## Residuals   12.893 90                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

从上面的方差分析表可知，class5、class9、class12的平方和的和不等于class的平方和，原因同上：虚拟编码不是正交编码，平方和的和不等于总平方和。尽管基于虚拟编码的回归分析具有上述缺点，但是：（1）虚拟编码的回归系数易于解释；（2）class5、class9、class12总共的平方和与class的平方和是相等的，残差的平方和是相等的，总平方和是相等的，具体见前文的计算。

3 正交编码与平衡设计

所谓正交编码，即编码是正交的、独立的，而不是相关的。什么样的编码方案是正交编码？

我们可以直接计算编码的相关，正交编码的相关为0。以前文虚拟编码为例，下面计算虚拟编码的相关：

Dummy coding
class	class5	class9	class12
3	0	0	0
5	1	0	0
9	0	1	0
12	0	0	1

dat <- data.frame(
  class5 = c(0, 1, 0, 0),
  class9 = c(0, 0, 1, 0),
  class12 = c(0, 0, 0, 1)
)
cor(dat)

##             class5     class9    class12
## class5   1.0000000 -0.3333333 -0.3333333
## class9  -0.3333333  1.0000000 -0.3333333
## class12 -0.3333333 -0.3333333  1.0000000

可见，虚拟编码不是正交编码。

编码方案
x	c1	c2	c3
1	L₁₁	L₁₂	L₁₃
2	L₂₁	L₂₂	L₂₃
3	L₃₁	L₃₂	L₃₃
4	L₄₁	L₄₂	L₄₃

在上面的编码方案中，X表示原始的多分类自变量，c1、c2、c3表示编码，L表示编码系数。L有两个下标，第一个下标表示X的水平，第二个下标表示c的序号。正交编码需满足以下两个条件：

编码变量的系数之和等于0。对c1而言，L11 + L21 + L31 + L41 = 0；c2、c3也应如此。
x的所有水平上，两个编码变量的系数之积的和等于0。对c1与c2而言，L11*L12 + L21*L22 + L31*L32 + L41*L42 = 0；c1与c3、c2与c3也应如此。

除了正交编码，回归分析的平方和若要与ANOVA一致，则还需满足平衡设计这一条件。平衡设计指的是各组的样本量相等。在本文的例子中，平衡设计即要求4个班级的样本量相等。

4 非平衡设计条件下基于Helmert编码的回归分析

当研究设计是平衡设计，即各组样本量相等时，Helmert编码是正交编码，截距、所有自变量的平方和的和等于总平方和。在本研究中，各class人数不等，本研究不是平衡设计。Helmert编码方案如下：

Helmert coding
class	H3	H5	H9
3	3	0	0
5	-1	2	0
9	-1	-1	1
12	-1	-1	-1

contr.helmert(4)

##   [,1] [,2] [,3]
## 1   -1   -1   -1
## 2    1   -1   -1
## 3    0    2   -1
## 4    0    0    3

depress$H3 <- car::recode(depress$class, "3 = 3; else = -1")
depress$H5 <- car::recode(depress$class, "3 = 0; 5 = 2; else = -1")
depress$H9 <- car::recode(depress$class, "c(3,5) = 0; 9 = 1; 12 = -1")

我们采用Helmert编码进行回归分析，并采用Anova()计算截距与自变量的平方和：

fitHelmert <- lm(dm1 ~ 1 + H3 + H5 + H9, depress)
summary(fitHelmert)

## 
## Call:
## lm(formula = dm1 ~ 1 + H3 + H5 + H9, data = depress)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85263 -0.25345 -0.00304  0.24423  1.14655 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.91979    0.03965  48.420   <2e-16 ***
## H3          -0.03878    0.02119  -1.830   0.0706 .  
## H5          -0.01775    0.03195  -0.555   0.5799    
## H9           0.07632    0.06063   1.259   0.2114    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3785 on 90 degrees of freedom
## Multiple R-squared:  0.053,  Adjusted R-squared:  0.02144 
## F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

car::Anova(fitHelmert, type = 3)

## Anova Table (Type III tests)
## 
## Response: dm1
##             Sum Sq Df   F value  Pr(>F)    
## (Intercept) 335.86  1 2344.4705 < 2e-16 ***
## H3            0.48  1    3.3486 0.07057 .  
## H5            0.04  1    0.3086 0.57994    
## H9            0.23  1    1.5845 0.21137    
## Residuals    12.89 90                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

可见，截距、残差的平方和与前文SPSS方差分析表中的平方和一致，classH5、classH9、classH12的平方和的和与前文SPSS方差分析表中class的平方和一致。但是，截距、残差、所有自变量的平方和的和（335.86 + 0.20 + 0.51 + 0.01 ）与前文SPSS方差分析表中Total平方和不一致，这是因为本研究中class不符合平衡设计。

5 平衡设计条件下基于Helmert编码的回归分析

当研究设计是平衡设计（即各组样本量相等），对类别变量进行正交编码，例如Helmert编码，那么，回归分析中截距、所有自变量的平方和的和等于总平方和。读者可以使用Keng程序包中的depress数据自行对此进行试验。