多分类自变量的编码方案

Coding categorical predictors

发布于

2024年11月15日

在回归分析中,若自变量是多分类的类别变量,我们需要将其进行编码,然后将其纳入回归方程。如果该分类变量有k个水平,我们需要将其编码为k-1个变量。

例如,要通过回归分析检验class这一多分类自变量对dm1(depression)的影响,我们不能将class直接当做连续变量进行回归分析,因为class的得分3、5、9、12之间的差值、距离没有实际意义,我们需要将class进行编码,将其转化为有意义的predictor。

1 PROCESS中的编码方案

PROCESS中有4种编码方案:

  • Indicator
  • Sequential
  • Helmert
  • Effect

其中,indicator即为dummy coding。

1.1 Dummy coding

我们将class进行虚拟编码,其编码方案如下:

虚拟编码
class class5 class9 class12
3 0 0 0
5 1 0 0
9 0 1 0
12 0 0 1

注:class5 = 该班是5班吗;class9 = 该班是9班吗;class12 = 该班是12班吗。

class被编码为三个虚拟变量class5、class9、class12,在回归分析中,我们将class5、class9、class12同时纳为自变量,它们共同反映了class对因变量的影响。该编码方案以3班为对照组,将5、9、12班分别与3班进行比较。就回归系数的含义而言,该回归模型dm1 ~ intercept*1 + c5*class5 + c9*class9 + c12*class12的截距intercept表示自变量都为0时dm1的平均水平,当class5、class9、class12都为0表示3班,所以intercept为3班的平均水平。class5 的回归系数c5表示其他自变量保持不变时,class5从0变为1所导致的dm1的变化量,当class为3和5时,class9和class12的取值保持不变,都为0,此时class5从0变为1表示班级从3班变为5班,因此class5的回归系数c5表示5班与3班的差异。同理,class9的回归系数表示9班与3班的差异,class12的回归系数表示12班与3班的差异。

首先,加载程序包与数据。

# dataset depress in Keng package is used in this article
library(Keng)
data(depress)

之后,我们计算各组均值(即各班均值),并使用各组的均值计算回归分析中的系数。这一步的目的是核查回归分析的系数的含义。

# mean of class 3
class3mean <- mean(depress[depress$class == 3, "dm1"])
# mean of class 5
class5mean <- mean(depress[depress$class == 5, "dm1"])
# mean of class 9
class9mean <- mean(depress[depress$class == 9, "dm1"])
# mean of class 12
class12mean <- mean(depress[depress$class == 12, "dm1"])

# compute regression parameters through the means of classes
intercept <- class3mean
c5 <- class5mean - class3mean
c9 <- class9mean - class3mean
c12 <- class12mean - class3mean

我们将class这一个多分类变量编码为3个变量。

# dummy coding
depress$class5 <- ifelse(depress$class == 5, 1, 0)
depress$class9 <- ifelse(depress$class == 9, 1, 0)
depress$class12 <- ifelse(depress$class == 12, 1, 0)

输出基于四个均值计算得到的回归系数。

c(intercept = intercept, c5 = c5, c9 = c9, c12 = c12)
 intercept         c5         c9        c12 
1.80344828 0.11962865 0.24918330 0.09655172 

最后,进行回归分析,并输出其结果。注意将上一步计算得到的回归系数与下面的回归系数进行比较。

# regression analysis
summary(lm(dm1 ~ class5 + class9 + class12, depress))

Call:
lm(formula = dm1 ~ class5 + class9 + class12, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.80345    0.07028  25.659   <2e-16 ***
class5       0.11963    0.10222   1.170   0.2450    
class9       0.24918    0.11171   2.231   0.0282 *  
class12      0.09655    0.11001   0.878   0.3825    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

1.2 Sequential coding

Sequential coding将一个水平与后一个水平进行比较。下面的编码方案将5班与3班比较,将9班与5班比较,将12班与9班比较。该编码方案适用于顺序变量(例如:年级1-4)。

Sequential coding
class class5v3 class9v5 class12v9
3 0 0 0
5 1 0 0
9 1 1 0
12 1 1 1

将class5v3、class9v5、class12v9纳为自变量,得到回归方程dm1 = intercept*1 + c5v3*class5v3 + c9v5*class9v5 + c12v9*class12v9,其中intercept为3班的均值。class5v3的回归系数c5v3检验3班与5班的差异,c5v3等于(class5mean - class3mean)。

# recode() function in car package is used in this article
library(car)
Loading required package: carData
# Sequential coding using recode() in car package
depress$class5v3  <- recode(depress$class, "3 = 0; else = 1")
depress$class9v5  <- recode(depress$class, "c(3, 5) = 0; else = 1")
depress$class12v9 <- recode(depress$class, "12= 1; else = 0")

# compute regression parameters through the means of classes
intercept <- class3mean
c5v3 <- class5mean - class3mean
c9v5 <- class9mean - class5mean
c12v9 <- class12mean - class9mean
c(intercept = intercept, c5v3 = c5v3, c9v5 = c9v5, c12v9 = c12v9)
 intercept       c5v3       c9v5      c12v9 
 1.8034483  0.1196286  0.1295547 -0.1526316 
# regression analysis
summary(lm(dm1 ~ class5v3 + class9v5 + class12v9, depress))

Call:
lm(formula = dm1 ~ class5v3 + class9v5 + class12v9, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.80345    0.07028  25.659   <2e-16 ***
class5v3     0.11963    0.10222   1.170    0.245    
class9v5     0.12955    0.11424   1.134    0.260    
class12v9   -0.15263    0.12125  -1.259    0.211    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

与Repeated coding相比,Sequential coding的回归系数的含义是一样的,但截距的含义不同。使用Repeated coding时,截距表示4个班级的均值的均值;使用Sequential coding时,截距表示3班的均值。

1.3 Helmert coding

该编码方案适用于顺序变量(例如:年级1-4)。编码方案如下:

Helmert coding
class H3 H5 H9
3 3 0 0
5 -1 2 0
9 -1 -1 1
12 -1 -1 -1

注:H3 = 将3班与5、9、12班的均值进行对比;H5 = 将5班与9、12班的均值进行对比;H9 = 将3班与5、9、12班的均值进行对比。

我们将H3、H5、H9纳为回归方程的自变量,建立回归模型dm1 ~ intercept*1 + c3*H3 + c5*H5 + c9*H9。截距intercept表示四个班级的dm1的均值的均值。H3的回归系数c3检验3班与5、9、12班均值的均值的差异,该系数等于[(class3mean - (class5mean + class9mean + class12mean)/3)/4]。H5的回归系数c5检验5班与9、12班均值的均值的差异,H9的回归系数c9检验9班与12班的均值的差异。

Helmert coding与回归分析的操作如下:

# Helmert coding using recode() in car package
depress$H3 <- recode(depress$class, "3 = 3; else = -1")
depress$H5 <- recode(depress$class, "3 = 0; 5 = 2; else = -1")
depress$H9 <- recode(depress$class, "c(3, 5) = 0; 9 = 1; 12 = -1")

# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c3 <- (class3mean - (class5mean + class9mean + class12mean)/3)/4
c5 <- (class5mean - (class9mean + class12mean)/2)/3
c9 <- (class9mean - class12mean)/2
c(intercept = intercept, c3 = c3, c5 = c5, c9 = c9)
  intercept          c3          c5          c9 
 1.91978919 -0.03878031 -0.01774629  0.07631579 
# regression analysis
summary(lm(dm1 ~ H3 + H5 + H9, depress))

Call:
lm(formula = dm1 ~ H3 + H5 + H9, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.91979    0.03965  48.420   <2e-16 ***
H3          -0.03878    0.02119  -1.830   0.0706 .  
H5          -0.01775    0.03195  -0.555   0.5799    
H9           0.07632    0.06063   1.259   0.2114    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

我们可以对编码系数进行调整,使H3的回归系数c3等于[class3mean - (class5mean + class9mean + class12mean)/3]:

Helmert coding
class H3 H5 H9
3 3/4 0 0
5 -1/4 2/3 0
9 -1/4 -1/3 1/2
12 -1/4 -1/3 -1/2
# Helmert coding using recode() in car package
depress$H3 <- recode(depress$class, "3 = 3/4; else = -1/4")
depress$H5 <- recode(depress$class, "3 = 0; 5 = 2/3; else = -1/3")
depress$H9 <- recode(depress$class, "c(3, 5) = 0; 9 = 1/2; 12 = -1/2")

# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c3 <- class3mean - (class5mean + class9mean + class12mean)/3
c5 <- class5mean - (class9mean + class12mean)/2
c9 <- class9mean - class12mean
c(intercept = intercept, c3 = c3, c5 = c5, c9 = c9)
  intercept          c3          c5          c9 
 1.91978919 -0.15512122 -0.05323887  0.15263158 
# regression analysis
summary(lm(dm1 ~ H3 + H5 + H9, depress))

Call:
lm(formula = dm1 ~ H3 + H5 + H9, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.91979    0.03965  48.420   <2e-16 ***
H3          -0.15512    0.08477  -1.830   0.0706 .  
H5          -0.05324    0.09584  -0.555   0.5799    
H9           0.15263    0.12125   1.259   0.2114    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

1.4 Effect coding

Effect coding又被称为contrast coding,也被称为deviation coding,其编码方案如下:

Effect coding
class class3 class5 class9
3 1 0 0
5 0 1 0
9 0 0 1
12 -1 -1 -1

将class3、class5、class9纳为自变量,得到回归方程dm1 = intercept*1 + c3*class3 + c5*class5 + c9*class9,其中intercept为4个班的均值的均值。class3的回归系数c3将3班的均值与4个班的均值的均值进行比较,c3等于(class3mean - (class3mean + class5mean + class9mean + class12mean)/3)。class5的回归系数c5将5班的均值与4个班的均值的均值进行比较。class9的回归系数c9将9班的均值与4个班的均值的均值进行比较。

# Effect coding using recode() in car package
depress$class3 <- recode(depress$class, "3 = 1; 12 = -1; else = 0")
depress$class5 <- recode(depress$class, "5 = 1; 12 = -1; else = 0")
depress$class9 <- recode(depress$class, "9 = 1; 12 = -1; else = 0")

# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c3 <- class3mean - mean(c(class3mean, class5mean, class9mean, class12mean))
c5 <- class5mean - mean(c(class3mean, class5mean, class9mean, class12mean))
c9 <- class9mean - mean(c(class3mean, class5mean, class9mean, class12mean))
c(intercept = intercept, c3 = c3, c5 = c5, c9 = c9)
   intercept           c3           c5           c9 
 1.919789194 -0.116340919  0.003287729  0.132842384 
# regression analysis
summary(lm(dm1 ~ class3 + class5 + class9, depress))

Call:
lm(formula = dm1 ~ class3 + class5 + class9, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.919789   0.039649  48.420   <2e-16 ***
class3      -0.116341   0.063577  -1.830   0.0706 .  
class5       0.003288   0.065780   0.050   0.9602    
class9       0.132842   0.073089   1.818   0.0725 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

2 SPSS中的编码方案

  • Simple
  • Deviation
  • Difference
  • Helmert
  • Repeated
  • Polynomial

2.1 Simple coding

编码方案如下:

Simple coding
class class5 class9 class12
3 -1 -1 -1
5 3 -1 -1
9 -1 3 -1
12 -1 -1 3

注:class5 = 将5班与3班的均值进行对比;class9 = 将9班与3班的均值进行对比;class12 = 将12班与3班的均值进行对比。

我们将class5、class9、class12纳为回归方程的自变量,建立回归模型dm1 ~ intercept*1 + c5*class5 + c9*class9 + c12*class12。截距intercept表示四个班级的dm1的均值的均值。class5的回归系数c5检验5班与3班的差异的显著性,该系数等于5班均值与3班均值的差值的1/4[(class5mean - class3mean)/4]。class9的回归系数c9检验9班与3班的差异,class12的回归系数c12检验12班与3班的差异。

Simple coding与回归分析的操作如下:

# Simple coding
depress$class5 <- ifelse(depress$class == 5, 3, -1)
depress$class9 <- ifelse(depress$class == 9, 3, -1)
depress$class12 <- ifelse(depress$class == 12, 3, -1)

# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c5 <- (class5mean - class3mean)/4
c9 <- (class9mean - class3mean)/4
c12 <- (class12mean - class3mean)/4
c(intercept = intercept, c5 = c5, c9 = c9, c12 = c12)
 intercept         c5         c9        c12 
1.91978919 0.02990716 0.06229583 0.02413793 
# regression analysis
summary(lm(dm1 ~ class5 + class9 + class12, depress))

Call:
lm(formula = dm1 ~ class5 + class9 + class12, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.91979    0.03965  48.420   <2e-16 ***
class5       0.02991    0.02556   1.170   0.2450    
class9       0.06230    0.02793   2.231   0.0282 *  
class12      0.02414    0.02750   0.878   0.3825    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

我们可以对编码系数进行调整,使class5的回归系数c5等于5班均值与3班均值的差值[(class5mean - class3mean)/4]:

# Simple coding
depress$class5 <- ifelse(depress$class == 5, 3/4, -1/4)
depress$class9 <- ifelse(depress$class == 9, 3/4, -1/4)
depress$class12 <- ifelse(depress$class == 12, 3/4, -1/4)

# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c5 <- (class5mean - class3mean)
c9 <- (class9mean - class3mean)
c12 <- (class12mean - class3mean)
c(intercept = intercept, c5 = c5, c9 = c9, c12 = c12)
 intercept         c5         c9        c12 
1.91978919 0.11962865 0.24918330 0.09655172 
# regression analysis
summary(lm(dm1 ~ class5 + class9 + class12, depress))

Call:
lm(formula = dm1 ~ class5 + class9 + class12, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.91979    0.03965  48.420   <2e-16 ***
class5       0.11963    0.10222   1.170   0.2450    
class9       0.24918    0.11171   2.231   0.0282 *  
class12      0.09655    0.11001   0.878   0.3825    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

2.2 Helmert coding

Helmert coding前文已介绍过。

2.3 Difference coding

Difference coding又称Reverse Helmert Coding,其编码顺序与Helmert coding相反,Difference coding编码方案如下:

Difference coding
class D5 D9 D12
3 -1 -1 -1
5 1 -1 -1
9 0 2 -1
12 0 0 3

2.4 Polynomial coding

Polynomial coding适用于勾画变量的变化趋势,例如:从3班到5班到9班到12班,dm1是否表现出线性变化趋势呢?我们可以用Polynomial coding对此进行考察。class包含4个类别,需要编码为3个codes,在Polynomial coding中,这三个codes分别代表线性趋势、二次曲线趋势、三次曲线趋势。Polynomial coding编码方案如下:

Polynomial coding
class L Q C
3 -3 1 -1
5 -1 -1 3
9 1 -1 -3
12 3 1 1
# Polynomial coding using recode() in car package
depress$L <- recode(depress$class, "3 = -3; 5 = -1; 9 = 1; 12 = 3")
depress$Q <- recode(depress$class, "3 = 1; 5 = -1; 9 = -1; 12 = 1")
depress$C <- recode(depress$class, "3 = -1; 5 = 3; 9 = -3; 12 = 1")

# regression analysis
summary(lm(dm1 ~ L + Q + C, depress))

Call:
lm(formula = dm1 ~ L + Q + C, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.91979    0.03965  48.420   <2e-16 ***
L            0.02096    0.01746   1.200   0.2332    
Q           -0.06807    0.03965  -1.717   0.0895 .  
C           -0.01461    0.01800  -0.812   0.4192    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

2.5 Repeated coding

Repeated coding又被称为backward difference coding,其编码方案如下:

Repeated coding
class class5v3 class9v5 class12v9
3 -3/4 -1/2 -1/4
5 1/4 -1/2 -1/4
9 1/4 1/2 -1/4
12 1/4 1/2 3/4

将class5v3、class9v5、class12v9纳为自变量,得到回归方程dm1 = intercept*1 + c5v3*class5v3 + c9v5*class9v5 + c12v9*class12v9,其中intercept为3班、5班、9班、12班的均值的均值。class5v3的回归系数c5v3检验3班与5班的差异,c5v3等于(class5mean - class3mean)。

# Repeated coding using recode() in car package
depress$class5v3  <- recode(depress$class, "3 = -3/4; else = 1/4")
depress$class9v5  <- recode(depress$class, "c(3, 5) = -1/2; else = 1/2")
depress$class12v9 <- recode(depress$class, "12= 3/4; else = -1/4")

# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c5v3 <- class5mean - class3mean
c9v5 <- class9mean - class5mean
c12v9 <- class12mean - class9mean
c(intercept = intercept, c5v3 = c5v3, c9v5 = c9v5, c12v9 = c12v9)
 intercept       c5v3       c9v5      c12v9 
 1.9197892  0.1196286  0.1295547 -0.1526316 
# regression analysis
summary(lm(dm1 ~ class5v3 + class9v5 + class12v9, depress))

Call:
lm(formula = dm1 ~ class5v3 + class9v5 + class12v9, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.91979    0.03965  48.420   <2e-16 ***
class5v3     0.11963    0.10222   1.170    0.245    
class9v5     0.12955    0.11424   1.134    0.260    
class12v9   -0.15263    0.12125  -1.259    0.211    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

3 其他编码方案

3.1 Forward difference coding

我们调整backward difference coding(Repeated coding)的比较方向,就得到了Forward difference coding这一编码方案。

Forward difference coding
class class3v5 class5v9 class9v12
3 3/4 1/2 1/4
5 -1/4 1/2 1/4
9 -1/4 -1/2 1/4
12 -1/4 -1/2 -3/4

3.2 自定义编码方案

我们可以使用自定义编码方案检验我们想要检验的任何差异。在下面的编码方案中,code1将3班与5班进行比较,code2将5班与9班、12班的均值的均值进行比较,code3将12班与3班、5班、9班的均值的均值进行比较。

自定义编码方案
class code1 code2 code3
3 1 0 -1/3
5 -1 1 -1/3
9 0 -1/2 -1/3
12 0 -1/2 1

首先定义这个编码矩阵:

code1 <- c(   1,   -1,    0,    0)
code2 <- c(   0,    1, -1/2, -1/2)
code3 <- c(-1/3, -1/3, -1/3,    1)
# use fractions() in MASS package
library(MASS)
mat <- fractions(cbind(code1, code2, code3))
rownames(mat) <- c("class 3", "class 5", "class 9", "class 12")
mat
         code1 code2 code3
class 3     1     0  -1/3 
class 5    -1     1  -1/3 
class 9     0  -1/2  -1/3 
class 12    0  -1/2     1 

然后使用MASSginv()的对其进行转换,得到我们要在回归分析中实际使用的矩阵:

mat_final <- fractions(ginv(t(mat)))
colnames(mat_final) <- colnames(mat)
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
mat_final
         code1 code2 code3
class 3   3/4   1/2     0 
class 5  -1/4   1/2     0 
class 9  -1/2    -1  -3/4 
class 12    0     0   3/4 

之后进行回归分析,回归方程是dm1 ~ intercept*1 + c1*code1 + c2*code2 + c3*code3

# Custom coding using recode() in car package
depress$code1 <- recode(depress$class, "3 = 3/4; 5 = -1/4; 9 = -1/2; 12 =   0")
depress$code2 <- recode(depress$class, "3 = 1/2; 5 =  1/2; 9 =   -1; 12 =   0")
depress$code3 <- recode(depress$class, "3 =   0; 5 =    0; 9 = -3/4; 12 = 3/4")

# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c1 <- class3mean - class5mean
c2 <- class5mean - mean(c(class9mean, class12mean))
c3 <- class12mean - mean(c(class3mean, class5mean, class9mean))
c(intercept = intercept, c1 = c1, c2 = c2, c3 = c3)
  intercept          c1          c2          c3 
 1.91978919 -0.11962865 -0.05323887 -0.02638559 
# regression analysis
summary(lm(dm1 ~ code1 + code2 + code3, depress))

Call:
lm(formula = dm1 ~ code1 + code2 + code3, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.91979    0.03965  48.420   <2e-16 ***
code1       -0.11963    0.10222  -1.170    0.245    
code2       -0.05324    0.09584  -0.555    0.580    
code3       -0.02639    0.09572  -0.276    0.783    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

注意,前文没有强调截距的编码。对于有k个水平的分类自变量,我们需要将其编码为k-1个变量。那截距呢?如果将截距考虑在内,完整的编码方案应当包含对截距的编码code0。code0的每一个系数都是1/4,表示我们想让截距表示4个水平的均值的均值。

code0 <- c( 1/4,  1/4,  1/4,  1/4)
code1 <- c(   1,   -1,    0,    0)
code2 <- c(   0,    1, -1/2, -1/2)
code3 <- c(-1/3, -1/3, -1/3,    1)
mat <- fractions(cbind(code0, code1, code2, code3))
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
rownames(mat) <- c("class 3", "class 5", "class 9", "class 12")
mat
         code0 code1 code2 code3
class 3   1/4     1     0  -1/3 
class 5   1/4    -1     1  -1/3 
class 9   1/4     0  -1/2  -1/3 
class 12  1/4     0  -1/2     1 

注意,上面的矩阵并不是我们最终在回归分析中使用的矩阵,我们需要使用MASSginv()的对其进行转换:

mat_final <- fractions(ginv(t(mat)))
colnames(mat_final) <- colnames(mat)
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
mat_final
         code0 code1 code2 code3
class 3     1   3/4   1/2     0 
class 5     1  -1/4   1/2     0 
class 9     1  -1/2    -1  -3/4 
class 12    1     0     0   3/4 

上面的矩阵是我们最终在回归分析中使用的矩阵。对截距来说,与一般的回归分析一致,我们将截距的自变量编码为1。这里的1实际上是一个向量,它有n 个维度(n 是样本量),每一个case的值都是1。在完整的回归方程中,例如dm1 ~ intercept*1 + c1*code1 + c2*code2 + c3*code3,截距项是intercept*1interceptc1一样,是一个未知的、有待估计的参数,而1code1一样,是一个已知的、不需要估计的变量(向量)。

我们也可以自定义截距,让截距表示3班的均值。

# compute codes
code0 <- c(   1,    0,    0,    0)
code1 <- c(   1,   -1,    0,    0)
code2 <- c(   0,    1, -1/2, -1/2)
code3 <- c(-1/3, -1/3, -1/3,    1)
mat <- fractions(cbind(code0, code1, code2, code3))
rownames(mat) <- c("class 3", "class 5", "class 9", "class 12")
mat
         code0 code1 code2 code3
class 3     1     1     0  -1/3 
class 5     0    -1     1  -1/3 
class 9     0     0  -1/2  -1/3 
class 12    0     0  -1/2     1 
mat_final <- fractions(ginv(t(mat)))
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
colnames(mat_final) <- colnames(mat)
mat_final
         code0 code1 code2 code3
class 3     1     0     0     0 
class 5     1    -1     0     0 
class 9     1  -5/4  -3/2  -3/4 
class 12    1  -3/4  -1/2   3/4 
# Custom coding using recode() in car package
depress$code1 <- recode(depress$class, "3 = 0; 5 = -1; 9 = -5/4; 12 = -3/4")
depress$code2 <- recode(depress$class, "3 = 0; 5 =  0; 9 = -3/2; 12 = -1/2")
depress$code3 <- recode(depress$class, "3 = 0; 5 =  0; 9 = -3/4; 12 =  3/4")

# compute regression parameters through the means of classes
intercept <- class3mean
c1 <- class3mean - class5mean
c2 <- class5mean - mean(c(class9mean, class12mean))
c3 <- class12mean - mean(c(class3mean, class5mean, class9mean))
c(intercept = intercept, c1 = c1, c2 = c2, c3 = c3)
  intercept          c1          c2          c3 
 1.80344828 -0.11962865 -0.05323887 -0.02638559 
# regression analysis
summary(lm(dm1 ~ code1 + code2 + code3, depress))

Call:
lm(formula = dm1 ~ code1 + code2 + code3, data = depress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85263 -0.25345 -0.00304  0.24423  1.14655 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.80345    0.07028  25.659   <2e-16 ***
code1       -0.11963    0.10222  -1.170    0.245    
code2       -0.05324    0.09584  -0.555    0.580    
code3       -0.02639    0.09572  -0.276    0.783    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared:  0.053, Adjusted R-squared:  0.02144 
F-statistic: 1.679 on 3 and 90 DF,  p-value: 0.1771

3.3 Pairwise coding

对于有3个分类水平的自变量,如果我们要进行两两配对比较,我们总共需要3次配对比较。前面讲到的编码方案只能实现2次配对比较。本文提出一种新的编码方案,利用截距,通过一次回归分析实现三分类自变量3个水平的3次配对比较。

首先我们根据上述目的设计矩阵。这里我们将3班、5班、9班的数据选出来,比较3个班的dm1的差异。注意,这里的p1v3相当于截距。

Pairwise coding
class p3v1 p1v2 p2v3
3 -1 1 0
5 0 -1 1
9 1 0 -1
# compute codes
p1v3 <- c(-1,  0,  1)
p1v2 <- c( 1, -1,  0)
p2v3 <- c( 0,  1, -1)
mat <- fractions(cbind(p1v3, p1v2, p2v3))
rownames(mat) <- c("class 3", "class 5", "class 9")
mat
        p1v3 p1v2 p2v3
class 3 -1    1    0  
class 5  0   -1    1  
class 9  1    0   -1  
mat_final <- fractions(ginv(t(mat)))
rownames(mat_final) <- c("class 3", "class 5", "class 9")
colnames(mat_final) <- colnames(mat)
mat_final
        p1v3 p1v2 p2v3
class 3 -1/3  1/3    0
class 5    0 -1/3  1/3
class 9  1/3    0 -1/3
# Custom coding using recode() in car package
depress$p1v3 <- recode(depress$class, "3 = -1/3; 5 =    0; 9 =  1/3")
depress$p1v2 <- recode(depress$class, "3 =  1/3; 5 = -1/3; 9 =    0")
depress$p2v3 <- recode(depress$class, "3 =    0; 5 =  1/3; 9 = -1/3")

接下来,我们进行回归分析,建立回归方程dm1 = c0*p1v3 + c1*p1v2 + c2*p2v3,注意在这个回归方程中,c0*p1v3是截距项,我们需要将lm()默认的截距项intercept*1去除。

# compute regression parameters through the means of classes
c0 <- class3mean - class9mean
c1 <- class3mean - class5mean
c2 <- class5mean - class9mean
c(c0 = c0, c1 = c1, c2 = c2)
        c0         c1         c2 
-0.2491833 -0.1196286 -0.1295547 
# regression analysis
summary(lm(dm1 ~ p1v3 + p1v2 + p2v3 - 1, depress[depress$class %in% c(3, 5, 9), ]))

Call:
lm(formula = dm1 ~ p1v3 + p1v2 + p2v3 - 1, data = depress[depress$class %in% 
    c(3, 5, 9), ])

Residuals:
   Min     1Q Median     3Q    Max 
 1.083  1.437  1.797  2.230  3.320 

Coefficients: (1 not defined because of singularities)
     Estimate Std. Error t value Pr(>|t|)
p1v3  -1.1086     1.0285  -1.078    0.285
p1v2  -0.4591     0.9608  -0.478    0.634
p2v3       NA         NA      NA       NA

Residual standard error: 1.961 on 72 degrees of freedom
Multiple R-squared:  0.01622,   Adjusted R-squared:  -0.01111 
F-statistic: 0.5935 on 2 and 72 DF,  p-value: 0.5551

结果显示,1 coefficient is not defined because of singularities.