多分类自变量的编码方案
Coding categorical predictors
在回归分析中,若自变量是多分类的类别变量,我们需要将其进行编码,然后将其纳入回归方程。如果该分类变量有k个水平,我们需要将其编码为k-1个变量。
例如,要通过回归分析检验class这一多分类自变量对dm1(depression)的影响,我们不能将class直接当做连续变量进行回归分析,因为class的得分3、5、9、12之间的差值、距离没有实际意义,我们需要将class进行编码,将其转化为有意义的predictor。
1 PROCESS中的编码方案
PROCESS中有4种编码方案:
- Indicator
- Sequential
- Helmert
- Effect
其中,indicator即为dummy coding。
1.1 Dummy coding
我们将class
进行虚拟编码,其编码方案如下:
class | class5 | class9 | class12 |
---|---|---|---|
3 | 0 | 0 | 0 |
5 | 1 | 0 | 0 |
9 | 0 | 1 | 0 |
12 | 0 | 0 | 1 |
注:class5 = 该班是5班吗;class9 = 该班是9班吗;class12 = 该班是12班吗。
class被编码为三个虚拟变量class5、class9、class12,在回归分析中,我们将class5、class9、class12同时纳为自变量,它们共同反映了class对因变量的影响。该编码方案以3班为对照组,将5、9、12班分别与3班进行比较。就回归系数的含义而言,该回归模型dm1 ~ intercept*1 + c5*class5 + c9*class9 + c12*class12
的截距intercept表示自变量都为0时dm1的平均水平,当class5、class9、class12都为0表示3班,所以intercept为3班的平均水平。class5 的回归系数c5表示其他自变量保持不变时,class5从0变为1所导致的dm1的变化量,当class为3和5时,class9和class12的取值保持不变,都为0,此时class5从0变为1表示班级从3班变为5班,因此class5的回归系数c5表示5班与3班的差异。同理,class9的回归系数表示9班与3班的差异,class12的回归系数表示12班与3班的差异。
首先,加载程序包与数据。
之后,我们计算各组均值(即各班均值),并使用各组的均值计算回归分析中的系数。这一步的目的是核查回归分析的系数的含义。
# mean of class 3
class3mean <- mean(depress[depress$class == 3, "dm1"])
# mean of class 5
class5mean <- mean(depress[depress$class == 5, "dm1"])
# mean of class 9
class9mean <- mean(depress[depress$class == 9, "dm1"])
# mean of class 12
class12mean <- mean(depress[depress$class == 12, "dm1"])
# compute regression parameters through the means of classes
intercept <- class3mean
c5 <- class5mean - class3mean
c9 <- class9mean - class3mean
c12 <- class12mean - class3mean
我们将class这一个多分类变量编码为3个变量。
输出基于四个均值计算得到的回归系数。
intercept c5 c9 c12
1.80344828 0.11962865 0.24918330 0.09655172
最后,进行回归分析,并输出其结果。注意将上一步计算得到的回归系数与下面的回归系数进行比较。
Call:
lm(formula = dm1 ~ class5 + class9 + class12, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.80345 0.07028 25.659 <2e-16 ***
class5 0.11963 0.10222 1.170 0.2450
class9 0.24918 0.11171 2.231 0.0282 *
class12 0.09655 0.11001 0.878 0.3825
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
1.2 Sequential coding
Sequential coding将一个水平与后一个水平进行比较。下面的编码方案将5班与3班比较,将9班与5班比较,将12班与9班比较。该编码方案适用于顺序变量(例如:年级1-4)。
class | class5v3 | class9v5 | class12v9 |
---|---|---|---|
3 | 0 | 0 | 0 |
5 | 1 | 0 | 0 |
9 | 1 | 1 | 0 |
12 | 1 | 1 | 1 |
将class5v3、class9v5、class12v9纳为自变量,得到回归方程dm1 = intercept*1 + c5v3*class5v3 + c9v5*class9v5 + c12v9*class12v9
,其中intercept为3班的均值。class5v3的回归系数c5v3检验3班与5班的差异,c5v3等于(class5mean - class3mean)。
Loading required package: carData
# Sequential coding using recode() in car package
depress$class5v3 <- recode(depress$class, "3 = 0; else = 1")
depress$class9v5 <- recode(depress$class, "c(3, 5) = 0; else = 1")
depress$class12v9 <- recode(depress$class, "12= 1; else = 0")
# compute regression parameters through the means of classes
intercept <- class3mean
c5v3 <- class5mean - class3mean
c9v5 <- class9mean - class5mean
c12v9 <- class12mean - class9mean
c(intercept = intercept, c5v3 = c5v3, c9v5 = c9v5, c12v9 = c12v9)
intercept c5v3 c9v5 c12v9
1.8034483 0.1196286 0.1295547 -0.1526316
Call:
lm(formula = dm1 ~ class5v3 + class9v5 + class12v9, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.80345 0.07028 25.659 <2e-16 ***
class5v3 0.11963 0.10222 1.170 0.245
class9v5 0.12955 0.11424 1.134 0.260
class12v9 -0.15263 0.12125 -1.259 0.211
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
与Repeated coding相比,Sequential coding的回归系数的含义是一样的,但截距的含义不同。使用Repeated coding时,截距表示4个班级的均值的均值;使用Sequential coding时,截距表示3班的均值。
1.3 Helmert coding
该编码方案适用于顺序变量(例如:年级1-4)。编码方案如下:
class | H3 | H5 | H9 |
---|---|---|---|
3 | 3 | 0 | 0 |
5 | -1 | 2 | 0 |
9 | -1 | -1 | 1 |
12 | -1 | -1 | -1 |
注:H3 = 将3班与5、9、12班的均值进行对比;H5 = 将5班与9、12班的均值进行对比;H9 = 将3班与5、9、12班的均值进行对比。
我们将H3、H5、H9纳为回归方程的自变量,建立回归模型dm1 ~ intercept*1 + c3*H3 + c5*H5 + c9*H9
。截距intercept表示四个班级的dm1的均值的均值。H3的回归系数c3检验3班与5、9、12班均值的均值的差异,该系数等于[(class3mean - (class5mean + class9mean + class12mean)/3)/4]。H5的回归系数c5检验5班与9、12班均值的均值的差异,H9的回归系数c9检验9班与12班的均值的差异。
Helmert coding与回归分析的操作如下:
# Helmert coding using recode() in car package
depress$H3 <- recode(depress$class, "3 = 3; else = -1")
depress$H5 <- recode(depress$class, "3 = 0; 5 = 2; else = -1")
depress$H9 <- recode(depress$class, "c(3, 5) = 0; 9 = 1; 12 = -1")
# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c3 <- (class3mean - (class5mean + class9mean + class12mean)/3)/4
c5 <- (class5mean - (class9mean + class12mean)/2)/3
c9 <- (class9mean - class12mean)/2
c(intercept = intercept, c3 = c3, c5 = c5, c9 = c9)
intercept c3 c5 c9
1.91978919 -0.03878031 -0.01774629 0.07631579
Call:
lm(formula = dm1 ~ H3 + H5 + H9, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91979 0.03965 48.420 <2e-16 ***
H3 -0.03878 0.02119 -1.830 0.0706 .
H5 -0.01775 0.03195 -0.555 0.5799
H9 0.07632 0.06063 1.259 0.2114
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
我们可以对编码系数进行调整,使H3的回归系数c3等于[class3mean - (class5mean + class9mean + class12mean)/3]:
class | H3 | H5 | H9 |
---|---|---|---|
3 | 3/4 | 0 | 0 |
5 | -1/4 | 2/3 | 0 |
9 | -1/4 | -1/3 | 1/2 |
12 | -1/4 | -1/3 | -1/2 |
# Helmert coding using recode() in car package
depress$H3 <- recode(depress$class, "3 = 3/4; else = -1/4")
depress$H5 <- recode(depress$class, "3 = 0; 5 = 2/3; else = -1/3")
depress$H9 <- recode(depress$class, "c(3, 5) = 0; 9 = 1/2; 12 = -1/2")
# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c3 <- class3mean - (class5mean + class9mean + class12mean)/3
c5 <- class5mean - (class9mean + class12mean)/2
c9 <- class9mean - class12mean
c(intercept = intercept, c3 = c3, c5 = c5, c9 = c9)
intercept c3 c5 c9
1.91978919 -0.15512122 -0.05323887 0.15263158
Call:
lm(formula = dm1 ~ H3 + H5 + H9, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91979 0.03965 48.420 <2e-16 ***
H3 -0.15512 0.08477 -1.830 0.0706 .
H5 -0.05324 0.09584 -0.555 0.5799
H9 0.15263 0.12125 1.259 0.2114
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
1.4 Effect coding
Effect coding又被称为contrast coding,也被称为deviation coding,其编码方案如下:
class | class3 | class5 | class9 |
---|---|---|---|
3 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
9 | 0 | 0 | 1 |
12 | -1 | -1 | -1 |
将class3、class5、class9纳为自变量,得到回归方程dm1 = intercept*1 + c3*class3 + c5*class5 + c9*class9
,其中intercept为4个班的均值的均值。class3的回归系数c3将3班的均值与4个班的均值的均值进行比较,c3等于(class3mean - (class3mean + class5mean + class9mean + class12mean)/3)。class5的回归系数c5将5班的均值与4个班的均值的均值进行比较。class9的回归系数c9将9班的均值与4个班的均值的均值进行比较。
# Effect coding using recode() in car package
depress$class3 <- recode(depress$class, "3 = 1; 12 = -1; else = 0")
depress$class5 <- recode(depress$class, "5 = 1; 12 = -1; else = 0")
depress$class9 <- recode(depress$class, "9 = 1; 12 = -1; else = 0")
# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c3 <- class3mean - mean(c(class3mean, class5mean, class9mean, class12mean))
c5 <- class5mean - mean(c(class3mean, class5mean, class9mean, class12mean))
c9 <- class9mean - mean(c(class3mean, class5mean, class9mean, class12mean))
c(intercept = intercept, c3 = c3, c5 = c5, c9 = c9)
intercept c3 c5 c9
1.919789194 -0.116340919 0.003287729 0.132842384
Call:
lm(formula = dm1 ~ class3 + class5 + class9, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.919789 0.039649 48.420 <2e-16 ***
class3 -0.116341 0.063577 -1.830 0.0706 .
class5 0.003288 0.065780 0.050 0.9602
class9 0.132842 0.073089 1.818 0.0725 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
2 SPSS中的编码方案
- Simple
- Deviation
- Difference
- Helmert
- Repeated
- Polynomial
2.1 Simple coding
编码方案如下:
class | class5 | class9 | class12 |
---|---|---|---|
3 | -1 | -1 | -1 |
5 | 3 | -1 | -1 |
9 | -1 | 3 | -1 |
12 | -1 | -1 | 3 |
注:class5 = 将5班与3班的均值进行对比;class9 = 将9班与3班的均值进行对比;class12 = 将12班与3班的均值进行对比。
我们将class5、class9、class12纳为回归方程的自变量,建立回归模型dm1 ~ intercept*1 + c5*class5 + c9*class9 + c12*class12
。截距intercept表示四个班级的dm1的均值的均值。class5的回归系数c5检验5班与3班的差异的显著性,该系数等于5班均值与3班均值的差值的1/4[(class5mean - class3mean)/4]。class9的回归系数c9检验9班与3班的差异,class12的回归系数c12检验12班与3班的差异。
Simple coding与回归分析的操作如下:
# Simple coding
depress$class5 <- ifelse(depress$class == 5, 3, -1)
depress$class9 <- ifelse(depress$class == 9, 3, -1)
depress$class12 <- ifelse(depress$class == 12, 3, -1)
# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c5 <- (class5mean - class3mean)/4
c9 <- (class9mean - class3mean)/4
c12 <- (class12mean - class3mean)/4
c(intercept = intercept, c5 = c5, c9 = c9, c12 = c12)
intercept c5 c9 c12
1.91978919 0.02990716 0.06229583 0.02413793
Call:
lm(formula = dm1 ~ class5 + class9 + class12, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91979 0.03965 48.420 <2e-16 ***
class5 0.02991 0.02556 1.170 0.2450
class9 0.06230 0.02793 2.231 0.0282 *
class12 0.02414 0.02750 0.878 0.3825
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
我们可以对编码系数进行调整,使class5的回归系数c5等于5班均值与3班均值的差值[(class5mean - class3mean)/4]:
# Simple coding
depress$class5 <- ifelse(depress$class == 5, 3/4, -1/4)
depress$class9 <- ifelse(depress$class == 9, 3/4, -1/4)
depress$class12 <- ifelse(depress$class == 12, 3/4, -1/4)
# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c5 <- (class5mean - class3mean)
c9 <- (class9mean - class3mean)
c12 <- (class12mean - class3mean)
c(intercept = intercept, c5 = c5, c9 = c9, c12 = c12)
intercept c5 c9 c12
1.91978919 0.11962865 0.24918330 0.09655172
Call:
lm(formula = dm1 ~ class5 + class9 + class12, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91979 0.03965 48.420 <2e-16 ***
class5 0.11963 0.10222 1.170 0.2450
class9 0.24918 0.11171 2.231 0.0282 *
class12 0.09655 0.11001 0.878 0.3825
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
2.2 Helmert coding
Helmert coding前文已介绍过。
2.3 Difference coding
Difference coding又称Reverse Helmert Coding,其编码顺序与Helmert coding相反,Difference coding编码方案如下:
class | D5 | D9 | D12 |
---|---|---|---|
3 | -1 | -1 | -1 |
5 | 1 | -1 | -1 |
9 | 0 | 2 | -1 |
12 | 0 | 0 | 3 |
2.4 Polynomial coding
Polynomial coding适用于勾画变量的变化趋势,例如:从3班到5班到9班到12班,dm1是否表现出线性变化趋势呢?我们可以用Polynomial coding对此进行考察。class包含4个类别,需要编码为3个codes,在Polynomial coding中,这三个codes分别代表线性趋势、二次曲线趋势、三次曲线趋势。Polynomial coding编码方案如下:
class | L | Q | C |
---|---|---|---|
3 | -3 | 1 | -1 |
5 | -1 | -1 | 3 |
9 | 1 | -1 | -3 |
12 | 3 | 1 | 1 |
# Polynomial coding using recode() in car package
depress$L <- recode(depress$class, "3 = -3; 5 = -1; 9 = 1; 12 = 3")
depress$Q <- recode(depress$class, "3 = 1; 5 = -1; 9 = -1; 12 = 1")
depress$C <- recode(depress$class, "3 = -1; 5 = 3; 9 = -3; 12 = 1")
# regression analysis
summary(lm(dm1 ~ L + Q + C, depress))
Call:
lm(formula = dm1 ~ L + Q + C, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91979 0.03965 48.420 <2e-16 ***
L 0.02096 0.01746 1.200 0.2332
Q -0.06807 0.03965 -1.717 0.0895 .
C -0.01461 0.01800 -0.812 0.4192
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
2.5 Repeated coding
Repeated coding又被称为backward difference coding,其编码方案如下:
class | class5v3 | class9v5 | class12v9 |
---|---|---|---|
3 | -3/4 | -1/2 | -1/4 |
5 | 1/4 | -1/2 | -1/4 |
9 | 1/4 | 1/2 | -1/4 |
12 | 1/4 | 1/2 | 3/4 |
将class5v3、class9v5、class12v9纳为自变量,得到回归方程dm1 = intercept*1 + c5v3*class5v3 + c9v5*class9v5 + c12v9*class12v9
,其中intercept为3班、5班、9班、12班的均值的均值。class5v3的回归系数c5v3检验3班与5班的差异,c5v3等于(class5mean - class3mean)。
# Repeated coding using recode() in car package
depress$class5v3 <- recode(depress$class, "3 = -3/4; else = 1/4")
depress$class9v5 <- recode(depress$class, "c(3, 5) = -1/2; else = 1/2")
depress$class12v9 <- recode(depress$class, "12= 3/4; else = -1/4")
# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c5v3 <- class5mean - class3mean
c9v5 <- class9mean - class5mean
c12v9 <- class12mean - class9mean
c(intercept = intercept, c5v3 = c5v3, c9v5 = c9v5, c12v9 = c12v9)
intercept c5v3 c9v5 c12v9
1.9197892 0.1196286 0.1295547 -0.1526316
Call:
lm(formula = dm1 ~ class5v3 + class9v5 + class12v9, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91979 0.03965 48.420 <2e-16 ***
class5v3 0.11963 0.10222 1.170 0.245
class9v5 0.12955 0.11424 1.134 0.260
class12v9 -0.15263 0.12125 -1.259 0.211
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
3 其他编码方案
3.1 Forward difference coding
我们调整backward difference coding(Repeated coding)的比较方向,就得到了Forward difference coding这一编码方案。
class | class3v5 | class5v9 | class9v12 |
---|---|---|---|
3 | 3/4 | 1/2 | 1/4 |
5 | -1/4 | 1/2 | 1/4 |
9 | -1/4 | -1/2 | 1/4 |
12 | -1/4 | -1/2 | -3/4 |
3.2 自定义编码方案
我们可以使用自定义编码方案检验我们想要检验的任何差异。在下面的编码方案中,code1将3班与5班进行比较,code2将5班与9班、12班的均值的均值进行比较,code3将12班与3班、5班、9班的均值的均值进行比较。
class | code1 | code2 | code3 |
---|---|---|---|
3 | 1 | 0 | -1/3 |
5 | -1 | 1 | -1/3 |
9 | 0 | -1/2 | -1/3 |
12 | 0 | -1/2 | 1 |
首先定义这个编码矩阵:
code1 <- c( 1, -1, 0, 0)
code2 <- c( 0, 1, -1/2, -1/2)
code3 <- c(-1/3, -1/3, -1/3, 1)
# use fractions() in MASS package
library(MASS)
mat <- fractions(cbind(code1, code2, code3))
rownames(mat) <- c("class 3", "class 5", "class 9", "class 12")
mat
code1 code2 code3
class 3 1 0 -1/3
class 5 -1 1 -1/3
class 9 0 -1/2 -1/3
class 12 0 -1/2 1
然后使用MASS
中ginv()
的对其进行转换,得到我们要在回归分析中实际使用的矩阵:
mat_final <- fractions(ginv(t(mat)))
colnames(mat_final) <- colnames(mat)
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
mat_final
code1 code2 code3
class 3 3/4 1/2 0
class 5 -1/4 1/2 0
class 9 -1/2 -1 -3/4
class 12 0 0 3/4
之后进行回归分析,回归方程是dm1 ~ intercept*1 + c1*code1 + c2*code2 + c3*code3
。
# Custom coding using recode() in car package
depress$code1 <- recode(depress$class, "3 = 3/4; 5 = -1/4; 9 = -1/2; 12 = 0")
depress$code2 <- recode(depress$class, "3 = 1/2; 5 = 1/2; 9 = -1; 12 = 0")
depress$code3 <- recode(depress$class, "3 = 0; 5 = 0; 9 = -3/4; 12 = 3/4")
# compute regression parameters through the means of classes
intercept <- mean(c(class3mean, class5mean, class9mean, class12mean))
c1 <- class3mean - class5mean
c2 <- class5mean - mean(c(class9mean, class12mean))
c3 <- class12mean - mean(c(class3mean, class5mean, class9mean))
c(intercept = intercept, c1 = c1, c2 = c2, c3 = c3)
intercept c1 c2 c3
1.91978919 -0.11962865 -0.05323887 -0.02638559
Call:
lm(formula = dm1 ~ code1 + code2 + code3, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.91979 0.03965 48.420 <2e-16 ***
code1 -0.11963 0.10222 -1.170 0.245
code2 -0.05324 0.09584 -0.555 0.580
code3 -0.02639 0.09572 -0.276 0.783
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
注意,前文没有强调截距的编码。对于有k个水平的分类自变量,我们需要将其编码为k-1个变量。那截距呢?如果将截距考虑在内,完整的编码方案应当包含对截距的编码code0。code0的每一个系数都是1/4,表示我们想让截距表示4个水平的均值的均值。
code0 <- c( 1/4, 1/4, 1/4, 1/4)
code1 <- c( 1, -1, 0, 0)
code2 <- c( 0, 1, -1/2, -1/2)
code3 <- c(-1/3, -1/3, -1/3, 1)
mat <- fractions(cbind(code0, code1, code2, code3))
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
rownames(mat) <- c("class 3", "class 5", "class 9", "class 12")
mat
code0 code1 code2 code3
class 3 1/4 1 0 -1/3
class 5 1/4 -1 1 -1/3
class 9 1/4 0 -1/2 -1/3
class 12 1/4 0 -1/2 1
注意,上面的矩阵并不是我们最终在回归分析中使用的矩阵,我们需要使用MASS
中ginv()
的对其进行转换:
mat_final <- fractions(ginv(t(mat)))
colnames(mat_final) <- colnames(mat)
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
mat_final
code0 code1 code2 code3
class 3 1 3/4 1/2 0
class 5 1 -1/4 1/2 0
class 9 1 -1/2 -1 -3/4
class 12 1 0 0 3/4
上面的矩阵是我们最终在回归分析中使用的矩阵。对截距来说,与一般的回归分析一致,我们将截距的自变量编码为1
。这里的1
实际上是一个向量,它有n 个维度(n 是样本量),每一个case的值都是1。在完整的回归方程中,例如dm1 ~ intercept*1 + c1*code1 + c2*code2 + c3*code3
,截距项是intercept*1
,intercept
与c1
一样,是一个未知的、有待估计的参数,而1
与code1
一样,是一个已知的、不需要估计的变量(向量)。
我们也可以自定义截距,让截距表示3班的均值。
# compute codes
code0 <- c( 1, 0, 0, 0)
code1 <- c( 1, -1, 0, 0)
code2 <- c( 0, 1, -1/2, -1/2)
code3 <- c(-1/3, -1/3, -1/3, 1)
mat <- fractions(cbind(code0, code1, code2, code3))
rownames(mat) <- c("class 3", "class 5", "class 9", "class 12")
mat
code0 code1 code2 code3
class 3 1 1 0 -1/3
class 5 0 -1 1 -1/3
class 9 0 0 -1/2 -1/3
class 12 0 0 -1/2 1
mat_final <- fractions(ginv(t(mat)))
rownames(mat_final) <- c("class 3", "class 5", "class 9", "class 12")
colnames(mat_final) <- colnames(mat)
mat_final
code0 code1 code2 code3
class 3 1 0 0 0
class 5 1 -1 0 0
class 9 1 -5/4 -3/2 -3/4
class 12 1 -3/4 -1/2 3/4
# Custom coding using recode() in car package
depress$code1 <- recode(depress$class, "3 = 0; 5 = -1; 9 = -5/4; 12 = -3/4")
depress$code2 <- recode(depress$class, "3 = 0; 5 = 0; 9 = -3/2; 12 = -1/2")
depress$code3 <- recode(depress$class, "3 = 0; 5 = 0; 9 = -3/4; 12 = 3/4")
# compute regression parameters through the means of classes
intercept <- class3mean
c1 <- class3mean - class5mean
c2 <- class5mean - mean(c(class9mean, class12mean))
c3 <- class12mean - mean(c(class3mean, class5mean, class9mean))
c(intercept = intercept, c1 = c1, c2 = c2, c3 = c3)
intercept c1 c2 c3
1.80344828 -0.11962865 -0.05323887 -0.02638559
Call:
lm(formula = dm1 ~ code1 + code2 + code3, data = depress)
Residuals:
Min 1Q Median 3Q Max
-0.85263 -0.25345 -0.00304 0.24423 1.14655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.80345 0.07028 25.659 <2e-16 ***
code1 -0.11963 0.10222 -1.170 0.245
code2 -0.05324 0.09584 -0.555 0.580
code3 -0.02639 0.09572 -0.276 0.783
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3785 on 90 degrees of freedom
Multiple R-squared: 0.053, Adjusted R-squared: 0.02144
F-statistic: 1.679 on 3 and 90 DF, p-value: 0.1771
3.3 Pairwise coding
对于有3个分类水平的自变量,如果我们要进行两两配对比较,我们总共需要3次配对比较。前面讲到的编码方案只能实现2次配对比较。本文提出一种新的编码方案,利用截距,通过一次回归分析实现三分类自变量3个水平的3次配对比较。
首先我们根据上述目的设计矩阵。这里我们将3班、5班、9班的数据选出来,比较3个班的dm1的差异。注意,这里的p1v3相当于截距。
class | p3v1 | p1v2 | p2v3 |
---|---|---|---|
3 | -1 | 1 | 0 |
5 | 0 | -1 | 1 |
9 | 1 | 0 | -1 |
# compute codes
p1v3 <- c(-1, 0, 1)
p1v2 <- c( 1, -1, 0)
p2v3 <- c( 0, 1, -1)
mat <- fractions(cbind(p1v3, p1v2, p2v3))
rownames(mat) <- c("class 3", "class 5", "class 9")
mat
p1v3 p1v2 p2v3
class 3 -1 1 0
class 5 0 -1 1
class 9 1 0 -1
mat_final <- fractions(ginv(t(mat)))
rownames(mat_final) <- c("class 3", "class 5", "class 9")
colnames(mat_final) <- colnames(mat)
mat_final
p1v3 p1v2 p2v3
class 3 -1/3 1/3 0
class 5 0 -1/3 1/3
class 9 1/3 0 -1/3
接下来,我们进行回归分析,建立回归方程dm1 = c0*p1v3 + c1*p1v2 + c2*p2v3
,注意在这个回归方程中,c0*p1v3
是截距项,我们需要将lm()
默认的截距项intercept*1
去除。
# compute regression parameters through the means of classes
c0 <- class3mean - class9mean
c1 <- class3mean - class5mean
c2 <- class5mean - class9mean
c(c0 = c0, c1 = c1, c2 = c2)
c0 c1 c2
-0.2491833 -0.1196286 -0.1295547
# regression analysis
summary(lm(dm1 ~ p1v3 + p1v2 + p2v3 - 1, depress[depress$class %in% c(3, 5, 9), ]))
Call:
lm(formula = dm1 ~ p1v3 + p1v2 + p2v3 - 1, data = depress[depress$class %in%
c(3, 5, 9), ])
Residuals:
Min 1Q Median 3Q Max
1.083 1.437 1.797 2.230 3.320
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
p1v3 -1.1086 1.0285 -1.078 0.285
p1v2 -0.4591 0.9608 -0.478 0.634
p2v3 NA NA NA NA
Residual standard error: 1.961 on 72 degrees of freedom
Multiple R-squared: 0.01622, Adjusted R-squared: -0.01111
F-statistic: 0.5935 on 2 and 72 DF, p-value: 0.5551
结果显示,1 coefficient is not defined because of singularities.