Question 1: The basic workflow of machine learning (Part 9)


where $I$ is the indicator function:

$$I=\begin{cases}1 & \text{if } x \in R_{m} \\ 0 & \text{if } x \notin R_{m}\end{cases}$$
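3. Example
The table below is the training dataset. The feature vector has only one dimension. Build a regression decision tree from this data.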
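| $x$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| $y$ | 5.56 | 5.7 | 5.91 | 6.4 | 6.8 | 7.05 | 8.9 | 8.7 | 9 | 9.05 |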
(1) Choose the optimal splitting variable $j$ and the optimal split point $s$: this dataset has only one feature, so the optimal splitting variable is necessarily $x$. Next, consider the 9 candidate split points $\{1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5\}$ (any point inside the interval $[a^{i}, a^{i+1})$ between two adjacent values of the splitting variable would serve equally well), and compute the loss function value at each candidate split point.
The loss function is

$$L(j, s)=\sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-\hat{c}_{1}\right)^{2}+\sum_{x_{i} \in R_{2}(j, s)}\left(y_{i}-\hat{c}_{2}\right)^{2}$$

where $\hat{c}_{1}=\frac{1}{N_{1}} \sum_{x_{i} \in R_{1}(j, s)} y_{i}$ and $\hat{c}_{2}=\frac{1}{N_{2}} \sum_{x_{i} \in R_{2}(j, s)} y_{i}$.
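As a minimal sketch, the loss for a single split point can be computed as below. The assumptions here are that $x$ takes the values 1 to 10 (inferred from the candidate split points) and that $R_1$ holds the samples with $x \le s$ while $R_2$ holds those with $x > s$. The name `split_loss` is hypothetical.

```python
# Training data from the example; x = 1..10 is an assumption
# inferred from the candidate split points 1.5, ..., 9.5.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Y = [5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9, 9.05]


def split_loss(xs, ys, s):
    """Return L(s): squared error of each region around its own mean."""
    left = [y for x, y in zip(xs, ys) if x <= s]   # R1 (assumed x <= s)
    right = [y for x, y in zip(xs, ys) if x > s]   # R2 (assumed x > s)
    loss = 0.0
    for region in (left, right):
        if region:  # an empty region contributes nothing
            c_hat = sum(region) / len(region)      # region output value
            loss += sum((y - c_hat) ** 2 for y in region)
    return loss


print(round(split_loss(X, Y, 1.5), 2))
```

Because there is only one feature, the index $j$ is dropped and the loss is written as a function of $s$ alone.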
a. Compute the output values of the subregions
When $s=1.5$, the two subregions are $R_{1}=\{1\}$ and $R_{2}=\{2,3,4,5,6,7,8,9,10\}$, so $c_{1}=5.56$ and

$$c_{2}=\frac{1}{9}(5.7+5.91+6.4+6.8+7.05+8.9+8.7+9+9.05)=7.5$$
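The output values for the other split points are obtained in the same way and are listed below.

| $s$ | 1.5 | 2.5 | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 8.5 | 9.5 |
|---|---|---|---|---|---|---|---|---|---|
| $c_{1}$ | 5.56 | 5.63 | 5.72 | 5.89 | 6.07 | 6.24 | 6.62 | 6.88 | 7.11 |
| $c_{2}$ | 7.5 | 7.73 | 7.99 | 8.25 | 8.54 | 8.91 | 8.92 | 9.03 | 9.05 |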
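The region output values can be reproduced with a short script. This is a sketch that assumes $x = 1, \dots, 10$ and the convention $R_1 = \{x \le s\}$, $R_2 = \{x > s\}$. The helper name `region_means` is hypothetical.

```python
X = list(range(1, 11))  # assumed: x takes the values 1..10
Y = [5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9, 9.05]


def region_means(xs, ys, s):
    """Return (c1, c2), the mean of y on each side of split point s."""
    left = [y for x, y in zip(xs, ys) if x <= s]
    right = [y for x, y in zip(xs, ys) if x > s]
    return sum(left) / len(left), sum(right) / len(right)


# Candidate split points: midpoints between adjacent x values.
splits = [(a + b) / 2 for a, b in zip(X, X[1:])]
means = {s: region_means(X, Y, s) for s in splits}

for s, (c1, c2) in means.items():
    print(f"s={s}: c1={c1:.2f}, c2={c2:.2f}")
```

The printed values may differ from the table in the last digit because of rounding.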
b. Compute the loss function values and find the optimal split point
When $s=1.5$,

$$\begin{aligned} L(1.5) &=(5.56-5.56)^{2}+\left[(5.7-7.5)^{2}+(5.91-7.5)^{2}+\cdots+(9.05-7.5)^{2}\right] \\ &=0+15.72 \\ &=15.72 \end{aligned}$$
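The loss function values for the other split points are computed in the same way and are listed below.

| $s$ | 1.5 | 2.5 | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 8.5 | 9.5 |
|---|---|---|---|---|---|---|---|---|---|
| $L(s)$ | 15.72 | 12.07 | 8.36 | 5.78 | 3.91 | 1.93 | 8.01 | 11.73 | 15.74 |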
The loss is clearly smallest at $s=6.5$, so the first split is $(j=x, s=6.5)$.
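The search for the optimal split point amounts to evaluating the loss at every candidate and taking the minimum. A minimal sketch follows, assuming $x = 1, \dots, 10$ and $R_1 = \{x \le s\}$. The names `sse` and `best_split` are hypothetical.

```python
X = list(range(1, 11))  # assumed: x takes the values 1..10
Y = [5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9, 9.05]


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (s, L(s)) for the candidate split with the smallest loss."""
    distinct = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def loss(s):
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        return sse(left) + sse(right)

    s = min(candidates, key=loss)
    return s, loss(s)


s, value = best_split(X, Y)
print(s, round(value, 2))
```

An exhaustive scan is cheap here because a feature with $n$ distinct values has only $n-1$ candidate split points.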
(2) Use the chosen pair $(j, s)$ to partition the input space and determine the corresponding output values:
The regions are $R_{1}=\{1,2,3,4,5,6\}$ and $R_{2}=\{7,8,9,10\}$.
The corresponding output values are $c_{1}=6.24$ and $c_{2}=8.91$.
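After this first split, the tree is a piecewise-constant function that can be written with the indicator function as $f(x) = c_1 I(x \in R_1) + c_2 I(x \in R_2)$. The sketch below illustrates that form. The function names and the convention $R_1 = \{x \le 6.5\}$ are assumptions made for illustration.

```python
def indicator(condition):
    """I(.) from the text: 1 if the condition holds, otherwise 0."""
    return 1 if condition else 0


def predict(x, s=6.5, c1=6.24, c2=8.91):
    """Depth-1 regression tree: output c1 on R1 (x <= s), c2 on R2 (x > s)."""
    return c1 * indicator(x <= s) + c2 * indicator(x > s)


print(predict(3))  # x = 3 falls in R1
print(predict(8))  # x = 8 falls in R2
```

Exactly one indicator is nonzero for any input, so the sum always selects a single region's output value.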
(3) Repeat steps (1) and (2) to continue splitting:
For $R_{1}$, take the split points $\{1.5, 2.5, 3.5, 4.5, 5.5\}$. The resulting output values are

| $s$ | 1.5 | 2.5 | 3.5 | 4.5 | 5.5 |
|---|---|---|---|---|---|
| $c_{1}$ | 5.56 | 5.63 | 5.72 | 5.89 | 6.07 |
| $c_{2}$ | 6.37 | 6.54 | 6.75 | 6.93 | 7.05 |
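Repeating steps (1) and (2) on each subregion is naturally expressed as recursion. The following sketch builds a nested dictionary. The stopping rule (`max_depth`), the dictionary layout, and the convention $R_1 = \{x \le s\}$ are assumptions made for illustration and are not taken from the text.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(xs, ys, max_depth):
    """Recursively split on the single feature x until max_depth is reached."""
    distinct = sorted(set(xs))
    if max_depth == 0 or len(distinct) < 2:
        return {"value": sum(ys) / len(ys)}  # leaf: region output value

    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def loss(s):
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        return sse(left) + sse(right)

    s = min(candidates, key=loss)  # step (1): best split point
    left = [(x, y) for x, y in zip(xs, ys) if x <= s]    # step (2): R1
    right = [(x, y) for x, y in zip(xs, ys) if x > s]    # step (2): R2
    return {
        "split": s,
        "left": build_tree([x for x, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([x for x, _ in right], [y for _, y in right], max_depth - 1),
    }


X = list(range(1, 11))  # assumed: x takes the values 1..10
Y = [5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9, 9.05]
tree = build_tree(X, Y, max_depth=2)
print(tree["split"])
```

A depth limit is only one possible stopping rule. A minimum number of samples per region or a loss threshold would work within the same structure.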
The loss function values $L(s)$ at $s = 1.5, 2.5, 3.5, 4.5, 5.5$ are, in order:
1.3087
0.754