A Very Detailed Derivation of the Softmax Backpropagation Gradient

Softmax and Its Derivative

Forward Propagation

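To make things easier to follow, first suppose the input and the output each have only 3 components. (If the general case below is hard to follow, plug in this special case.)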

  • Input: the output-layer neurons $Z = [z_1, z_2, z_3]$ and the class label $Y = [y_1, y_2, y_3]$ ($Y$ is a one-hot label: exactly one $y_i$ equals 1 and all the others are 0)
  • Output: $A = \mathrm{softmax}(Z) = [a_1, a_2, a_3]$

More generally, with $n$ neurons (that is, $n$ classes), softmax is defined as:
$$a_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$$
The loss function is the cross-entropy:
$$L = -\sum_{i=1}^n y_i \ln a_i$$
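The forward pass above can be sketched in a few lines of plain Python. This is only a minimal illustration: the input values are made up, and subtracting `max(z)` before exponentiating is a common numerical-stability trick that the article does not mention (it leaves the softmax output unchanged).

```python
import math

def softmax(z):
    """Return a_i = exp(z_i) / sum_j exp(z_j) for a list of floats."""
    # Assumption: shift by max(z) to avoid overflow; the result is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(a, y):
    """Return L = -sum_i y_i * ln(a_i); terms with y_i == 0 contribute nothing."""
    return -sum(yi * math.log(ai) for ai, yi in zip(a, y) if yi != 0)

z = [1.0, 2.0, 3.0]  # illustrative inputs, not taken from the article
y = [0, 0, 1]        # one-hot label: the third class is the correct one
a = softmax(z)
loss = cross_entropy(a, y)
print("A =", a)
print("L =", loss)
```

With a one-hot label, the loss reduces to $-\ln a_k$ for the correct class $k$, which is why only one term survives later in the derivation.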

Differentiation

The result first: in vector form, $\frac{\partial L}{\partial Z} = A - Y$.

Proof: assume the $k$-th neuron corresponds to the correct label, i.e. in $Y = [y_1, y_2, \dots, y_n]$ we have $y_k = 1$ and every other $y_i$ is 0.

First, differentiate $L$ with respect to $A$:
$$\frac{\partial L}{\partial a_i} = \frac{\partial}{\partial a_i}\left(-\sum_{j=1}^n y_j \ln a_j\right) = -\frac{y_i}{a_i}$$

Next, differentiate $L$ with respect to $Z$. Note that in the forward pass every $a_i$ is computed from all of the $z_j$ (look at the denominator of the softmax formula, which is a sum), so the chain rule must sum over every $a_j$:
$$\frac{\partial L}{\partial z_i} = \sum_{j=1}^n \frac{\partial L}{\partial a_j} \cdot \frac{\partial a_j}{\partial z_i} = \sum_{j=1}^n -\frac{y_j}{a_j} \cdot \frac{\partial a_j}{\partial z_i}$$

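To stress this once more: every $a_i$ is computed from all of the $z_j$. In other words, every $z_j$ appears in every $a_i$, so taking the partial derivative of $L$ with respect to $z_j$ requires first taking the partial derivative with respect to every $a_i$.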

Since we assumed $y_k = 1$ and every other $y_i$ is 0, only $\frac{\partial L}{\partial a_k}$ is nonzero and all the other terms vanish, so the expression above simplifies to:
$$\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_i}$$
The key step is therefore to find $\frac{\partial a_k}{\partial z_i}$, which splits into two cases.

  • If $i = k$:
    $$\begin{aligned}
    \frac{\partial a_k}{\partial z_i} = \frac{\partial a_k}{\partial z_k}
    &= \frac{\partial}{\partial z_k}\left(\frac{e^{z_k}}{\sum_{j=1}^n e^{z_j}}\right)
    = \frac{e^{z_k}\left(\sum_{j=1}^n e^{z_j}\right) - \left(e^{z_k}\right)^2}{\left(\sum_{j=1}^n e^{z_j}\right)^2} \\
    &= \frac{e^{z_k}}{\sum_{j=1}^n e^{z_j}} - \left(\frac{e^{z_k}}{\sum_{j=1}^n e^{z_j}}\right)^2 \\
    &= a_k - a_k^2 = a_k(1 - a_k)
    \end{aligned}$$

  • If $i \neq k$:
    $$\begin{aligned}
    \frac{\partial a_k}{\partial z_i}
    &= \frac{\partial}{\partial z_i}\left(\frac{e^{z_k}}{\sum_{j=1}^n e^{z_j}}\right)
    = \frac{-e^{z_k} \cdot e^{z_i}}{\left(\sum_{j=1}^n e^{z_j}\right)^2} \\
    &= -\frac{e^{z_k}}{\sum_{j=1}^n e^{z_j}} \cdot \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} \\
    &= -a_k a_i
    \end{aligned}$$
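The two cases can be checked numerically. The sketch below builds the matrix of partial derivatives from the two formulas and compares it with a central finite-difference estimate; the helper names (`softmax_jacobian`, `numeric_jacobian`), the test input, and the step size `eps` are illustrative choices, not part of the original derivation.

```python
import math

def softmax(z):
    # Assumption: shift by max(z) for numerical stability; output is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(a):
    """J[k][i] = d a_k / d z_i: a_k * (1 - a_k) if i == k, else -a_k * a_i."""
    n = len(a)
    return [[a[k] * (1.0 - a[k]) if i == k else -a[k] * a[i]
             for i in range(n)]
            for k in range(n)]

def numeric_jacobian(z, eps=1e-6):
    """Central-difference estimate of the same matrix of partial derivatives."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        z_plus = list(z)
        z_plus[i] += eps
        z_minus = list(z)
        z_minus[i] -= eps
        a_plus, a_minus = softmax(z_plus), softmax(z_minus)
        for k in range(n):
            J[k][i] = (a_plus[k] - a_minus[k]) / (2 * eps)
    return J

z = [0.5, -1.0, 2.0]  # illustrative inputs
a = softmax(z)
J = softmax_jacobian(a)
J_num = numeric_jacobian(z)
max_err = max(abs(J[k][i] - J_num[k][i]) for k in range(3) for i in range(3))
print("largest difference between analytic and numeric Jacobian:", max_err)
```

Both cases can also be written as the single expression $a_k(\delta_{ik} - a_i)$, where $\delta_{ik}$ is 1 when $i = k$ and 0 otherwise.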

Combining the expressions for $\frac{\partial L}{\partial a_i}$ and $\frac{\partial a_i}{\partial z_j}$ obtained above, we can write out the derivative of $L$ with respect to the whole vector $Z$:

$$\begin{aligned}
\frac{\partial L}{\partial Z}
= \begin{bmatrix} \frac{\partial L}{\partial z_1} \\ \vdots \\ \frac{\partial L}{\partial z_k} \\ \vdots \\ \frac{\partial L}{\partial z_n} \end{bmatrix}
&= \begin{bmatrix} \sum_{i=1}^n \frac{\partial L}{\partial a_i} \cdot \frac{\partial a_i}{\partial z_1} \\ \vdots \\ \sum_{i=1}^n \frac{\partial L}{\partial a_i} \cdot \frac{\partial a_i}{\partial z_k} \\ \vdots \\ \sum_{i=1}^n \frac{\partial L}{\partial a_i} \cdot \frac{\partial a_i}{\partial z_n} \end{bmatrix}
= \begin{bmatrix} \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_1} \\ \vdots \\ \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_k} \\ \vdots \\ \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_n} \end{bmatrix} \\
&= \begin{bmatrix} -\frac{1}{a_k} \cdot (-a_k a_1) \\ \vdots \\ -\frac{1}{a_k} \cdot a_k(1 - a_k) \\ \vdots \\ -\frac{1}{a_k} \cdot (-a_k a_n) \end{bmatrix}
= \begin{bmatrix} a_1 \\ \vdots \\ a_k - 1 \\ \vdots \\ a_n \end{bmatrix}
\end{aligned}$$

We also know that $y_k = 1$ and every other $y_i$ equals 0, so the expression above can be rewritten as
$$\frac{\partial L}{\partial Z} = \begin{bmatrix} a_1 - y_1 \\ \vdots \\ a_k - y_k \\ \vdots \\ a_n - y_n \end{bmatrix} = A - Y$$
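As a sanity check on the final result, the sketch below computes the gradient three ways: by the full chain-rule sum over all $a_j$, by the closed form $A - Y$, and by central finite differences on the loss. The function names, the test input, and the step size are hypothetical choices made for this illustration.

```python
import math

def softmax(z):
    # Assumption: shift by max(z) for numerical stability; output is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross-entropy of softmax(z) against the one-hot label y."""
    a = softmax(z)
    return -sum(yi * math.log(ai) for ai, yi in zip(a, y) if yi != 0)

def grad_chain_rule(z, y):
    """dL/dz_i = sum_j (-y_j / a_j) * (d a_j / d z_i), term by term."""
    a = softmax(z)
    n = len(z)
    grad = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            da_j_dz_i = a[j] * (1.0 - a[j]) if i == j else -a[j] * a[i]
            total += (-y[j] / a[j]) * da_j_dz_i
        grad.append(total)
    return grad

def grad_closed_form(z, y):
    """The result of the derivation: dL/dZ = A - Y."""
    return [ai - yi for ai, yi in zip(softmax(z), y)]

def grad_numeric(z, y, eps=1e-6):
    """Central finite differences on the loss, one coordinate at a time."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_plus[i] += eps
        z_minus = list(z)
        z_minus[i] -= eps
        grad.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * eps))
    return grad

z = [1.0, 2.0, 3.0]  # illustrative inputs
y = [0, 1, 0]        # one-hot label: the second class is the correct one
g_chain = grad_chain_rule(z, y)
g_closed = grad_closed_form(z, y)
g_num = grad_numeric(z, y)
print("chain rule :", g_chain)
print("A - Y      :", g_closed)
print("numeric    :", g_num)
```

The three vectors should agree up to floating-point and finite-difference error, and the entries of $A - Y$ sum to zero because both $A$ and $Y$ sum to 1.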

This completes the proof.


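Copyright notice: this is an original article by weixin_43217928, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.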