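With the gradient of $L$ with respect to $g$ in hand, we can move on to the hidden-layer parameters. We still go one step at a time, starting with the gradient of $g$ with respect to $i^2$. The hidden layer's activation function is ReLU, which acts element-wise, so each $g_k$ depends only on its own $i^2_k$ and is differentiated only with respect to it. The derivative is simple: just check the sign of the input (ReLU is not differentiable at 0, so the value there is a convention; here it is taken to be 1):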
$\displaystyle E =\frac{\partial g}{\partial i^2} =\left[ \begin{matrix} \frac{\partial g_1}{\partial i_1^2}\\ \vdots\\ \frac{\partial g_6}{\partial i_6^2}\\ \end{matrix} \right] =\left[ \begin{matrix} \delta'(i_1^2)\\ \vdots\\ \delta'(i_6^2)\\ \end{matrix} \right] ,\;\;\; \delta'(x) = \left\{ \begin{matrix} 1,x\ge0\\ 0,x<0 \end{matrix} \right. $
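As a minimal sketch of the derivative vector $E$, the ReLU derivative can be evaluated one element at a time. The pre-activation values in `i2` are made up for illustration, and `relu_grad` is a hypothetical helper name, not something from the original derivation.

```python
def relu_grad(pre_activations):
    """Element-wise ReLU derivative, using the text's convention:
    1 where x >= 0, 0 where x < 0."""
    return [1.0 if x >= 0 else 0.0 for x in pre_activations]


# Hypothetical pre-activations i^2 of the 6 hidden units.
i2 = [0.5, -1.2, 0.0, 3.1, -0.4, 2.0]

# E has one entry per hidden unit, exactly like the column vector above.
E = relu_grad(i2)
print(E)
```

Each entry of `E` only looks at the matching entry of `i2`, which is why $E$ is a vector rather than a full Jacobian matrix.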
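Next comes the gradient of $L$ with respect to $i^2$. It differs from equation $(1)$, the gradient of $L$ with respect to $i^3$, because the activation function is computed differently. Since ReLU acts element-wise, the chain rule reduces to an element-wise product of $E$ with $D$, the gradient of $L$ with respect to $g$: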
$\displaystyle F =\frac{\partial L}{\partial i^2} =\left[ \begin{matrix} \frac{\partial L}{\partial i_1^2}\\ \vdots\\ \frac{\partial L}{\partial i_6^2}\\ \end{matrix} \right] =\left[ \begin{matrix} \frac{\partial L}{\partial g_1}\frac{\partial g_1}{\partial i_1^2}\\ \vdots\\ \frac{\partial L}{\partial g_6}\frac{\partial g_6}{\partial i_6^2}\\ \end{matrix} \right] = E\times D $
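The product $E\times D$ here is element-wise, not a matrix product. A small sketch, assuming made-up values for $D$ (the gradient of $L$ with respect to $g$) and a hypothetical helper `hadamard`:

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


# E: ReLU derivative per hidden unit (1 where the unit was active).
E = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
# D: hypothetical values of dL/dg, one per hidden unit.
D = [0.2, -0.7, 0.1, -0.3, 0.9, 0.4]

# F = dL/di^2: the upstream gradient passes through active units
# and is zeroed for inactive ones.
F = hadamard(E, D)
print(F)
```

In effect $E$ works as a mask: wherever a hidden unit was switched off by ReLU, no gradient flows back through it.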
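To keep things easy to follow, before computing the gradient of $L$ with respect to $w^2$ we first compute the derivative of $i^2$ with respect to $w^2$. As with equation $(2)$, the result is a matrix with 6 rows and 5 columns: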
$\displaystyle \frac{\partial i^2}{\partial w^2} =\left[ \begin{matrix} \frac{\partial i_1^2}{\partial w_{11}^2}&\cdots&\frac{\partial i_1^2}{\partial w_{15}^2}\\ \vdots&\vdots&\vdots\\ \frac{\partial i_6^2}{\partial w_{61}^2}&\cdots&\frac{\partial i_6^2}{\partial w_{65}^2}\\ \end{matrix} \right] =\left[ \begin{matrix} h_1&\cdots&h_5\\ &\vdots&\\ h_1&\cdots&h_5\\ \end{matrix} \right] =\left.\left[ \begin{matrix} h^T\\ \vdots\\ h^T\\ \end{matrix} \right]\right\} 6\ \text{rows} $
As with equation $(3)$, the gradient of $L$ with respect to $w^2$ is then:
$\displaystyle \frac{\partial L}{\partial w^2} =\left[ \begin{matrix} \frac{\partial L}{\partial i_1^2}\frac{\partial i_1^2}{\partial w_{11}^2}&\cdots&\frac{\partial L}{\partial i_1^2}\frac{\partial i_1^2}{\partial w_{15}^2}\\ \vdots&\vdots&\vdots\\ \frac{\partial L}{\partial i_6^2}\frac{\partial i_6^2}{\partial w_{61}^2}&\cdots&\frac{\partial L}{\partial i_6^2}\frac{\partial i_6^2}{\partial w_{65}^2}\\ \end{matrix} \right] = \left[ \begin{matrix} F&F&F&F&F \end{matrix} \right]\times \left.\left[ \begin{matrix} h^T\\ \vdots\\ h^T\\ \end{matrix} \right]\right\}6\ \text{rows} =F\cdot h^T $
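The last equality says that the element-wise product of the two $6\times5$ matrices is the same as the outer product $F\cdot h^T$. The sketch below checks this on made-up numbers; `F`, `h`, and the helper `outer` are illustrative assumptions only.

```python
def outer(col, row):
    """Outer product col . row^T as a list of rows."""
    return [[c * r for r in row] for c in col]


# Hypothetical dL/di^2 (6 hidden units) and hidden-layer input h (5 values).
F = [0.2, 0.0, 0.1, -0.3, 0.0, 0.4]
h = [1.0, 2.0, 0.0, 0.5, 3.0]

# Short form: dL/dw^2 = F . h^T, a 6 x 5 matrix.
grad_w2 = outer(F, h)

# Long form from the derivation: F repeated in 5 columns, multiplied
# element-wise with h^T stacked in 6 rows.
F_tiled = [[f] * len(h) for f in F]
h_rows = [list(h) for _ in F]
long_form = [
    [a * b for a, b in zip(row_f, row_h)]
    for row_f, row_h in zip(F_tiled, h_rows)
]

same = grad_w2 == long_form
print(len(grad_w2), len(grad_w2[0]), same)
```

The gradient has the same shape as $w^2$ itself, and entry $(j,k)$ is simply $F_j h_k$.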
As with equation $(4)$, the gradient of $L$ with respect to $b^2$ is:
$\displaystyle \frac{\partial L}{\partial b^2}= \frac{\partial L}{\partial i^2}=F $
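One way to gain confidence in the result $\partial L/\partial b^2 = F$ is a finite-difference check. The sketch below assumes a toy loss $L=\sum_k c_k g_k$ (so that $D = c$), which is not the loss used in the text, and all numbers and function names are hypothetical.

```python
def loss_and_preact(w2, b2, h, c):
    """Toy forward pass: i2 = w2 h + b2, g = ReLU(i2), L = sum(c_k * g_k)."""
    i2 = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(w2, b2)]
    g = [max(0.0, v) for v in i2]
    return sum(ck * gk for ck, gk in zip(c, g)), i2


def analytic_grad_b2(w2, b2, h, c):
    """dL/db^2 = F = E x D, with D = c for the toy loss above."""
    _, i2 = loss_and_preact(w2, b2, h, c)
    E = [1.0 if v >= 0 else 0.0 for v in i2]
    return [e * d for e, d in zip(E, c)]


def numeric_grad_b2(w2, b2, h, c, eps=1e-6):
    """Central finite differences, one bias at a time."""
    grads = []
    for k in range(len(b2)):
        plus = list(b2)
        plus[k] += eps
        minus = list(b2)
        minus[k] -= eps
        l_plus, _ = loss_and_preact(w2, plus, h, c)
        l_minus, _ = loss_and_preact(w2, minus, h, c)
        grads.append((l_plus - l_minus) / (2 * eps))
    return grads


# Hypothetical values, chosen so no pre-activation sits at the ReLU kink.
h = [1.0, 2.0, 0.0, 0.5, 3.0]
w2 = [[a] * 5 for a in [0.1, -0.2, 0.3, -0.1, 0.2, -0.3]]
b2 = [0.05] * 6
c = [0.2, -0.7, 0.1, -0.3, 0.9, 0.4]

analytic = analytic_grad_b2(w2, b2, h, c)
numeric = numeric_grad_b2(w2, b2, h, c)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)
```

The two gradients should agree up to floating-point noise. The check is only meaningful away from $i^2_k = 0$, where ReLU has no well-defined derivative.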
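The parameter gradients for the output layer and the hidden layer are now complete. What remains is the input layer. It differs from the hidden layer only in the number of elements; propagation and gradient computation work the same way, so it is enough to repeat the steps from equation $(5)$ onward for the input layer.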
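Since the steps are identical apart from the sizes, they can be folded into one reusable function. The following is only a sketch under stated assumptions: both layers are taken to use ReLU, the network input has 3 values, and the names `relu_layer_backward`, `W1`, `W2`, `x`, and `D` are all hypothetical. The gradient passed down to the previous layer is $W^T F$, which follows from the chain rule applied to $i = Wx + b$.

```python
def affine(W, x, b):
    """Pre-activation i = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bj for row, bj in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def relu_layer_backward(W, x_in, pre_act, grad_out):
    """Backward pass of out = ReLU(W x_in + b).

    grad_out is dL/d(out). Returns (dL/dW, dL/db, dL/dx_in).
    """
    E = [1.0 if v >= 0 else 0.0 for v in pre_act]      # ReLU derivative
    F = [e * d for e, d in zip(E, grad_out)]           # dL/d(pre_act)
    grad_W = [[f * x for x in x_in] for f in F]        # F . x_in^T
    grad_b = list(F)
    # W^T F: the gradient handed to the previous layer.
    grad_in = [
        sum(W[j][k] * F[j] for j in range(len(F))) for k in range(len(x_in))
    ]
    return grad_W, grad_b, grad_in


# Hypothetical sizes: 3 inputs -> 5 units (input layer) -> 6 units (hidden).
x = [1.0, -1.0, 2.0]
W1 = [[0.1 * (r + c + 1) * (-1) ** r for c in range(3)] for r in range(5)]
b1 = [0.0] * 5
W2 = [[0.1 * (-1) ** (r + c) for c in range(5)] for r in range(6)]
b2 = [0.0] * 6

# Forward pass, keeping the pre-activations for the backward pass.
i1 = affine(W1, x, b1)
h = relu(i1)
i2 = affine(W2, h, b2)

# Hypothetical dL/dg arriving from the output layer.
D = [0.2, -0.7, 0.1, -0.3, 0.9, 0.4]

# Same function, two layers: only the shapes differ.
grad_W2, grad_b2, grad_h = relu_layer_backward(W2, h, i2, D)
grad_W1, grad_b1, _ = relu_layer_backward(W1, x, i1, grad_h)
print(len(grad_W2), len(grad_W2[0]), len(grad_W1), len(grad_W1[0]))
```

The second call reuses `grad_h` from the first as its incoming gradient, which is the sense in which the input layer repeats the hidden-layer procedure.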