$\begin{gather} C =\frac{\partial L}{\partial i^3} = \left[ \begin{matrix} \frac{\partial L}{\partial i^3_1}\\ \frac{\partial L}{\partial i^3_2}\\ \frac{\partial L}{\partial i^3_3}\\ \end{matrix} \right] =\left[ \begin{matrix} \sum_{j=1}^3\frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial i^3_1}\\ \sum_{j=1}^3\frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial i^3_2}\\ \sum_{j=1}^3\frac{\partial L}{\partial f_j}\frac{\partial f_j}{\partial i^3_3}\\ \end{matrix} \right] =B^T\cdot A \label{}\end{gather}$
接下来就是求$L$对参数$w^3$的梯度,但是它们中间有个$i^3$隔着,所以要先求$i^3$对$w$的导数。$i^3$是向量,$w^3$是矩阵,如果一一对应求导就变成了一个三维的张量(不知道有没有雅可比张量的说法)。但是可以发现,每个$w^3_{jk}$($j$行$k$列)只参与$i^3_j$的计算,所以每个$w^3_{jk}$只需求其对应的$i^3_j$的导数即可,求出的是一个矩阵(这一步在代码中并不需要计算,只是为了容易理解):
$\begin{gather}\displaystyle \frac{\partial i^3}{\partial w^3} =\left[ \begin{matrix} \frac{\partial i_1^3}{\partial w_{11}^3}&\cdots&\frac{\partial i_1^3}{\partial w_{16}^3}\\ \frac{\partial i_2^3}{\partial w_{21}^3}&\cdots&\frac{\partial i_2^3}{\partial w_{26}^3}\\ \frac{\partial i_3^3}{\partial w_{31}^3}&\cdots&\frac{\partial i_3^3}{\partial w_{36}^3}\\ \end{matrix} \right] =\left[ \begin{matrix} g_1&\cdots &g_6\\ g_1&\cdots &g_6\\ g_1&\cdots &g_6\\ \end{matrix} \right] =\left[ \begin{matrix} g_1^T\\ g_1^T\\ g_1^T\\ \end{matrix} \right] \label{}\end{gather}$
于是$L$对$w^3$的梯度就是($\times$表示按元素进行的乘法):
$\begin{gather}\displaystyle \frac{\partial L}{\partial w^3} =\left[ \begin{matrix} \frac{\partial L}{\partial i_1^3}\frac{\partial i_1^3}{\partial w_{11}^3}&\cdots&\frac{\partial L}{\partial i_1^3}\frac{\partial i_1^3}{\partial w_{16}^3}\\ \frac{\partial L}{\partial i_2^3}\frac{\partial i_2^3}{\partial w_{21}^3}&\cdots&\frac{\partial L}{\partial i_2^3}\frac{\partial i_2^3}{\partial w_{26}^3}\\ \frac{\partial L}{\partial i_3^3}\frac{\partial i_3^3}{\partial w_{31}^3}&\cdots&\frac{\partial L}{\partial i_3^3}\frac{\partial i_3^3}{\partial w_{36}^3}\\ \end{matrix} \right] =\left[ CCCCCC \right] \times \left[ \begin{matrix} g^T\\ g^T\\ g^T\\ \end{matrix} \right] =C\cdot g^T \label{}\end{gather}$
接下来求$L$关于$b^3$的导数,因为$b^3_k$的只参与$i^3_k$的计算,且导数为1,所以很简单:
$\begin{gather} \displaystyle \frac{\partial L}{\partial b^3} =\left[ \begin{matrix} \frac{\partial L}{\partial i^3_1}\frac{\partial i^3_1}{\partial b^3_1}\\ \frac{\partial L}{\partial i^3_2}\frac{\partial i^3_2}{\partial b^3_2}\\ \frac{\partial L}{\partial i^3_3}\frac{\partial i^3_3}{\partial b^3_3}\\ \end{matrix} \right] =\frac{\partial L}{\partial i^3}=C \label{}\end{gather}$
然后就要向前传播到隐层了,要算出$L$对$g$的梯度,首先要算出$i^3$对$g$的导数(这一步在代码中并不需要计算,只是为了容易理解):
$\begin{gather} \displaystyle \frac{\partial i^3}{\partial g} =\left[ \begin{matrix} \frac{\partial i_1^3}{\partial g_1}&\cdots&\frac{\partial i_1^3}{\partial g_6}\\ \frac{\partial i_2^3}{\partial g_1}&\cdots&\frac{\partial i_2^3}{\partial g_6}\\ \frac{\partial i_3^3}{\partial g_1}&\cdots&\frac{\partial i_3^3}{\partial g_6}\\ \end{matrix} \right] =w^3 \label{}\end{gather}$
分析可以发现,每个$g_j$都参与了所有$i^3_k$的计算,所以$L$对$g_j$的导数由多元函数链式法则可得:
$\displaystyle D =\frac{\partial L}{\partial g} =\left[ \begin{matrix} \frac{\partial L}{\partial g_1}\\ \vdots\\ \frac{\partial L}{\partial g_6}\\ \end{matrix} \right] =\left[ \begin{matrix} \frac{\partial L}{\partial i_1^3}\frac{\partial i_1^3}{\partial g_1} +\frac{\partial L}{\partial i_2^3}\frac{\partial i_2^3}{\partial g_1} +\frac{\partial L}{\partial i_3^3}\frac{\partial i_3^3}{\partial g_1}\\ \vdots\\ \frac{\partial L}{\partial i_1^3}\frac{\partial i_1^3}{\partial g_6} +\frac{\partial L}{\partial i_2^3}\frac{\partial i_2^3}{\partial g_6} +\frac{\partial L}{\partial i_3^3}\frac{\partial i_3^3}{\partial g_6}\\ \end{matrix} \right] ={w^3}^T\cdot C $