Neural Network Derivatives (2)

First, compute the derivative in scalar form, i.e., the partial derivative of the \(i\)-th output with respect to the \(j\)-th input:
\[ \frac{\partial y_i}{\partial z_j} = \frac{\partial\, \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}}{\partial z_j} \]
Here, differentiating \(e^{z_i}\) with respect to \(z_j\) requires a case split:
\[ \frac{\partial e^{z_i}}{\partial z_j} = \begin{cases} e^{z_i}, & \text{if} \;\;\; i = j \\[1ex] 0, & \text{if} \;\;\; i \neq j \end{cases} \]
Then, when \(i = j\):
\[ \frac{\partial y_i}{\partial z_j} = \frac{e^{z_i} \sum_{k=1}^Ce^{z_k} - e^{z_i}e^{z_j}}{\left(\sum_{k=1}^C e^{z_k}\right)^2} = \frac{e^{z_i}}{\sum_{k=1}^C e^{z_k}} - \frac{e^{z_i}}{\sum_{k=1}^C e^{z_k}} \frac{e^{z_j}}{\sum_{k=1}^C e^{z_k}} =y_i - y_i y_j \tag{1.1} \]
When \(i \neq j\):
\[ \frac{\partial y_i}{\partial z_j} = \frac{0 - e^{z_i}e^{z_j}}{\left(\sum_{k=1}^C e^{z_k}\right)^2} = -y_iy_j \tag{1.2} \]
Combining the two cases:
\[ \frac{\partial y_i}{\partial z_j} = \mathbf{\large1} \{i=j\}\, y_i - y_i\,y_j \tag{1.3} \]
where \(\mathbf{\large 1} \{i=j\} = \begin{cases}1, & \text{if} \;\;\; i = j \\0, & \text{if} \;\;\; i \neq j\end{cases}\)
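As a quick sanity check (a minimal NumPy sketch of my own, not part of the original derivation), equation \((1.3)\) can be compared against a central finite difference:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max(z) for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = softmax(z)
eps = 1e-6

# Equation (1.3): dy_i/dz_j = 1{i=j} * y_i - y_i * y_j
for i in range(5):
    for j in range(5):
        dz = np.zeros_like(z)
        dz[j] = eps
        numeric = (softmax(z + dz)[i] - softmax(z - dz)[i]) / (2 * eps)
        analytic = (i == j) * y[i] - y[i] * y[j]
        assert np.isclose(numeric, analytic, atol=1e-6)
print("equation (1.3) matches finite differences")
```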


When the input to the \(\text{softmax}\) function is a \(K\)-dimensional vector \(\mathbf{z} = [z_1, z_2, \dots, z_K]^\text{T}\), it defines a mapping \(\mathbb{R}^K \rightarrow \mathbb{R}^K\):
\[ \mathbf{y} = \text{softmax}(\mathbf{z}) = \frac{1}{\sum_{k=1}^K e^{z_k}} \begin{bmatrix} e^{z_1} \\ e^{z_2} \\ \vdots \\ e^{z_K} \end{bmatrix} \]
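In code, this vector form is usually implemented with the max-subtraction trick; a minimal sketch (my own illustration, not from the original text):

```python
import numpy as np

def softmax(z):
    # exp(z - max(z)) avoids overflow; the shift cancels in the ratio,
    # so the output equals exp(z) / sum(exp(z)) exactly.
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([1.0, 2.0, 3.0]))
print(y)        # [0.09003057 0.24472847 0.66524096]
print(y.sum())  # 1.0 -- softmax outputs a probability distribution
```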
Its derivative is again a Jacobian matrix (using equations \((1.1)\) and \((1.2)\)):
\[ \begin{align*} \frac{\partial\, \mathbf{y}}{\partial\, \mathbf{z}} & = \begin {bmatrix} \frac{\partial y_1}{\partial z_1} & \frac{\partial y_1}{\partial z_2} & \cdots & \frac{\partial y_1}{\partial z_K} \\ \frac{\partial y_2}{\partial z_1} & \frac{\partial y_2}{\partial z_2} & \cdots & \frac{\partial y_2}{\partial z_K} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_K}{\partial z_1} &\frac{\partial y_K}{\partial z_2} & \cdots & \frac{\partial y_K}{\partial z_K} \end {bmatrix} \\[2ex] & = \begin {bmatrix} \small{y_1 - y_1 y_1} & \small{-y_1y_2} & \cdots & \small{-y_1 y_K} \\ \small{-y_2y_1} & \small{y_2 - y_2 y_2} & \cdots & \small{-y_2 y_K} \\ \vdots & \vdots & \ddots & \vdots \\ \small{-y_Ky_1} & \small{-y_K y_2} & \cdots & \small{y_K - y_K y_K} \end {bmatrix} \\[2.5ex] & = \text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^\text{T} \\[0.5ex] &= \text{diag}(\text{softmax}(\mathbf{z})) - \text{softmax}(\mathbf{z})\, \text{softmax}(\mathbf{z})^\text{T} \end{align*} \]
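The closed form \(\text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^\text{T}\) is easy to build directly. The sketch below (my own NumPy illustration) also checks two properties that follow from the derivation: each row sums to zero, because the outputs always sum to 1, and the matrix is symmetric.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # diag(y) - y y^T, the closed form derived above
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

J = softmax_jacobian(np.array([0.5, -1.0, 2.0, 0.0]))
print(np.allclose(J.sum(axis=1), 0.0))  # True: rows sum to 0
print(np.allclose(J, J.T))              # True: the Jacobian is symmetric
```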




Cross-Entropy Loss Function

The cross-entropy loss can be written in two equivalent forms (a numerical check of their agreement follows below). Let the true label be \(y\) and the prediction be \(a\):

(1) If \(y\) is a scalar class index, i.e., \(y \in \{1, \dots, k\}\), the cross-entropy loss is:
\[ \mathcal{L}(y, a) = - \sum\limits_{j=1}^{k} \mathbf{\large 1}\{y = j\}\, \text{log}\, a_j \]
(2) If \(y\) is a one-hot vector, i.e., \(y = \left[0,0,\dots,1,\dots,0\right]^\text{T} \in \mathbb{R}^k\), the cross-entropy loss is:
\[ \mathcal{L}(y, a) = -\sum\limits_{j=1}^k y_j\, \text{log}\, a_j \]
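The two forms compute the same number: the indicator \(\mathbf{1}\{y = j\}\) selects exactly the term that the one-hot coordinates keep. A small numerical check (my own NumPy sketch):

```python
import numpy as np

a = np.array([0.1, 0.7, 0.2])         # predicted distribution
y_index = 1                           # form (1): label as a class index
y_onehot = np.array([0.0, 1.0, 0.0])  # form (2): label as a one-hot vector

loss_index = -np.log(a[y_index])
loss_onehot = -np.sum(y_onehot * np.log(a))
print(np.isclose(loss_index, loss_onehot))  # True: both forms agree
```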




Cross-Entropy Loss + Sigmoid Activation

Given \(\mathcal{L}(y, a) = -\sum\limits_{j=1}^k y_j\, \text{log}\, a_j\) and \(a_j = \sigma(z_j) = \frac{1}{1+e^{\,-z_j}}\), find \(\frac{\partial \mathcal{L}}{\partial z_j}\):
\[ \frac{\partial \mathcal{L}}{\partial z_j} = \frac{\partial \mathcal{L}}{\partial a_j} \frac{\partial a_j}{\partial z_j} = -y_j \frac{1}{\sigma(z_j)} \sigma(z_j) (1 - \sigma(z_j)) = -y_j\,(1 - \sigma(z_j)) \]
For the true class, \(y_j = 1\), so this reduces to \(\sigma(z_j) - 1 = a_j - y_j\); for every other class, \(y_j = 0\) and the derivative vanishes.
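A finite-difference check (a NumPy sketch of my own, assuming a one-hot \(y\)) confirms both the general form \(-y_j(1 - a_j)\) and its reduction to \(a_j - y_j\) at the true class:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.3, -1.2, 2.0])
y = np.array([0.0, 1.0, 0.0])  # one-hot label; the true class is j = 1
a = sigmoid(z)

def loss(zz):
    return -np.sum(y * np.log(sigmoid(zz)))

# Analytic gradient from the derivation: -y_j * (1 - a_j)
analytic = -y * (1.0 - a)

eps = 1e-6
numeric = np.zeros_like(z)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    numeric[j] = (loss(z + dz) - loss(z - dz)) / (2 * eps)

print(np.allclose(numeric, analytic))        # True
print(np.isclose(analytic[1], a[1] - y[1]))  # True: a_j - y_j at the true class
```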




Cross-Entropy Loss + Softmax Activation

Given \(\mathcal{L}(y, a) = -\sum\limits_{i=1}^k y_i\, \text{log}\, a_i\) and \(a_j = \text{softmax}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{c=1}^k e^{z_c}}\), find \(\frac{\partial \mathcal{L}}{\partial z_j}\):
\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial z_j} = \sum\limits_{i=1}^k\frac{\partial \mathcal{L}}{\partial a_i} \frac{\partial a_i}{\partial z_j} & = \frac{\partial \mathcal{L}}{\partial a_j} \frac{\partial a_j}{\partial z_j} + \sum\limits_{i \neq j} \frac{\partial \mathcal{L}}{\partial a_i} \frac{\partial a_i}{\partial z_j} \\ & = -\frac{y_j}{a_j} \frac{\partial a_j}{\partial z_j} - \sum\limits_{i \neq j} \frac{y_i}{a_i}\frac{\partial a_i}{\partial z_j} \\ & = -\frac{y_j}{a_j} a_j(1 - a_j) + \sum\limits_{i \neq j} \frac{y_i}{a_i} a_i a_j \qquad\qquad \text{using (1.1) and (1.2)} \\ & = -y_j + y_ja_j + \sum\limits_{i \neq j} y_i a_j \\ & = a_j \sum\limits_{i=1}^k y_i - y_j \\ & = a_j - y_j \end{align*} \]
where the last step uses \(\sum_{i=1}^k y_i = 1\), which holds for a one-hot label (or any label that is a probability vector).
If the input is a \(k\)-dimensional vector \(\mathbf{z} = [z_1, z_2, \dots, z_k]^\text{T}\), the gradient with respect to \(\mathbf{z}\) is:
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{a} - \mathbf{y} = \begin{bmatrix} a_1 - 0 \\ \vdots \\ a_j - 1 \\ \vdots \\ a_k - 0 \end{bmatrix} \]
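This clean gradient is one reason the softmax + cross-entropy pairing is so common: the composed derivative is simply \(\mathbf{a} - \mathbf{y}\). A finite-difference check (my own NumPy sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(1)
z = rng.normal(size=6)
y = np.zeros(6)
y[2] = 1.0                  # one-hot label

analytic = softmax(z) - y   # closed form: a - y

eps = 1e-6
numeric = np.zeros_like(z)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    numeric[j] = (cross_entropy(z + dz, y) - cross_entropy(z - dz, y)) / (2 * eps)

print(np.allclose(numeric, analytic))  # True
```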

