First, compute the derivative in scalar form, i.e., the partial derivative of the \(i\)-th output with respect to the \(j\)-th input:
\[
\frac{\partial y_i}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}} \right)
\]
where the derivative of \(e^{z_i}\) with respect to \(z_j\) requires a case analysis:
\[
\frac{\partial e^{z_i}}{\partial z_j} =
\begin{cases}
e^{z_i}, & \text{if} \;\;\; i = j \\[1ex]
0, & \text{if} \;\;\; i \neq j
\end{cases}
\]
Then, when \(i = j\), the quotient rule gives:
\[
\frac{\partial y_i}{\partial z_j} = \frac{e^{z_i} \sum_{k=1}^Ce^{z_k} - e^{z_i}e^{z_j}}{\left(\sum_{k=1}^C e^{z_k}\right)^2} = \frac{e^{z_i}}{\sum_{k=1}^C e^{z_k}} - \frac{e^{z_i}}{\sum_{k=1}^C e^{z_k}} \frac{e^{z_j}}{\sum_{k=1}^C e^{z_k}} =y_i - y_i y_j \tag{1.1}
\]
When \(i \neq j\):
\[
\frac{\partial y_i}{\partial z_j} = \frac{0 - e^{z_i}e^{z_j}}{\left(\sum_{k=1}^C e^{z_k}\right)^2} = -y_iy_j \tag{1.2}
\]
Combining the two cases:
\[
\frac{\partial y_i}{\partial z_j} = \mathbf{\large1} \{i=j\}\, y_i - y_i\,y_j \tag{1.3}
\]
where \(\mathbf{\large 1} \{i=j\} = \begin{cases}1, & \text{if} \;\;\; i = j \\0, & \text{if} \;\;\; i \neq j\end{cases}\)
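The combined formula \((1.3)\) is easy to sanity-check numerically. Below is a minimal sketch (assuming NumPy; variable names are illustrative) that compares Eq. \((1.3)\) against a central finite difference:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = softmax(z)
eps = 1e-6

for i in range(5):
    for j in range(5):
        zp, zm = z.copy(), z.copy()
        zp[j] += eps
        zm[j] -= eps
        numeric = (softmax(zp)[i] - softmax(zm)[i]) / (2 * eps)  # central difference
        analytic = (i == j) * y[i] - y[i] * y[j]                 # Eq. (1.3)
        assert abs(numeric - analytic) < 1e-8
```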
When the input to the \(\text{softmax}\) function is a \(K\)-dimensional vector \(\mathbf{z} = [z_1, z_2, \dots, z_K]^\text{T}\), the map has the form \(\mathbb{R}^K \rightarrow \mathbb{R}^K\):
\[
\mathbf{y} = \text{softmax}(\mathbf{z}) = \frac{1}{\sum_{k=1}^K e^{z_k}}
\begin{bmatrix}
e^{z_1} \\
e^{z_2} \\
\vdots \\
e^{z_K}
\end{bmatrix}
\]
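In code, the vector form is usually computed with the maximum subtracted from every component before exponentiation, which leaves the result unchanged but avoids overflow. A minimal sketch, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Map a K-dimensional vector z to the probability vector y = softmax(z)."""
    shifted = z - np.max(z)   # softmax(z) == softmax(z - c) for any constant c
    e = np.exp(shifted)
    return e / e.sum()

y = softmax(np.array([1.0, 2.0, 3.0]))
print(y, y.sum())             # components are positive and sum to 1
```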
Its derivative is then the \(\text{Jacobian}\) matrix (using Eqs. \((1.1)\) and \((1.2)\)):
\[
\begin{align*}
\frac{\partial\, \mathbf{y}}{\partial\, \mathbf{z}} & =
\begin {bmatrix}
\frac{\partial y_1}{\partial z_1} & \frac{\partial y_1}{\partial z_2} & \cdots & \frac{\partial y_1}{\partial z_K} \\
\frac{\partial y_2}{\partial z_1} & \frac{\partial y_2}{\partial z_2} & \cdots & \frac{\partial y_2}{\partial z_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_K}{\partial z_1} &\frac{\partial y_K}{\partial z_2} & \cdots & \frac{\partial y_K}{\partial z_K}
\end {bmatrix} \\[2ex]
& =
\begin {bmatrix}
\small{y_1 - y_1 y_1} & \small{-y_1y_2} & \cdots & \small{-y_1 y_K} \\
\small{-y_2y_1} & \small{y_2 - y_2 y_2} & \cdots & \small{-y_2 y_K} \\
\vdots & \vdots & \ddots & \vdots \\
\small{-y_Ky_1} & \small{-y_K y_2} & \cdots & \small{y_K - y_K y_K}
\end {bmatrix} \\[2.5ex]
& =
\text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^\text{T} \\[0.5ex]
&=
\text{diag}(\text{softmax}(\mathbf{z})) - \text{softmax}(\mathbf{z})\, \text{softmax}(\mathbf{z})^\text{T}
\end{align*}
\]
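The closed form \(\text{diag}(\mathbf{y}) - \mathbf{y}\mathbf{y}^\text{T}\) can likewise be checked against a numerical Jacobian. A sketch assuming NumPy:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0, 0.0])
y = softmax(z)
K = len(z)

analytic = np.diag(y) - np.outer(y, y)   # diag(y) - y y^T

# Numerical Jacobian by central differences; column j holds d softmax / d z_j.
eps = 1e-6
numeric = np.zeros((K, K))
for j in range(K):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[:, j] = (softmax(zp) - softmax(zm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-8)
```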
The cross-entropy loss can be written in two forms. Let the true label be \(y\) and the prediction be \(a\):
(1) \(y\) is a scalar class label, i.e., \(y \in \{1, 2, \dots, k\}\); the cross-entropy loss is:
\[
\mathcal{L}(y, a) = - \sum\limits_{j=1}^{k} \mathbf{\large 1}\{y = j\}\, \text{log}\, a_j
\]
(2) \(y\) is a one-hot vector, i.e., \(y = \left[0, 0, \dots, 1, \dots, 0\right]^\text{T} \in \mathbb{R}^k\); the cross-entropy loss is:
\[
\mathcal{L}(y, a) = -\sum\limits_{j=1}^k y_j\, \text{log}\, a_j
\]
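The two forms are the same quantity written differently: the indicator in form (1) selects exactly the term that the one-hot vector in form (2) keeps. A small sketch assuming NumPy (the concrete numbers are illustrative):

```python
import numpy as np

a = np.array([0.2, 0.5, 0.3])        # predicted probabilities
label = 1                            # form (1): scalar class index (0-based here)
one_hot = np.array([0.0, 1.0, 0.0])  # form (2): one-hot vector for the same class

loss_scalar = -np.log(a[label])              # indicator picks out a_label
loss_onehot = -np.sum(one_hot * np.log(a))   # all other terms are multiplied by 0

assert np.isclose(loss_scalar, loss_onehot)
```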
Given \(\mathcal{L}(y, a) = -\sum\limits_{j=1}^{k} y_j\, \text{log}\, a_j\) and \(a_j = \sigma(z_j) = \frac{1}{1+e^{\,-z_j}}\), find \(\frac{\partial \mathcal{L}}{\partial z_j}\):
\[
\frac{\partial \mathcal{L}}{\partial z_j} = \frac{\partial \mathcal{L}}{\partial a_j} \frac{\partial a_j}{\partial z_j} = -y_j\, \frac{1}{\sigma(z_j)}\, \sigma(z_j) \bigl(1 - \sigma(z_j)\bigr) = y_j \bigl(\sigma(z_j) - 1\bigr) = a_j - y_j
\]
where the last equality holds for the true class (\(y_j = 1\)); when \(y_j = 0\) the term is \(0\).
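A finite-difference sketch of this result, assuming NumPy. The comparison is against \(y_j(a_j - 1)\), which equals \(a_j - y_j\) for the true class and \(0\) otherwise:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
z = rng.normal(size=4)
y = np.array([0.0, 0.0, 1.0, 0.0])          # one-hot target
a = sigmoid(z)

def loss(zz):
    return -np.sum(y * np.log(sigmoid(zz)))

eps = 1e-6
for j in range(4):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric = (loss(zp) - loss(zm)) / (2 * eps)
    analytic = y[j] * (a[j] - 1.0)          # a_j - y_j when y_j = 1, else 0
    assert abs(numeric - analytic) < 1e-7
```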
Given \(\mathcal{L}(y, a) = -\sum\limits_{i=1}^k y_i\, \text{log}\, a_i\) and \(a_j = \text{softmax}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{c=1}^{k} e^{z_c}}\), find \(\frac{\partial \mathcal{L}}{\partial z_j}\):
\[
\begin{align*}
\frac{\partial \mathcal{L}}{\partial z_j} = \sum\limits_{i=1}^k\frac{\partial \mathcal{L}}{\partial a_i} \frac{\partial a_i}{\partial z_j} & = \frac{\partial \mathcal{L}}{\partial a_j} \frac{\partial a_j}{\partial z_j} + \sum\limits_{i \neq j} \frac{\partial \mathcal{L}}{\partial a_i} \frac{\partial a_i}{\partial z_j} \\
& = -\frac{y_j}{a_j} \frac{\partial a_j}{\partial z_j} - \sum\limits_{i \neq j} \frac{y_i}{a_i}\frac{\partial a_i}{\partial z_j} \\
& = -\frac{y_j}{a_j}\, a_j(1 - a_j) + \sum\limits_{i \neq j} \frac{y_i}{a_i}\, a_i a_j \qquad\qquad \text{using Eqs. (1.1) and (1.2)} \\
& = -y_j + y_j a_j + \sum\limits_{i \neq j} y_i a_j \\
& = -y_j + a_j \sum\limits_{i=1}^k y_i \\
& = a_j - y_j \qquad\qquad \text{since } \textstyle\sum_{i=1}^k y_i = 1
\end{align*}
\]
If the input is a \(k\)-dimensional vector \(\mathbf{z} = [z_1, z_2, \dots, z_k]^\text{T}\), the gradient is:
\[
\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{a} - \mathbf{y} =
\begin{bmatrix}
a_1 - 0 \\
\vdots \\
a_j - 1 \\
\vdots \\
a_k - 0
\end{bmatrix}
\]
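As a final check, the full softmax-plus-cross-entropy gradient \(\mathbf{a} - \mathbf{y}\) can be compared against a numerical gradient. A sketch assuming NumPy:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(2)
z = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                                 # one-hot label

analytic = softmax(z) - y                  # a - y

eps = 1e-6
numeric = np.zeros(5)
for j in range(5):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    numeric[j] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-7)
```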