Derivatives in Neural Networks

This post was originally meant to cover the backpropagation algorithm for neural networks, but covering only that felt incomplete, so the relevant derivative material is collected here first. The core of differentiating a neural network is taking the derivative of the loss with respect to the linear output \(\mathbf{z} \;\; (\mathbf{z} = \mathbf{Wa} + \mathbf{b})\), i.e. the backpropagation error term \(\delta = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}\) ; once this quantity is known, the derivatives with respect to the parameters follow relatively easily.
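As a brief preview of why \(\delta\) is the key quantity (a sketch under the single-sample, column-vector convention; the full derivation belongs to backpropagation itself), the parameter gradients follow from \(\delta\) by the chain rule:
\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \delta \, \mathbf{a}^\text{T}, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \delta \]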



The \(\text{Jacobian}\) Matrix

For a function \(\boldsymbol{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m\), the \(\text{Jacobian}\) matrix is:
\[ \frac{\partial \boldsymbol{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n} \]
that is, \((\frac{\partial \boldsymbol{f}}{\partial \mathbf{x}})_{ij} = \frac{\partial f_i}{\partial x_j}\)
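As a quick sanity check of this definition, here is a minimal finite-difference sketch in NumPy; the function name `numerical_jacobian` and the test function are illustrative choices, not something from the original text.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the Jacobian of f: R^n -> R^m at x with central differences.
    Entry (i, j) approximates df_i / dx_j, matching the definition above."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

# example: f(x) = [x1*x2, x1 + x2], whose exact Jacobian is [[x2, x1], [1, 1]]
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
print(numerical_jacobian(f, np.array([2.0, 3.0])))
```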


Most activation functions in neural networks are element-wise operations. Let the input be a K-dimensional vector \(\mathbf{x} = [x_1, x_2, ..., x_K ]^\text{T}\) and the output a K-dimensional vector \(\mathbf{z} = [z_1, z_2, ..., z_K]^\text{T}\), so the activation is \(\mathbf{z} = f(\mathbf{x})\), i.e. \(z_i = [f(\mathbf{x})]_i = f(x_i)\). By the definition of the \(\text{Jacobian}\) matrix, its derivative is then a diagonal matrix:

\[ \begin{align*} \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f(x_1)}{\partial x_1} & \frac{\partial f(x_1)}{\partial x_2} & \cdots & \frac{\partial f(x_1)}{\partial x_K} \\ \frac{\partial f(x_2)}{\partial x_1} & \frac{\partial f(x_2)}{\partial x_2} & \cdots & \frac{\partial f(x_2)}{\partial x_K} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(x_K)}{\partial x_1} & \frac{\partial f(x_K)}{\partial x_2} & \cdots & \frac{\partial f(x_K)}{\partial x_K} \end{bmatrix} & = \begin{bmatrix} f'(x_1) & 0 & \cdots & 0 \\ 0 & f'(x_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f'(x_K) \end{bmatrix} \\[2ex] & = \text{diag}(f'(\mathbf{x})) \in \mathbb{R}^{K \times K} \end{align*} \]
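A small sketch of this fact in NumPy, using \(f(x) = x^3\) as a stand-in element-wise activation (an arbitrary choice for illustration):

```python
import numpy as np

f = lambda x: x ** 3            # element-wise activation (illustrative choice)
f_prime = lambda x: 3 * x ** 2  # its element-wise derivative

x = np.array([0.5, -1.0, 2.0])
J = np.diag(f_prime(x))         # Jacobian of an element-wise map: diag(f'(x))

# cross-terms d f(x_i) / d x_j vanish for i != j, so J is diagonal
print(J)
```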




\(\text{Sigmoid}\) Activation Function

The \(\text{Sigmoid}\) function has the form:
\[ \sigma(z) = \frac{1}{1+e^{\,-z}} \;\;\in (0,1) \]
Its derivative is:
\[ \sigma'(z) = -\frac{(1+e^{-z})'}{(1 + e^{-z})^2} = -\frac{-e^{-z}}{(1+ e^{-z})^2} = \frac{e^{-z}}{1 + e^{-z}} \cdot \frac{1}{1 + e^{-z}} = \sigma(z) (1 - \sigma(z)) \]
If the input is a K-dimensional vector \(\mathbf{z} = [z_1, z_2, ..., z_K]^\text{T}\), then by the definition above its derivative is
\[ \begin{align*} \sigma'(\mathbf{z}) &= \begin{bmatrix} \sigma(z_1) (1 - \sigma(z_1)) & 0 & \cdots & 0 \\ 0 & \sigma(z_2) (1 - \sigma(z_2)) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma(z_K) (1 - \sigma(z_K)) \end{bmatrix} \\[3ex] & = \text{diag} \left(\sigma(\mathbf{z}) \odot (1-\sigma(\mathbf{z}))\right) \end{align*} \]
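A minimal NumPy sketch of the sigmoid and this diagonal Jacobian; the plain `1 / (1 + exp(-z))` form is used here, which can overflow for very negative `z` but is fine for a sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z); may overflow for very negative z

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # sigma(z) * (1 - sigma(z)), element-wise

z = np.array([-2.0, 0.0, 3.0])
J = np.diag(sigmoid_grad(z))             # diag(sigma(z) ⊙ (1 - sigma(z)))
print(J)
```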




\(\text{Tanh}\) Activation Function

The \(\text{Tanh}\) function can be viewed as a scaled and shifted \(\text{Sigmoid}\) function, but since it is zero-centered, it usually converges faster than \(\text{Sigmoid}\):
\[ \text{tanh}(z) = \frac{e^{z} - e^{-z}}{e^z + e^{-z}} = \frac{2}{1 + e^{-2z}} - 1 = 2\sigma(2z) - 1 \;\; \in(-1,1) \]

Its derivative is:
\[ \text{tanh}'(z) = \frac{(e^z + e^{-z})^2 - (e^z - e^{-z})^2}{(e^z + e^{-z})^2} = 1 - \text{tanh}^2(z) \]
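The identity \(\text{tanh}'(z) = 1 - \text{tanh}^2(z)\) is easy to verify numerically; the grid of test points below is arbitrary:

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 7)
analytic = 1.0 - np.tanh(z) ** 2                              # 1 - tanh^2(z)
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)   # central difference
assert np.allclose(analytic, numeric, atol=1e-8)
```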




\(\text{Softplus}\) Activation Function

The \(\text{Softplus}\) function can be viewed as a smooth version of the \(\text{ReLU}\) function, with the form:
\[ \text{softplus}(z) = \text{log}(1+ e^z) \]

Its derivative turns out to be exactly the \(\text{Sigmoid}\) function:
\[ \text{softplus}'(z) = \frac{e^z}{1 + e^z} = \frac{1}{1+ e^{-z}} \]
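A quick numerical check that the derivative of softplus is the sigmoid; `np.log1p` is used for \(\text{log}(1+e^z)\) purely for accuracy, an implementation detail not in the original text:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))           # log(1 + e^z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (softplus(z + eps) - softplus(z - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(z), atol=1e-8)
```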




\(\text{Softmax}\) Activation Function

The \(\text{softmax}\) function maps a set of scalars to a probability distribution; its form is:
\[ y_i = \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^C e^{z_k}} \]
\(y_i\) is the \(i\)-th output and can be interpreted as the probability of class \(i\), with \(\sum\limits_{i=1}^C y_i = 1\)
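A minimal softmax sketch in NumPy; subtracting the maximum before exponentiating is a common stability trick and does not change the result, since the shift cancels between numerator and denominator:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow; result is unchanged
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())               # the outputs are positive and sum to 1
```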
