1. Why do large inputs make the softmax gradient very small?

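For an input vector $\mathbf{x} \in \mathbb{R}^{d}$, the softmax function maps (normalizes) it to a distribution $\hat{\mathbf{y}} \in \mathbb{R}^{d}$. In doing so, softmax first exponentiates each element with the natural base $e$, which widens the gaps between elements, and then normalizes the result into a distribution. Suppose the largest element of an input $\mathbf{x}$ has index $k$. If the magnitude of the input grows (every element is scaled up, and the gaps between elements with it), then $\hat{y}_{k}$ gets very close to 1.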

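A small example shows how the magnitude of $\mathbf{x}$ affects the predicted probability $\hat{y}_{k}$ of the largest input element. Take the input $\mathbf{x}=[a, a, 2 a]^{\top}$ and compare the $\hat{y}_{3}$ produced by different magnitudes of $a$.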

When $a = 1$, $\hat{y}_{3}=0.5761168847658291$;
When $a = 10$, $\hat{y}_{3}=0.999909208384341$;
When $a = 100$, $\hat{y}_{3} \approx 1.0$ (limited by floating-point precision).
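The three values above can be reproduced with a few lines of standard-library Python. This is only a sketch for illustration: the helper name `softmax` and the max-subtraction step (added for numerical stability) are my own choices, not something the discussion above prescribes.

```python
from math import exp

def softmax(xs):
    # Subtract the maximum before exponentiating: mathematically a no-op,
    # but it prevents overflow when the inputs are large.
    m = max(xs)
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Input [a, a, 2a]: the third element is always the largest.
for a in (1, 10, 100):
    y = softmax([a, a, 2 * a])
    print(f"a = {a:>3}: y_3 = {y[2]}")
```

The last digits printed may differ slightly from the values quoted above, since the shifted computation rounds differently from the direct formula.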

We can also plot $\hat{y}_{3}$ over a whole range of values of $a$. The code is as follows:

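from math import exp
from matplotlib import pyplot as plt
import numpy as np

# Softmax probability of the third element for the input [a, a, 2a]
def f(a):
    return exp(2 * a) / (exp(a) + exp(a) + exp(2 * a))

a_values = np.linspace(0, 100, 100)
y_3 = [f(a) for a in a_values]
plt.plot(a_values, y_3)
plt.xlabel("a")
plt.ylabel("y_3")
plt.show()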

The resulting plot is shown below:
[Figure: $\hat{y}_{3}$ plotted against $a$]
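As the plot shows, magnitude has a very large effect on the distribution that softmax produces. When the magnitude is large, softmax assigns almost all of the probability mass to the label corresponding to the largest value.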

Next, consider the gradient of softmax. Writing the softmax function as $g(\cdot)$, the gradient of the output distribution $\hat{\mathbf{y}}=g(\mathbf{x})$ with respect to the input $\mathbf{x}$ is:
$\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}=\operatorname{diag}(\hat{\mathbf{y}})-\hat{\mathbf{y}} \hat{\mathbf{y}}^{\top} \quad \in \mathbb{R}^{d \times d}$
Expanding this matrix:
$\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}=\left[\begin{array}{cccc} \hat{y}_{1} & 0 & \cdots & 0 \\ 0 & \hat{y}_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{y}_{d} \end{array}\right]-\left[\begin{array}{cccc} \hat{y}_{1}^{2} & \hat{y}_{1} \hat{y}_{2} & \cdots & \hat{y}_{1} \hat{y}_{d} \\ \hat{y}_{2} \hat{y}_{1} & \hat{y}_{2}^{2} & \cdots & \hat{y}_{2} \hat{y}_{d} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{y}_{d} \hat{y}_{1} & \hat{y}_{d} \hat{y}_{2} & \cdots & \hat{y}_{d}^{2} \end{array}\right]$
From the discussion above, when the elements of the input $\mathbf{x}$ are all large, softmax assigns most of the probability mass to the largest element. Suppose the input has a large magnitude and its largest element is $x_1$. Softmax then produces a nearly one-hot vector $\hat{\mathbf{y}} \approx[1,0, \cdots, 0]^{\top}$, and the matrix above becomes:
$\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}} \approx\left[\begin{array}{cccc} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{array}\right]-\left[\begin{array}{cccc} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{array}\right]=\mathbf{0}$
In other words, when the input has a large magnitude, the gradient vanishes to 0, which makes parameter updates difficult.
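To see the vanishing numerically, the sketch below builds the Jacobian $\operatorname{diag}(\hat{\mathbf{y}})-\hat{\mathbf{y}} \hat{\mathbf{y}}^{\top}$ for the input $[2a, a, a]^{\top}$ (so that $x_1$ is the largest element) and prints its largest absolute entry for a few values of $a$. The helper names `softmax`, `softmax_jacobian` and `max_abs_entry` are hypothetical, chosen just for this illustration.

```python
from math import exp

def softmax(xs):
    m = max(xs)  # shift by the maximum for numerical stability
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    # Entry (i, j) is y_i * (delta_ij - y_j), i.e. diag(y) - y y^T.
    y = softmax(xs)
    d = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(d)]
            for i in range(d)]

def max_abs_entry(matrix):
    return max(abs(v) for row in matrix for v in row)

# As a grows, the output approaches one-hot and every entry shrinks to 0.
for a in (1, 10, 100):
    jac = softmax_jacobian([2 * a, a, a])
    print(f"a = {a:>3}: largest |entry| = {max_abs_entry(jac):.3e}")
```

For $a = 1$ the largest entry is of order $10^{-1}$; by $a = 100$ every entry is numerically indistinguishable from zero, which matches the approximation above.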

Note: the softmax gradient can be derived by hand, and many derivations are available online for reference.

2. How does the dimension relate to the size of the dot product, and why scale by the square root of the dimension?

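As for why the dimension affects the size of the dot product, a footnote in the paper gives a brief explanation:
[Figure: screenshot of the paper's footnote]
Assume the components of the vectors $q$ and $k$ are mutually independent random variables with mean 0 and variance 1. Then the dot product $q \cdot k$ has mean 0 and variance $d_k$. Here is a slightly more detailed derivation: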

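For every $i=1, \cdots, d_{k}$, $q_i$ and $k_i$ are both random variables. To simplify notation, write $X=q_i$ and $Y=k_i$, with $E(\cdot)$ denoting expectation and $D(\cdot)$ denoting variance. Then $D (X) = D (Y) = 1$ and $E (X) = E (Y) = 0$. Since $X$ and $Y$ are independent: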

$E(X Y)=E(X) E(Y)=0 \times 0=0$
$\begin{aligned} D(X Y) &=E\left(X^{2} \cdot Y^{2}\right)-[E(X Y)]^{2} \\ &=E\left(X^{2}\right) E\left(Y^{2}\right)-[E(X) E(Y)]^{2} \\ &=E\left(X^{2}-0^{2}\right) E\left(Y^{2}-0^{2}\right)-[E(X) E(Y)]^{2} \\ &=E\left(X^{2}-[E(X)]^{2}\right) E\left(Y^{2}-[E(Y)]^{2}\right)-[E(X) E(Y)]^{2} \\ &=D(X) D(Y)-[E(X) E(Y)]^{2} \\ &=1 \times 1-(0 \times 0)^{2} \\ &=1 \end{aligned}$

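Thus for every $i=1, \cdots, d_{k}$, the product $q_i \cdot k_i$ has mean 0 and variance 1. By the properties of expectation and variance, for mutually independent random variables $Z_i$ we have
$E\left(\sum_{i} Z_{i}\right)=\sum_{i} E\left(Z_{i}\right)$
and
$D\left(\sum_{i} Z_{i}\right)=\sum_{i} D\left(Z_{i}\right)$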
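It follows that $q \cdot k$ has mean $E(q \cdot k)=0$ and variance $D(q \cdot k)=d_{k}$. A larger variance means the dot product tends to have a larger magnitude (it takes large values with higher probability). A natural remedy is to stabilize the variance at 1 by dividing the dot product by $\sqrt{d_{k}}$, which gives:
$D\left(\frac{q \cdot k}{\sqrt{d_{k}}}\right)=\frac{d_{k}}{\left(\sqrt{d_{k}}\right)^{2}}=1$
Keeping the variance at 1 effectively controls the vanishing-gradient problem described earlier.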
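The variance result can be checked empirically with a small Monte Carlo simulation. The sketch below makes one assumption beyond the derivation: the components are drawn from a standard normal distribution, whereas the argument only requires mean 0 and variance 1. The function name `sample_dot_products`, the sample size and the seed are likewise arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, pvariance

def sample_dot_products(d_k, n_samples, rng):
    # Assumption: each component of q and k is an independent N(0, 1) draw.
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(qi * ki for qi, ki in zip(q, k)))
    return dots

rng = random.Random(0)  # fixed seed so the run is repeatable
for d_k in (4, 64, 256):
    dots = sample_dot_products(d_k, 2000, rng)
    scaled = [v / sqrt(d_k) for v in dots]
    print(f"d_k = {d_k:>3}: mean = {mean(dots):6.2f}, "
          f"var(q.k) = {pvariance(dots):7.2f}, "
          f"var(q.k / sqrt(d_k)) = {pvariance(scaled):.3f}")
```

Up to sampling noise, the unscaled variance should track $d_k$ while the scaled variance stays close to 1.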

I hope this is a useful reference. My knowledge is limited, so please point out any mistakes.