L1Loss: Creates a criterion that measures the mean absolute error (MAE) between each element in the input x and the target y.
MAE is the sum of absolute differences between the target values and the predicted values. It therefore measures the average magnitude of the errors in a set of predictions without considering their direction (if direction were taken into account, this would be the mean bias error, MBE, i.e. the sum of the errors). Its range is 0 to ∞.
$$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left| x_n - y_n \right|$$
$$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases}$$
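A minimal usage sketch of the corresponding PyTorch module (shapes and values are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

criterion = nn.L1Loss()                           # reduction='mean' by default
input = torch.randn(3, 5, requires_grad=True)     # predictions
target = torch.randn(3, 5)                        # ground truth of the same shape
loss = criterion(input, target)                   # scalar: mean of |input - target|
loss.backward()
```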
NLLLoss: The negative log likelihood loss. It is useful for training a classification problem with C classes.
If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
The input given through a forward call is expected to contain log-probabilities of each class. The input has to be a Tensor of size either (minibatch, C) or (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1 for the K-dimensional case. Obtaining log-probabilities in a neural network is easily achieved by adding a LogSoftmax layer as the last layer of your network. You may use CrossEntropyLoss instead, if you prefer not to add an extra layer.
The target that this loss expects should be a class index in the range [0, C-1] where C = number of classes; if ignore_index is specified, this loss also accepts this class index (this index may not necessarily be in the class range).
$$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = -w_{y_n} x_{n,y_n}, \quad w_c = \text{weight}[c] \cdot \mathbb{1}\{c \neq \text{ignore\_index}\}$$
$$\ell(x, y) = \begin{cases} \sum_{n=1}^N \frac{1}{\sum_{n=1}^N w_{y_n}} l_n, & \text{if reduction} = \text{'mean';}\\ \sum_{n=1}^N l_n, & \text{if reduction} = \text{'sum'.} \end{cases}$$
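A minimal sketch of how this is typically used, with a LogSoftmax layer producing the required log-probabilities (class count and indices are illustrative):

```python
import torch
import torch.nn as nn

log_softmax = nn.LogSoftmax(dim=1)
criterion = nn.NLLLoss()                          # optionally: nn.NLLLoss(weight=class_weights)
input = torch.randn(3, 5, requires_grad=True)     # minibatch of 3, C = 5 classes
target = torch.tensor([1, 0, 4])                  # class indices in [0, C-1]
loss = criterion(log_softmax(input), target)
loss.backward()
```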
PoissonNLLLoss: Negative log likelihood loss with Poisson distribution of target.
The last term can be omitted or approximated with the Stirling formula. The approximation is used for target values greater than 1; for targets less than or equal to 1, zeros are added to the loss.
$$\text{target} \sim \mathrm{Poisson}(\text{input}), \quad \text{loss}(\text{input}, \text{target}) = \text{input} - \text{target} \cdot \log(\text{input}) + \log(\text{target}!)$$
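A minimal usage sketch; by default the module interprets the input as the log of the Poisson rate (log_input=True), and the count-like targets below are only illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.PoissonNLLLoss()                   # log_input=True by default
log_rate = torch.randn(5, 2, requires_grad=True)  # predicted log of the Poisson rate
target = torch.poisson(torch.full((5, 2), 3.0))   # illustrative count-valued targets
loss = criterion(log_rate, target)
loss.backward()
```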
KLDivLoss: The Kullback-Leibler divergence loss.
KL divergence is a useful distance measure for continuous distributions
and is often useful when performing direct regression over the space of
(discretely sampled) continuous output distributions.
As with NLLLoss, the input given is expected to contain log-probabilities and is not restricted to a 2D Tensor. The targets are given as probabilities (i.e. without taking the logarithm). This criterion expects a target Tensor of the same size as the input Tensor.
$$l(x, y) = L = \{l_1,\dots,l_N\}, \quad l_n = y_n \cdot \left( \log y_n - x_n \right)$$
$$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases}$$
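A minimal sketch with log-probabilities as input and probabilities as target, as described above (shapes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.KLDivLoss()                                            # element-wise mean by default
input = F.log_softmax(torch.randn(3, 5, requires_grad=True), dim=1)  # log-probabilities
target = F.softmax(torch.randn(3, 5), dim=1)                         # probabilities
loss = criterion(input, target)
loss.backward()
```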
MSELoss: Creates a criterion that measures the mean squared error (squared L2 norm) between each element in the input x and the target y.
$$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = \left( x_n - y_n \right)^2$$
$$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases}$$
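A minimal usage sketch (shapes and values are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()                          # reduction='mean' by default
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
loss = criterion(input, target)                   # scalar: mean of (input - target)^2
loss.backward()
```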
BCELoss: Creates a criterion that measures the Binary Cross Entropy between the target and the output. This is used for measuring the error of a reconstruction in, for example, an auto-encoder. Note that the targets y should be numbers between 0 and 1.
$$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = -w_n \left[ y_n \cdot \log x_n + (1 - y_n) \cdot \log (1 - x_n) \right]$$
$$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases}$$
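A minimal sketch; note the explicit Sigmoid so that the inputs to BCELoss lie in (0, 1), as required (shapes and the random 0/1 targets are illustrative):

```python
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()
criterion = nn.BCELoss()
logits = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)                # binary targets in {0, 1}
loss = criterion(sigmoid(logits), target)         # inputs must be probabilities in (0, 1)
loss.backward()
```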
BCEWithLogitsLoss: This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
$$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = -w_n \left[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log (1 - \sigma(x_n)) \right]$$
$$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases}$$
This is used for measuring the error of a reconstruction in, for example, an auto-encoder. Note that the targets t[i] should be numbers between 0 and 1. It is possible to trade off recall and precision by adding weights to positive examples. In the case of multi-label classification the loss can be described as:
$$\ell_c(x, y) = L_c = \{l_{1,c},\dots,l_{N,c}\}^\top, \quad l_{n,c} = -w_{n,c} \left[ p_c y_{n,c} \cdot \log \sigma(x_{n,c}) + (1 - y_{n,c}) \cdot \log (1 - \sigma(x_{n,c})) \right]$$
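A minimal sketch on raw logits (no Sigmoid needed); the pos_weight values are a made-up example of up-weighting positives per class:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3, requires_grad=True)    # raw scores for 4 samples, 3 labels
target = torch.empty(4, 3).random_(2)             # multi-hot targets in {0, 1}

criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, target)
loss.backward()

# Hypothetical per-class positive weights p_c to trade off recall vs. precision
pos_weight = torch.tensor([1.0, 2.0, 3.0])
weighted_criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
weighted_loss = weighted_criterion(logits, target)
```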
HingeEmbeddingLoss: Measures the loss given an input tensor x and a labels tensor y (containing 1 or -1). This is usually used for measuring whether two inputs are similar or dissimilar, e.g. using the L1 pairwise distance as x, and is typically used for learning nonlinear embeddings or semi-supervised learning. The loss function for the n-th sample in the mini-batch is
$$l_n = \begin{cases} x_n, & \text{if } y_n = 1,\\ \max\{0, \Delta - x_n\}, & \text{if } y_n = -1, \end{cases}$$
and the total loss function is
$$L = \{l_1,\dots,l_N\}^\top, \quad \ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean';}\\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'.} \end{cases}$$
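A minimal sketch; here x stands in for some precomputed non-negative distance (e.g. a pairwise L1 distance), and the ±1 labels are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.HingeEmbeddingLoss(margin=1.0)     # margin is the Δ in the formula above
x = torch.rand(4, requires_grad=True)             # e.g. pairwise distances between embeddings
y = torch.tensor([1., -1., 1., -1.])              # labels in {1, -1}
loss = criterion(x, y)
loss.backward()
```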
MultiLabelMarginLoss: Creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input x (a 2D mini-batch Tensor) and output y (which is a 2D Tensor of target class indices). For each sample in the mini-batch:
$$\text{loss}(x, y) = \sum_{ij}\frac{\max(0, 1 - (x[y[j]] - x[i]))}{\text{x.size}(0)}$$
where $i \in \{0, \cdots, \text{x.size}(0) - 1\}$, $j \in \{0, \cdots, \text{y.size}(0) - 1\}$, $0 \leq y[j] \leq \text{x.size}(0)-1$, and $i \neq y[j]$ for all $i, j$.
The criterion only considers a contiguous block of non-negative targets that starts at the front. This allows different samples to have variable numbers of target classes.
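A minimal sketch; the target row lists the relevant class indices (here classes 3 and 0) and is terminated by -1, per the convention described above:

```python
import torch
import torch.nn as nn

criterion = nn.MultiLabelMarginLoss()
x = torch.tensor([[0.1, 0.2, 0.4, 0.8]], requires_grad=True)  # scores for 4 classes
y = torch.tensor([[3, 0, -1, 1]])                 # targets are classes 3 and 0; -1 ends the list
loss = criterion(x, y)
loss.backward()
```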
SmoothL1Loss: Creates a criterion that uses a squared term if the absolute element-wise error falls below 1 and an L1 term otherwise. It is less sensitive to outliers than MSELoss and in some cases prevents exploding gradients (e.g. see the Fast R-CNN paper by Ross Girshick). Also known as the Huber loss:
$$\text{loss}(x, y) = \frac{1}{n} \sum_{i} z_{i}, \quad z_{i} = \begin{cases} 0.5 (x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5, & \text{otherwise} \end{cases}$$
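A minimal usage sketch (shapes are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
loss = criterion(input, target)                   # quadratic for small errors, linear for large ones
loss.backward()
```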
SoftMarginLoss: Creates a criterion that optimizes a two-class classification logistic loss between input tensor x and target tensor y (containing 1 or -1).
$$\text{loss}(x, y) = \sum_i \frac{\log(1 + \exp(-y[i] * x[i]))}{\text{x.nelement}()}$$
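A minimal sketch with ±1 targets (shapes and labels are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.SoftMarginLoss()
x = torch.randn(3, 5, requires_grad=True)
y = torch.randint(0, 2, (3, 5)).float() * 2 - 1   # targets in {1, -1}
loss = criterion(x, y)
loss.backward()
```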
CrossEntropyLoss: This criterion combines nn.LogSoftmax and nn.NLLLoss in one single class. It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.
The input is expected to contain raw, unnormalized scores for each class. The input has to be a Tensor of size either (minibatch, C) or (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1 for the K-dimensional case (described later). This criterion expects a class index in the range [0, C-1] as the target for each value of a 1D tensor of size minibatch; if ignore_index is specified, this criterion also accepts this class index (this index may not necessarily be in the class range).
$$\text{loss}(x, class) = -\log\left(\frac{\exp(x[class])}{\sum_j \exp(x[j])}\right) = -x[class] + \log\left(\sum_j \exp(x[j])\right)$$
or, in the case of the weight argument being specified:
$$\text{loss}(x, class) = \text{weight}[class]\left(-x[class] + \log\left(\sum_j \exp(x[j])\right)\right)$$
The losses are averaged across observations for each minibatch. Can also be used for higher-dimension inputs, such as 2D images, by providing an input of size (minibatch, C, d_1, d_2, ..., d_K) with K ≥ 1, where K is the number of dimensions, and a target of appropriate shape.
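A minimal sketch on raw, unnormalized scores (no LogSoftmax needed); the class count and indices are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                 # optionally: weight=..., ignore_index=...
logits = torch.randn(3, 5, requires_grad=True)    # raw scores, minibatch of 3, C = 5
target = torch.tensor([1, 0, 4])                  # class indices in [0, C-1]
loss = criterion(logits, target)
loss.backward()
```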
MultiLabelSoftMarginLoss: Creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input x and target y of size (N, C). For each sample in the minibatch:
$$\text{loss}(x, y) = -\frac{1}{C} \sum_i \left[ y[i] \cdot \log\left((1 + \exp(-x[i]))^{-1}\right) + (1 - y[i]) \cdot \log\left(\frac{\exp(-x[i])}{1 + \exp(-x[i])}\right) \right], \quad i \in \{0, \cdots, \text{x.nElement}() - 1\}, \; y[i] \in \{0, 1\}$$
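A minimal sketch with multi-hot targets of shape (N, C) (values are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.MultiLabelSoftMarginLoss()
x = torch.randn(3, 5, requires_grad=True)         # N = 3 samples, C = 5 labels
y = torch.empty(3, 5).random_(2)                  # multi-hot targets in {0, 1}
loss = criterion(x, y)
loss.backward()
```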
CosineEmbeddingLoss: Creates a criterion that measures the loss given input tensors x_1, x_2 and a Tensor label y with values 1 or -1. This is used for measuring whether two inputs are similar or dissimilar, using the cosine distance, and is typically used for learning nonlinear embeddings or semi-supervised learning.
$$\text{loss}(x, y) = \begin{cases} 1 - \cos(x_1, x_2), & \text{if } y = 1 \\ \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1 \end{cases}$$
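A minimal sketch; the embedding dimension, margin and ±1 labels are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.CosineEmbeddingLoss(margin=0.5)    # margin is optional, default 0
x1 = torch.randn(4, 8, requires_grad=True)        # first batch of embeddings
x2 = torch.randn(4, 8, requires_grad=True)        # second batch of embeddings
y = torch.tensor([1., -1., 1., -1.])              # 1 = similar pair, -1 = dissimilar pair
loss = criterion(x1, x2, y)
loss.backward()
```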
MarginRankingLoss: Creates a criterion that measures the loss given inputs x1, x2, two 1D mini-batch Tensors, and a label 1D mini-batch tensor y (containing 1 or -1). If y = 1 then it is assumed that the first input should be ranked higher (have a larger value) than the second input, and vice-versa for y = -1. The loss function for each sample in the mini-batch is:
$$\text{loss}(x, y) = \max(0, -y * (x_1 - x_2) + \text{margin})$$
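A minimal sketch; the margin and the ±1 labels are illustrative:

```python
import torch
import torch.nn as nn

criterion = nn.MarginRankingLoss(margin=0.1)
x1 = torch.randn(5, requires_grad=True)           # scores of the first items
x2 = torch.randn(5, requires_grad=True)           # scores of the second items
y = torch.tensor([1., 1., -1., 1., -1.])          # 1: x1 should rank higher; -1: x2 should
loss = criterion(x1, x2, y)
loss.backward()
```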
MultiMarginLoss: Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input x (a 2D mini-batch Tensor) and output y (which is a 1D tensor of target class indices, 0 ≤ y ≤ x.size(1)-1). For each mini-batch sample, the loss in terms of the 1D input x and scalar output y is:
$$\text{loss}(x, y) = \frac{\sum_i \max(0, \text{margin} - x[y] + x[i])^p}{\text{x.size}(0)}, \quad i \in \{0, \cdots, \text{x.size}(0) - 1\}, \; i \neq y$$
Optionally, you can give non-equal weighting on the classes by passing a 1D weight tensor into the constructor. The loss function then becomes:
$$\text{loss}(x, y) = \frac{\sum_i \max(0, w[y] * (\text{margin} - x[y] + x[i]))^p}{\text{x.size}(0)}$$
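A minimal sketch with a single sample (scores and the target class are illustrative; p, margin and weight can be passed to the constructor):

```python
import torch
import torch.nn as nn

criterion = nn.MultiMarginLoss()                  # p=1, margin=1.0 by default
x = torch.tensor([[0.1, 0.2, 0.4, 0.8]], requires_grad=True)  # scores for 4 classes
y = torch.tensor([3])                             # target class index per sample
loss = criterion(x, y)
loss.backward()
```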
TripletMarginLoss: Creates a criterion that measures the triplet loss given input tensors x1, x2, x3 and a margin with a value greater than 0. This is used for measuring a relative similarity between samples. A triplet is composed of a, p and n (i.e., anchor, positive example and negative example, respectively). The shapes of all input tensors should be (N, D). The distance swap is described in detail in the paper "Learning shallow convolutional feature descriptors with triplet losses" by V. Balntas, E. Riba et al.
The loss function for each sample in the mini-batch is:
$$L(a, p, n) = \max\{d(a_i, p_i) - d(a_i, n_i) + \mathrm{margin}, 0\}, \quad d(x_i, y_i) = \left\lVert \mathbf{x}_i - \mathbf{y}_i \right\rVert_p$$
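A minimal sketch with (N, D)-shaped anchor, positive and negative batches (sizes are illustrative):

```python
import torch
import torch.nn as nn

criterion = nn.TripletMarginLoss(margin=1.0, p=2)  # p is the norm degree used in d(·, ·)
anchor = torch.randn(100, 128, requires_grad=True)
positive = torch.randn(100, 128, requires_grad=True)
negative = torch.randn(100, 128, requires_grad=True)
loss = criterion(anchor, positive, negative)
loss.backward()
```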
CTCLoss: The Connectionist Temporal Classification loss.
Calculates loss between a continuous (unsegmented) time series and a target sequence. CTCLoss sums over the probability of possible alignments of input to target, producing a loss value which is differentiable with respect to each input node. The alignment of input to target is assumed to be "many-to-one", which limits the length of the target sequence such that it must be ≤ the input length.
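A minimal sketch; the sequence lengths, class count and blank index below are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

T, C, N, S = 50, 20, 16, 30            # input length, classes (incl. blank), batch size, max target length
criterion = nn.CTCLoss(blank=0)        # index 0 is reserved for the blank label here

# Log-probabilities of shape (T, N, C), e.g. the output of a LogSoftmax over the class dimension
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)          # target label indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)            # each input uses all T frames
target_lengths = torch.randint(10, S, (N,), dtype=torch.long)    # per-sample target lengths ≤ input length

loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```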