Understanding the detectron2 RPN loss function

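First, let's be clear about what the loss function is designed to achieve:

Design goal 1: for the anchors that get sampled, the predicted boxes should move toward the positions of the labeled boxes.
Design goal 2: for those same sampled anchors, the positive/negative prediction should move toward the ground truth.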

I. Preparing the data used to compute the loss

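Again, start with the goal: prepare the following two sets of data.
gt_labels: for one batch of images, after sampling, a matrix with one value per anchor: 1 for a positive sample, 0 for a negative sample, -1 for an ignored anchor.
matched_gt_boxes: for each anchor, a tensor holding the 4 coordinates of the label (ground-truth box) that has the highest IOU with it.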

The entry point that prepares this data:

detectron2.modeling.proposal_generator.rpn.RPN.label_and_sample_anchors

Walkthrough of the key lines

match_quality_matrix = retry_if_cuda_oom(pairwise_iou)(gt_boxes_i, anchors)

This function is implemented very neatly: just a few lines of matrix operations do the whole job. See my other article:

https://blog.csdn.net/weixin_41946623/article/details/114088381?spm=1001.2014.3001.5501

match_quality_matrix: the matrix of IOU values between each label and every anchor.
For example, if an image has 100 labels and we generate 20000 anchors for it,
the shape of this matrix is [100, 20000].
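To make the shape concrete, here is a minimal plain-Python sketch of a pairwise IOU matrix. It is not detectron2's pairwise_iou (which works on Boxes objects with tensor operations); the function names and the (x1, y1, x2, y2) tuple format are my own assumptions for illustration.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def pairwise_iou_sketch(gt_boxes, anchors):
    """Return an M x N nested list: one row per gt box, one column per anchor."""
    matrix = []
    for g in gt_boxes:
        row = []
        for a in anchors:
            # Width and height of the overlap; negative means no overlap.
            iw = min(g[2], a[2]) - max(g[0], a[0])
            ih = min(g[3], a[3]) - max(g[1], a[1])
            inter = max(0.0, iw) * max(0.0, ih)
            union = box_area(g) + box_area(a) - inter
            row.append(inter / union if union > 0 else 0.0)
        matrix.append(row)
    return matrix


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
anchors = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
iou = pairwise_iou_sketch(gt_boxes, anchors)
print(len(iou), len(iou[0]))  # rows = labels, columns = anchors
```

With 2 labels and 3 anchors the result has 2 rows and 3 columns, the same layout as the [100, 20000] matrix above.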

matched_idxs, gt_labels_i = retry_if_cuda_oom(self.anchor_matcher)(match_quality_matrix)

matched_idxs: for each anchor, the index of the label on the image that has the highest IOU with it.
gt_labels_i: for each anchor, its value 1, 0 or -1.
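The matching step can be sketched in plain Python as below. This is only an illustration of the idea: the thresholds 0.3 and 0.7 are what I believe to be the default cfg.MODEL.RPN.IOU_THRESHOLDS, the function name is hypothetical, and the sketch leaves out the "allow low quality matches" step that the real matcher can apply.

```python
def match_anchors_sketch(match_quality_matrix, low=0.3, high=0.7):
    """For each anchor (column), return the index of its best gt box and a label.

    Labels: 1 = positive, 0 = negative, -1 = ignored.
    """
    num_gt = len(match_quality_matrix)
    num_anchors = len(match_quality_matrix[0])
    matched_idxs, labels = [], []
    for j in range(num_anchors):
        column = [match_quality_matrix[i][j] for i in range(num_gt)]
        best_iou = max(column)
        matched_idxs.append(column.index(best_iou))
        if best_iou >= high:
            labels.append(1)    # overlaps some label well enough
        elif best_iou < low:
            labels.append(0)    # overlaps nothing: background
        else:
            labels.append(-1)   # in between: ignored
    return matched_idxs, labels


# 2 labels x 4 anchors
quality = [
    [0.9, 0.5, 0.10, 0.2],
    [0.1, 0.6, 0.05, 0.8],
]
matched_idxs, gt_labels_i = match_anchors_sketch(quality)
print(matched_idxs)  # [0, 1, 0, 1]
print(gt_labels_i)   # [1, -1, 0, 1]
```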

For the details of how this function is implemented, see my notes:

Link: http://note.youdao.com/noteshare?id=d8f6e3057cbff88dfa5b57c0118494b5&sub=BD8EA123D2C34E7DB366BDECC8562F5D

if self.anchor_boundary_thresh >= 0:
    # Discard anchors that go out of the boundaries of the image
    # NOTE: This is legacy functionality that is turned off by default in Detectron2
    anchors_inside_image = anchors.inside_box(image_size_i, self.anchor_boundary_thresh)
    gt_labels_i[~anchors_inside_image] = -1

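Note the comments: anchors that go beyond the image boundary are set to ignored (-1). The comment says this is legacy functionality that is turned off by default in Detectron2. I have not yet worked out why it is no longer needed, so I am leaving that as a TODO.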
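As I read it, the boundary check amounts to the following plain-Python sketch. The function name is hypothetical, and passing image_size as (height, width) is an assumption on my part rather than something confirmed from the source above.

```python
def inside_image_sketch(anchor, image_size, boundary_threshold=0):
    """True if the (x1, y1, x2, y2) anchor lies inside the image,
    allowing it to stick out by at most boundary_threshold pixels."""
    height, width = image_size  # assumed order: (height, width)
    x1, y1, x2, y2 = anchor
    return (
        x1 >= -boundary_threshold
        and y1 >= -boundary_threshold
        and x2 < width + boundary_threshold
        and y2 < height + boundary_threshold
    )


anchors = [(10, 10, 50, 50), (-8, 20, 24, 52), (600, 400, 700, 500)]
labels = [1, 1, 0]
image_size = (480, 640)

# Anchors that leave the image are marked as ignored (-1).
labels = [
    lab if inside_image_sketch(a, image_size) else -1
    for a, lab in zip(anchors, labels)
]
print(labels)  # [1, -1, -1]
```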

gt_labels_i = self._subsample_labels(gt_labels_i)

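To keep positive and negative samples balanced in the training data, roughly equal numbers of positives and negatives are sampled.
gt_labels_i: after sampling, the 1, 0 or -1 value of each anchor.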

Stepping into the code, the function that actually does the sampling is:

def subsample_labels(
    labels: torch.Tensor, num_samples: int, positive_fraction: float, bg_label: int
):

labels: receives gt_labels_i, the 1, 0 or -1 value of each anchor.
num_samples: receives the value of the config option cfg.MODEL.RPN.BATCH_SIZE_PER_IMAGE, 256 by default.
positive_fraction: receives cfg.MODEL.RPN.POSITIVE_FRACTION, 0.5 by default.
bg_label: receives a hard-coded 0.

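The number of positives to sample is num_samples * positive_fraction, which is 128 with the defaults.
If there are fewer than 128 positives to begin with, all of them are taken; the count never exceeds num_samples * positive_fraction.
The number of negatives to sample is min(num_negatives, num_samples - num_positives_sampled).
In other words, when there are plenty of negatives, num_samples - num_positives_sampled of them are taken.
When there are not enough negatives, as many as exist are taken.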
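The counting rules above can be sketched in plain Python. This is an illustration of the arithmetic only, not the detectron2 function itself (which works on tensors); the function name and the seed parameter are my own additions.

```python
import random


def subsample_labels_sketch(labels, num_samples=256, positive_fraction=0.5,
                            bg_label=0, seed=None):
    """Return (positive indices, negative indices) chosen at random."""
    rng = random.Random(seed)
    positive = [i for i, lab in enumerate(labels) if lab != -1 and lab != bg_label]
    negative = [i for i, lab in enumerate(labels) if lab == bg_label]

    # At most num_samples * positive_fraction positives, fewer if scarce.
    num_pos = min(len(positive), int(num_samples * positive_fraction))
    # Negatives fill the rest of the budget, if enough of them exist.
    num_neg = min(len(negative), num_samples - num_pos)

    return rng.sample(positive, num_pos), rng.sample(negative, num_neg)


# Few positives: all 40 are kept and negatives fill up to 256.
few_pos = [1] * 40 + [0] * 5000 + [-1] * 300
pos_idx, neg_idx = subsample_labels_sketch(few_pos, seed=0)
print(len(pos_idx), len(neg_idx))  # 40 216

# Many positives, few negatives: 128 positives, only 100 negatives.
many_pos = [1] * 1000 + [0] * 100
pos_idx2, neg_idx2 = subsample_labels_sketch(many_pos, seed=0)
print(len(pos_idx2), len(neg_idx2))  # 128 100
```

After sampling, the anchors that were not picked are relabeled as ignored (-1), so only the sampled positives and negatives contribute to the loss.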

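In the end the whole function returns:
gt_labels: for one batch of images, the per-anchor 1, 0 or -1 values of each image after sampling.
matched_gt_boxes: for each anchor, a tensor holding the 4 coordinates of the label that has the highest IOU with it.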

II. The loss function

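Design goal 1: for the anchors that get sampled, the predicted boxes should end up very close to their matched labeled boxes. That is why a smooth_l1_loss is used.
In code terms: pred_anchor_deltas should agree with gt_anchor_deltas.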

Config option involved: cfg.MODEL.RPN.BBOX_REG_LOSS_TYPE, whose default is smooth_l1.

How it is computed:

gt_anchor_deltas = [self.box2box_transform.get_deltas(anchors, k) for k in gt_boxes]

dx = wx * (target_ctr_x - src_ctr_x) / src_widths
dy = wy * (target_ctr_y - src_ctr_y) / src_heights
dw = ww * torch.log(target_widths / src_widths)
dh = wh * torch.log(target_heights / src_heights)

Explanation: wx, wy, ww, wh come from the config option MODEL.RPN.BBOX_REG_WEIGHTS = (1.0, 1.0, 1.0, 1.0), so with the defaults:

dx = (center x of matched_gt_boxes - center x of the anchor) / anchor width
dy = (center y of matched_gt_boxes - center y of the anchor) / anchor height
dw = log(width of matched_gt_boxes / anchor width)
dh = log(height of matched_gt_boxes / anchor height)

deltas = torch.stack((dx, dy, dw, dh), dim=1)

These are stacked into deltas. If there are 20000 anchors, then shape = [20000, 4], giving the offset each anchor needs.
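A worked example for a single anchor may help. The plain-Python sketch below follows the four formulas above for one (x1, y1, x2, y2) pair; the function name is hypothetical and the real get_deltas operates on whole tensors at once.

```python
import math


def get_deltas_sketch(anchor, gt_box, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return (dx, dy, dw, dh) that would move `anchor` onto `gt_box`."""
    wx, wy, ww, wh = weights

    src_w = anchor[2] - anchor[0]
    src_h = anchor[3] - anchor[1]
    src_cx = anchor[0] + 0.5 * src_w
    src_cy = anchor[1] + 0.5 * src_h

    tgt_w = gt_box[2] - gt_box[0]
    tgt_h = gt_box[3] - gt_box[1]
    tgt_cx = gt_box[0] + 0.5 * tgt_w
    tgt_cy = gt_box[1] + 0.5 * tgt_h

    dx = wx * (tgt_cx - src_cx) / src_w   # normalized by anchor width
    dy = wy * (tgt_cy - src_cy) / src_h   # normalized by anchor height
    dw = ww * math.log(tgt_w / src_w)
    dh = wh * math.log(tgt_h / src_h)
    return (dx, dy, dw, dh)


# Anchor is 10 wide and 20 tall; the gt box is 20 x 20 and shifted by (10, 10).
deltas = get_deltas_sketch((0, 0, 10, 20), (5, 10, 25, 30))
print(deltas)  # dx = 1.0, dy = 0.5, dw = log(2), dh = 0.0
```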

The smooth_l1_loss formula:
Config option involved: cfg.MODEL.RPN.SMOOTH_L1_BETA, whose default is 0.0.

The formula is:
              | 0.5 * x ** 2 / beta   if abs(x) < beta
smoothl1(x) = |
              | abs(x) - 0.5 * beta   otherwise,

where x = input - target.

The actual code is fvcore.nn.smooth_l1_loss.smooth_l1_loss, and the first few lines of the implementation are a little unusual:

if beta < 1e-5:
   # if beta == 0, then torch.where will result in nan gradients when
   # the chain rule is applied due to pytorch implementation details
   # (the False branch "0.5 * n ** 2 / 0" has an incoming gradient of
   # zeros, rather than "no gradient"). To avoid this issue, we define
   # small values of beta to be exactly l1 loss.
   loss = torch.abs(input - target)

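Note that with the default beta of 0, the first branch of the formula would divide by zero. So when beta < 1e-5 (that is, 0.00001, a very small value), the loss simply becomes torch.abs(input - target), which is plain L1 loss.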
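To see the effect of beta, here is a scalar plain-Python sketch of the formula including the small-beta fallback. It is an illustration for a single value of x, not the fvcore implementation (which works on tensors and supports reductions); the function name is my own.

```python
def smooth_l1_sketch(x, beta):
    """Smooth L1 of a single difference x = input - target."""
    if beta < 1e-5:
        # Tiny beta: treated as plain L1, avoiding the division by beta.
        return abs(x)
    if abs(x) < beta:
        return 0.5 * x ** 2 / beta   # quadratic near zero
    return abs(x) - 0.5 * beta       # linear further out


print(smooth_l1_sketch(0.5, beta=1.0))  # 0.125 (quadratic branch)
print(smooth_l1_sketch(2.0, beta=1.0))  # 1.5   (linear branch)
print(smooth_l1_sketch(0.5, beta=0.0))  # 0.5   (plain L1 fallback)
```

With the default beta of 0.0 every difference goes through the first branch, so the regression loss behaves exactly like L1.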

Design goal 2: for the anchors that get sampled, the positive/negative prediction should agree with the ground truth.
In code terms: pred_objectness_logits should agree with gt_labels.

objectness_loss = F.binary_cross_entropy_with_logits(
   cat(pred_objectness_logits, dim=1)[valid_mask],
   gt_labels[valid_mask].to(torch.float32),
   reduction="sum",
)

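This directly calls the binary cross-entropy (with logits) loss that PyTorch provides. For an explanation of cross-entropy loss, see:
https://blog.csdn.net/u014313009/article/details/51043064 (I did not fully follow it, but I think it is well written)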
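For intuition, the same quantity can be written out by hand. The plain-Python sketch below sums the binary cross-entropy over the valid anchors only; it assumes valid_mask means "label >= 0" (which is how I read the code), uses a hypothetical function name, and is not a replacement for the PyTorch call.

```python
import math


def bce_with_logits_sum_sketch(logits, labels):
    """Sum of binary cross-entropy over anchors whose label is 0 or 1."""
    total = 0.0
    for x, z in zip(logits, labels):
        if z < 0:
            continue  # ignored anchor (label -1) contributes nothing
        # Numerically stable form of -[z*log(s) + (1-z)*log(1-s)], s = sigmoid(x)
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total


logits = [0.0, 0.0, 5.0]
labels = [1, 0, -1]          # the third anchor is ignored
objectness_loss = bce_with_logits_sum_sketch(logits, labels)
print(objectness_loss)       # 2 * log(2): each undecided logit costs log(2)
```

A logit of 0 means the network is undecided (probability 0.5), so each valid anchor here costs log(2) regardless of its label.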

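In the end, the total loss consists of 2 major parts:

1. proposal_losses (anchor -> proposal generation network):
- positive/negative sample (objectness) loss
- delta loss of the boxes generated from the anchors
This first part has been described in detail above.

The second part below still needs a more detailed write-up:
2. detector_losses (proposal -> roi generation network):
- classification loss
- delta loss of the boxes generated from the rois
- mask loss

Finally, all of these losses are added together.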
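Compare this with the loss defined in the Faster R-CNN paper:
[Image from the original post: the loss formula from the Faster R-CNN paper]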

The paper says the results are not sensitive to λ, and indeed I did not find an explicit λ value in the source code.
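For reference, the multi-task loss as I recall it from the Faster R-CNN paper is written out below. This is reproduced from the paper, not recovered from the missing image, so treat the notation as my transcription: p_i is the predicted objectness of anchor i, p_i^* its 0/1 label, and t_i, t_i^* the predicted and target box deltas.

```latex
L(\{p_i\}, \{t_i\}) =
    \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
    + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

The factor p_i^* in the second term means only positive anchors contribute to the box regression loss, which matches the masking by positive labels described above.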

Some thoughts

Our project is about recognizing printed and handwritten text in children's homework.

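Thought 1:

A homework page contains a great deal of printed and handwritten text, so the number of positive anchors on a single image may well exceed 1000.
Is the total per-image sample count cfg.MODEL.RPN.BATCH_SIZE_PER_IMAGE, 256 by default (so at most 128 positives), enough?
My feeling is that it could be raised as far as practical.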

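Thought 2:

From debugging the code, smooth L1 is not really in effect, because beta defaults to 0 and the loss falls back to plain L1. Should this be adjusted?
I have not figured out why the default is 0.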

Original link: https://blog.csdn.net/weixin_41946623/article/details/114178666