Faster R-CNN Study Notes


Abstract

When Faster R-CNN was first proposed, mainstream object detection frameworks relied on region proposal algorithms to hypothesize object locations in advance. The two representative detection frameworks of the time, SPPnet and Fast R-CNN, had already greatly reduced the running time of the overall detection pipeline. However, the cost of computing region proposals had never been addressed and had become the speed bottleneck of the whole framework. This is exactly the problem Faster R-CNN sets out to solve: it introduces a Region Proposal Network (RPN) that shares the full-image convolutional features with the downstream detection network (i.e., Fast R-CNN), so region proposals come at almost no extra cost. When merging the RPN and Fast R-CNN into a single network, the authors treat the RPN as a kind of attention mechanism: the RPN tells the unified network "where to look".

1 INTRODUCTION

At the time Faster R-CNN was proposed, progress in object detection was being driven by two main lines of work: region proposal methods and region-based convolutional neural networks (R-CNNs). R-CNN was computationally very expensive when first introduced, but follow-up work (such as Fast R-CNN) drastically reduced this cost by sharing convolutions across proposals. Fast R-CNN was the state of the art at that point and could run at near real-time if the time spent on region proposals was ignored. Computing proposals was therefore the main computational bottleneck at test time.

The authors cite two representative methods of the time, Selective Search and EdgeBoxes, to illustrate that the region proposal step still consumes as much running time as the detection network.

The authors also point out that the region proposal methods of the time were implemented on the CPU. One way to speed them up would be to re-implement them on the GPU, but doing so would ignore the downstream detection network and miss the opportunity to share computation with it.

So in this paper the authors propose an elegant and effective solution: computing proposals with a deep convolutional neural network, i.e., introducing the RPN and letting it share convolutional layers with state-of-the-art object detection networks.

The authors observe that the feature maps produced by the CNN can be used not only by the downstream region-based detector (such as Fast R-CNN) but also to generate region proposals. On top of these feature maps they build the RPN: a few additional convolutional layers that simultaneously predict object bounds and objectness scores at each position of a regular grid. The RPN is therefore a fully convolutional network (FCN), which allows it to be trained end-to-end and makes it possible to fold proposal generation into the detection framework itself.

So where does the concept of the anchor come from? At the time, there were two mainstream ways of handling objects at multiple scales in an image: image pyramids and filter pyramids. Departing from both, the authors propose a pyramid of references, i.e., anchor boxes that serve as references at multiple scales and aspect ratios. The benefit of this reference pyramid is that we no longer need to enumerate images or filters at multiple scales or aspect ratios; training and testing can be done on single-scale images, which greatly helps the speed of the whole detection framework.
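
A back-of-the-envelope count of how many anchors this produces per image (the numbers are illustrative, assuming a typical 1000x600 input, a stride-16 backbone such as VGG-16, and 3 scales x 3 aspect ratios):

# rough anchor count per image under the assumptions above
feat_w, feat_h = 1000 // 16, 600 // 16          # ~62 x 37 feature map
num_scales, num_ratios = 3, 3                   # k = 9 anchors per position
print(feat_w * feat_h * num_scales * num_ratios)  # 20646, i.e. roughly 20k anchors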

2 RELATED WORK

Object Proposals. The two mainstream families of object proposal methods at the time were those based on grouping super-pixels (e.g., Selective Search) and those based on sliding windows (e.g., EdgeBoxes). In both cases, object proposal methods were used as external modules, independent of the detector itself.

Deep Networks for Object Detection. (to be continued)


Code walkthrough:

Note: all code below is based on mmdetection.

1. Anchor generation

Assume the following parameters are given:

  • Feature map size after the backbone: (4, 3). The feature map is thus divided into 12 grid cells; if only 1 anchor were placed at each cell, 12 anchors would be generated for this feature map.
  • Downsampling stride = 4. This is the total scaling factor from the original image to the current feature map; to map anchors generated on the feature map back to the original image, multiply by this factor.
  • anchor scale = 8. The scale value determines how large the generated anchors are; the number of scale values affects how many anchors are generated at each feature grid cell.
  • anchor aspect ratios = 1:2, 2:1, 1:1. The ratio values determine the anchor shapes; the number of ratio values affects how many anchors are generated at each feature grid cell.

Suppose there are $n_s$ anchor scale values and $n_a$ aspect ratio values in total; then the number of anchors that can be generated at each feature grid cell is

$$N = n_s \times n_a$$

Assume each anchor is centered at the top-left corner of its feature grid cell. With the parameters given above, 3 anchors are generated at each feature grid cell, as shown below:

base_anchors = torch.Tensor([[-22.6274, -11.3137,  22.6274,  11.3137], 
                             [-16.0000, -16.0000,  16.0000,  16.0000], # scale * stride = 8 * 4 = 32
                             [-11.3137, -22.6274,  11.3137,  22.6274]])
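
Where do these three base_anchors come from? The sketch below is my own reconstruction (not the exact mmdetection AnchorGenerator code; the ratio is assumed to follow the h/w convention, so ratio = 0.5 is the wide 1:2 box) of how they can be derived from stride, scale and aspect ratios:

import torch

stride = 4
scale = 8
ratios = torch.Tensor([0.5, 1.0, 2.0])   # h/w

base_size = stride                  # base cell size on the original image
h_ratios = torch.sqrt(ratios)       # sqrt keeps the anchor area constant across ratios
w_ratios = 1.0 / h_ratios

ws = base_size * scale * w_ratios   # widths:  [45.2548, 32.0000, 22.6274]
hs = base_size * scale * h_ratios   # heights: [22.6274, 32.0000, 45.2548]

# anchors centered at (0, 0), stored as (x1, y1, x2, y2)
base_anchors = torch.stack([-0.5 * ws, -0.5 * hs, 0.5 * ws, 0.5 * hs], dim=-1)
print(base_anchors)
# tensor([[-22.6274, -11.3137,  22.6274,  11.3137],
#         [-16.0000, -16.0000,  16.0000,  16.0000],
#         [-11.3137, -22.6274,  11.3137,  22.6274]])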

What remains is to translate base_anchors to every feature grid cell; the step of this translation is determined by stride. The code is as follows:

import torch


base_anchors = torch.Tensor([[-22.6274, -11.3137,  22.6274,  11.3137], 
                             [-16.0000, -16.0000,  16.0000,  16.0000], # scale * stride = 8 * 4 = 32
                             [-11.3137, -22.6274,  11.3137,  22.6274]])
stride = (4, 4)
featmap_size = torch.Size([4, 3])
feat_h, feat_w = featmap_size

# shift_x / shift_y are the offsets along the x / y direction
shift_x = torch.arange(0, feat_w) * stride[0]
shift_y = torch.arange(0, feat_h) * stride[1]

# shift_xx repeats the x offsets once per y offset (we have not actually
# shifted along y yet; think of this as preparing for the shifts along y)
shift_xx = shift_x.repeat(len(shift_y))
shift_yy = shift_y.view(-1, 1).repeat(1, len(shift_x)).view(-1)

# shifts is the offset matrix over the whole feature grid, i.e. the actual
# per-cell translations that will be applied
shifts = torch.stack([shift_xx, shift_yy, shift_xx, shift_yy], dim=-1)
shifts = shifts.type_as(base_anchors) # match the dtype of base_anchors

# Now we only need to add base_anchors to the offset matrix to slide
# base_anchors over the feature grid.

# Letting the first dimension of shifts serve as the first dimension of
# all_anchors ensures that, after sliding, the 3 anchors belonging to the
# same feature grid cell stay together, so they remain adjacent after the
# view() operation below.
all_anchors = base_anchors[None, :, :] + shifts[:, None, :]
all_anchors = all_anchors.view(-1, 4)

print(f'all_anchors = \n{all_anchors}')

The generated anchors are as follows:

all_anchors = 
tensor([[-22.6274, -11.3137,  22.6274,  11.3137], # anchor centered at (0, 0), scale=8, aspect ratio=1:2
        [-16.0000, -16.0000,  16.0000,  16.0000], # anchor centered at (0, 0), scale=8, aspect ratio=1:1
        [-11.3137, -22.6274,  11.3137,  22.6274], # anchor centered at (0, 0), scale=8, aspect ratio=2:1
        [-18.6274, -11.3137,  26.6274,  11.3137], # anchor centered at (4, 0), scale=8, aspect ratio=1:2
        [-12.0000, -16.0000,  20.0000,  16.0000], # anchor centered at (4, 0), scale=8, aspect ratio=1:1
        [ -7.3137, -22.6274,  15.3137,  22.6274], # anchor centered at (4, 0), scale=8, aspect ratio=2:1
        [-14.6274, -11.3137,  30.6274,  11.3137], # anchor centered at (8, 0), ...
        [ -8.0000, -16.0000,  24.0000,  16.0000], # anchor centered at (8, 0)
        [ -3.3137, -22.6274,  19.3137,  22.6274], # anchor centered at (8, 0)
        [-22.6274,  -7.3137,  22.6274,  15.3137], # anchor centered at (0, 4)
        [-16.0000, -12.0000,  16.0000,  20.0000], # anchor centered at (0, 4)
        [-11.3137, -18.6274,  11.3137,  26.6274], # anchor centered at (0, 4)
        [-18.6274,  -7.3137,  26.6274,  15.3137], # anchor centered at (4, 4)
        [-12.0000, -12.0000,  20.0000,  20.0000], # anchor centered at (4, 4)
        [ -7.3137, -18.6274,  15.3137,  26.6274], # anchor centered at (4, 4)
        [-14.6274,  -7.3137,  30.6274,  15.3137], # ...
        [ -8.0000, -12.0000,  24.0000,  20.0000],
        [ -3.3137, -18.6274,  19.3137,  26.6274],
        [-22.6274,  -3.3137,  22.6274,  19.3137],
        [-16.0000,  -8.0000,  16.0000,  24.0000],
        [-11.3137, -14.6274,  11.3137,  30.6274],
        [-18.6274,  -3.3137,  26.6274,  19.3137],
        [-12.0000,  -8.0000,  20.0000,  24.0000],
        [ -7.3137, -14.6274,  15.3137,  30.6274],
        [-14.6274,  -3.3137,  30.6274,  19.3137],
        [ -8.0000,  -8.0000,  24.0000,  24.0000],
        [ -3.3137, -14.6274,  19.3137,  30.6274],
        [-22.6274,   0.6863,  22.6274,  23.3137],
        [-16.0000,  -4.0000,  16.0000,  28.0000],
        [-11.3137, -10.6274,  11.3137,  34.6274],
        [-18.6274,   0.6863,  26.6274,  23.3137],
        [-12.0000,  -4.0000,  20.0000,  28.0000],
        [ -7.3137, -10.6274,  15.3137,  34.6274],
        [-14.6274,   0.6863,  30.6274,  23.3137],
        [ -8.0000,  -4.0000,  24.0000,  28.0000],
        [ -3.3137, -10.6274,  19.3137,  34.6274]])
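
As a quick sanity check (run as a continuation of the script above), the anchor centers should land exactly on the stride-4 grid points of the original image:

# centers: ((x1 + x2) / 2, (y1 + y2) / 2) for every generated anchor
centers_x = (all_anchors[:, 0] + all_anchors[:, 2]) / 2
centers_y = (all_anchors[:, 1] + all_anchors[:, 3]) / 2
print(torch.unique(centers_x))  # tensor([0., 4., 8.])
print(torch.unique(centers_y))  # tensor([ 0.,  4.,  8., 12.])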

Supplementary material:

import torch


def _meshgrid(x, y, row_major=True):
    xx = x.repeat(len(y)) # len(y) = 4
    # print(f'xx = {xx}')
    # Results:
    # xx = tensor([0, 4, 8, 0, 4, 8, 0, 4, 8, 0, 4, 8])
    yy = y.view(-1, 1).repeat(1, len(x)).view(-1)
    # print(f'yy_ = {y.view(-1, 1).repeat(1, len(x))}')
    # print(f'yy = {yy}')
    # Results:
    # yy_ = tensor([[ 0,  0,  0],
    #               [ 4,  4,  4],
    #               [ 8,  8,  8],
    #               [12, 12, 12]])
    # yy = tensor([ 0,  0,  0,  4,  4,  4,  8,  8,  8, 12, 12, 12])

    if row_major:
        return xx, yy
    else:
        return yy, xx

featmap_size = torch.Size([4, 3])
# print(f'featmap_size = {featmap_size}')
# Results: featmap_size = torch.Size([4, 3])
stride = (4, 4)
base_anchors = torch.Tensor([[-22.6274, -11.3137,  22.6274,  11.3137], 
                             [-16.0000, -16.0000,  16.0000,  16.0000], # scale * stride = 8 * 4 = 32
                             [-11.3137, -22.6274,  11.3137,  22.6274]])

feat_h, feat_w = featmap_size
feat_h, feat_w = int(feat_h), int(feat_w)
# print(f'before int, feat_h = {feat_h}, feat_w = {feat_w}')
# print(f'after int, feat_h = {feat_h}, feat_w = {feat_w}')
# Results:
# before int, feat_h = 4, feat_w = 3
# after int, feat_h = 4, feat_w = 3

shift_x = torch.arange(0, feat_w) * stride[0]
shift_y = torch.arange(0, feat_h) * stride[1]
# print(f'shift_x = {shift_x}, shift_y = {shift_y}')
# Results:
# shift_x = tensor([0, 4, 8]), shift_y = tensor([ 0,  4,  8, 12])

shift_xx, shift_yy = _meshgrid(shift_x, shift_y)
shifts = torch.stack([shift_xx, shift_yy, shift_xx, shift_yy], dim=-1)
# print(f'shifts = {shifts}')
# Results:
# shifts = tensor([[ 0,  0,  0,  0],
#                  [ 4,  0,  4,  0],
#                  [ 8,  0,  8,  0],
#                  [ 0,  4,  0,  4],
#                  [ 4,  4,  4,  4],
#                  [ 8,  4,  8,  4],
#                  [ 0,  8,  0,  8],
#                  [ 4,  8,  4,  8],
#                  [ 8,  8,  8,  8],
#                  [ 0, 12,  0, 12],
#                  [ 4, 12,  4, 12],
#                  [ 8, 12,  8, 12]])
shifts = shifts.type_as(base_anchors)
print(f'base_anchors.shape = {base_anchors.shape}, shifts.shape = {shifts.shape}')
# Results:
# base_anchors.shape = torch.Size([3, 4]), shifts.shape = torch.Size([12, 4])

all_anchors = base_anchors[None, :, :] + shifts[:, None, :]
print(f'all_anchors.shape = {all_anchors.shape}\nall_anchors = \n{all_anchors}')

all_anchors = all_anchors.view(-1, 4)
print(f'all_anchors.shape = {all_anchors.shape}\nall_anchors = \n{all_anchors}')

"""
Results:

base_anchors.shape = torch.Size([3, 4]), shifts.shape = torch.Size([12, 4])
all_anchors.shape = torch.Size([12, 3, 4])
all_anchors = 
tensor([[[-22.6274, -11.3137,  22.6274,  11.3137],
         [-16.0000, -16.0000,  16.0000,  16.0000],
         [-11.3137, -22.6274,  11.3137,  22.6274]],

        [[-18.6274, -11.3137,  26.6274,  11.3137],
         [-12.0000, -16.0000,  20.0000,  16.0000],
         [ -7.3137, -22.6274,  15.3137,  22.6274]],

        [[-14.6274, -11.3137,  30.6274,  11.3137],
         [ -8.0000, -16.0000,  24.0000,  16.0000],
         [ -3.3137, -22.6274,  19.3137,  22.6274]],

        [[-22.6274,  -7.3137,  22.6274,  15.3137],
         [-16.0000, -12.0000,  16.0000,  20.0000],
         [-11.3137, -18.6274,  11.3137,  26.6274]],

        [[-18.6274,  -7.3137,  26.6274,  15.3137],
         [-12.0000, -12.0000,  20.0000,  20.0000],
         [ -7.3137, -18.6274,  15.3137,  26.6274]],

        [[-14.6274,  -7.3137,  30.6274,  15.3137],
         [ -8.0000, -12.0000,  24.0000,  20.0000],
         [ -3.3137, -18.6274,  19.3137,  26.6274]],

        [[-22.6274,  -3.3137,  22.6274,  19.3137],
         [-16.0000,  -8.0000,  16.0000,  24.0000],
         [-11.3137, -14.6274,  11.3137,  30.6274]],

        [[-18.6274,  -3.3137,  26.6274,  19.3137],
         [-12.0000,  -8.0000,  20.0000,  24.0000],
         [ -7.3137, -14.6274,  15.3137,  30.6274]],

        [[-14.6274,  -3.3137,  30.6274,  19.3137],
         [ -8.0000,  -8.0000,  24.0000,  24.0000],
         [ -3.3137, -14.6274,  19.3137,  30.6274]],

        [[-22.6274,   0.6863,  22.6274,  23.3137],
         [-16.0000,  -4.0000,  16.0000,  28.0000],
         [-11.3137, -10.6274,  11.3137,  34.6274]],

        [[-18.6274,   0.6863,  26.6274,  23.3137],
         [-12.0000,  -4.0000,  20.0000,  28.0000],
         [ -7.3137, -10.6274,  15.3137,  34.6274]],

        [[-14.6274,   0.6863,  30.6274,  23.3137],
         [ -8.0000,  -4.0000,  24.0000,  28.0000],
         [ -3.3137, -10.6274,  19.3137,  34.6274]]])
all_anchors.shape = torch.Size([36, 4])
all_anchors = 
tensor([[-22.6274, -11.3137,  22.6274,  11.3137],
        [-16.0000, -16.0000,  16.0000,  16.0000],
        [-11.3137, -22.6274,  11.3137,  22.6274],
        [-18.6274, -11.3137,  26.6274,  11.3137],
        [-12.0000, -16.0000,  20.0000,  16.0000],
        [ -7.3137, -22.6274,  15.3137,  22.6274],
        [-14.6274, -11.3137,  30.6274,  11.3137],
        [ -8.0000, -16.0000,  24.0000,  16.0000],
        [ -3.3137, -22.6274,  19.3137,  22.6274],
        [-22.6274,  -7.3137,  22.6274,  15.3137],
        [-16.0000, -12.0000,  16.0000,  20.0000],
        [-11.3137, -18.6274,  11.3137,  26.6274],
        [-18.6274,  -7.3137,  26.6274,  15.3137],
        [-12.0000, -12.0000,  20.0000,  20.0000],
        [ -7.3137, -18.6274,  15.3137,  26.6274],
        [-14.6274,  -7.3137,  30.6274,  15.3137],
        [ -8.0000, -12.0000,  24.0000,  20.0000],
        [ -3.3137, -18.6274,  19.3137,  26.6274],
        [-22.6274,  -3.3137,  22.6274,  19.3137],
        [-16.0000,  -8.0000,  16.0000,  24.0000],
        [-11.3137, -14.6274,  11.3137,  30.6274],
        [-18.6274,  -3.3137,  26.6274,  19.3137],
        [-12.0000,  -8.0000,  20.0000,  24.0000],
        [ -7.3137, -14.6274,  15.3137,  30.6274],
        [-14.6274,  -3.3137,  30.6274,  19.3137],
        [ -8.0000,  -8.0000,  24.0000,  24.0000],
        [ -3.3137, -14.6274,  19.3137,  30.6274],
        [-22.6274,   0.6863,  22.6274,  23.3137],
        [-16.0000,  -4.0000,  16.0000,  28.0000],
        [-11.3137, -10.6274,  11.3137,  34.6274],
        [-18.6274,   0.6863,  26.6274,  23.3137],
        [-12.0000,  -4.0000,  20.0000,  28.0000],
        [ -7.3137, -10.6274,  15.3137,  34.6274],
        [-14.6274,   0.6863,  30.6274,  23.3137],
        [ -8.0000,  -4.0000,  24.0000,  28.0000],
        [ -3.3137, -10.6274,  19.3137,  34.6274]])

"""

2. Data processing

The COCO dataset is organized on disk as follows:

data_root/
	|--- train2017/
			|--- 000000000009.jpg
			|--- 000000000025.jpg
			|--- 000000000030.jpg
			|--- ... (118,287 images in total)
	|--- val2017/
			|--- 000000000139.jpg
			|--- 000000000285.jpg
			|--- 000000000632.jpg
			|--- ... (5,000 images in total)
	|--- annotations/
			|--- instances_train2017.json
			|--- instances_val2017.json
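
For reference, each instances_*.json file follows the standard COCO annotation format; schematically it looks like the Python dict below (the annotation id, bbox and area values are illustrative, not taken from the real file):

coco_ann = {
    'images': [
        {'id': 391895, 'file_name': '000000391895.jpg', 'height': 360, 'width': 640},
    ],
    'annotations': [
        # bbox is [x, y, w, h] in absolute pixel coordinates
        {'id': 151091, 'image_id': 391895, 'category_id': 4,
         'bbox': [359.17, 146.17, 112.45, 213.57], 'area': 12190.4, 'iscrowd': 0},
    ],
    'categories': [
        {'id': 1, 'name': 'person', 'supercategory': 'person'},
        {'id': 4, 'name': 'motorcycle', 'supercategory': 'vehicle'},
    ],
}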

So how does the raw data get turned, step by step, into the image and meta information we ultimately want?

step 1: in mmdetection/tools/train.py:

from mmdet.datasets import build_dataset


def main():
	datasets = [build_dataset(cfg.data.train)]
"""
The content of cfg.data.train is as follows:
train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline)
where dataset_type = 'CocoDataset', and
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
],
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
"""

step 2: in mmdetection/mmdet/datasets/builder.py:

from mmcv.utils import Registry, build_from_cfg


def build_dataset(cfg, default_args=None):
	dataset = build_from_cfg(cfg, DATASETS, default_args)
	
	return dataset

step 3: in mmcv/utils/registry.py:

def build_from_cfg(cfg, registry, default_args=None):
	args = cfg.copy()
	obj_type = args.pop('type')
	if isinstance(obj_type, str):
		obj_cls = registry.get(obj_type)
	
	return obj_cls(**args)
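
To make the registry mechanism concrete, here is a much-simplified sketch (illustrative only, not mmcv's actual implementation; the ToyCocoDataset class is a stand-in) of how register_module() and build_from_cfg() fit together:

class Registry:
    """A toy stand-in for mmcv.utils.Registry."""

    def __init__(self, name):
        self.name = name
        self._module_dict = {}

    def register_module(self):
        # returns a decorator that records the class under its own name
        def _register(cls):
            self._module_dict[cls.__name__] = cls
            return cls
        return _register

    def get(self, key):
        return self._module_dict[key]


DATASETS = Registry('dataset')


@DATASETS.register_module()
class ToyCocoDataset:
    def __init__(self, ann_file, img_prefix=''):
        self.ann_file = ann_file
        self.img_prefix = img_prefix


# build_from_cfg in a nutshell: pop 'type', look the class up in the
# registry, and instantiate it with the remaining keys as kwargs
cfg = dict(type='ToyCocoDataset',
           ann_file='annotations/instances_train2017.json',
           img_prefix='train2017/')
args = cfg.copy()
obj_cls = DATASETS.get(args.pop('type'))
dataset = obj_cls(**args)
print(type(dataset).__name__)  # ToyCocoDataset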

step 4: a three-level inheritance chain:

  • in mmdetection/mmdet/datasets/coco.py:
    from .custom import CustomDataset
    
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval
    
    
    @DATASETS.register_module()
    class CocoDataset(CustomDataset):
    	...
    
  • in mmdetection/mmdet/datasets/custom.py:
    from torch.utils.data import Dataset
    
    
    @DATASETS.register_module()
    class CustomDataset(Dataset):
    	def __init__():
    		...
    		
    	def __len__():
    		...
    
  • torch.utils.data.Dataset and torch.utils.data.DataLoader in PyTorch: https://blog.csdn.net/geter_CS/article/details/83378786
    Example (adapted from the link above; a DataLoader usage sketch follows the class):
    import numpy as np
    import torch
    from torch.utils.data import Dataset
    
    
    class TxtDataset(Dataset):  # a Dataset subclass
        def __init__(self):
            # feature vectors: each 2-d row represents a piece of text
            self.Data = np.asarray([[1, 2], [3, 4], [2, 1], [6, 4], [4, 5]])
            # 1-d labels: the class of each piece of text
            self.Label = np.asarray([1, 2, 0, 1, 2])
    
        def __getitem__(self, index):
            txt = torch.LongTensor(self.Data[index])
            label = torch.tensor(self.Label[index], dtype=torch.long)
            return txt, label  # return the sample and its label
    
        def __len__(self):
            return len(self.Data)
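    
    A quick usage sketch (my own addition, not from the linked post) showing how DataLoader batches whatever __getitem__ returns:
    
    from torch.utils.data import DataLoader
    
    loader = DataLoader(TxtDataset(), batch_size=2, shuffle=False)
    for txt, label in loader:
        # 5 samples with batch_size=2 -> batches of size 2, 2 and 1
        print(txt.shape, label.shape)
    # torch.Size([2, 2]) torch.Size([2])
    # torch.Size([2, 2]) torch.Size([2])
    # torch.Size([1, 2]) torch.Size([1])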
    

The __init__() step is carried out in custom.py:

@DATASETS.register_module()
class CustomDataset(Dataset):
	def __init__(self,
				 ann_file,
				 pipeline,
				 data_root=None,
				 img_prefix=''):
		# load annotations (and proposals)
		self.data_infos = self.load_annotations(self.ann_file)

The load_annotations() function is overridden in coco.py:

@DATASETS.register_module()
class CocoDataset(CustomDataset):

    def load_annotations(self, ann_file):
        """Load annotation from COCO style annotation file.

        Args:
            ann_file (str): Path of annotation file.

        Returns:
            list[dict]: Annotation info from COCO api.
        """

        self.coco = COCO(ann_file)
        self.cat_ids = self.coco.get_cat_ids(cat_names=self.CLASSES)
        self.cat2label = {cat_id: i for i, cat_id in enumerate(self.cat_ids)}
        self.img_ids = self.coco.get_img_ids()
        # print(f'===> in coco.py, \nself.cat_ids = {self.cat_ids}'
        #       f'\nself.cat2label = {self.cat2label}'
        #       f'\nself.img_ids = {self.img_ids}')
        # Results:
        # self.cat_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 
        # 18, 19, 20, 21, 22, 23, 24, 25, 27, 28, 31, 32, 33, 34, 35, 36, 37, 
        # 38, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
        # 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 70, 72, 73, 74, 75, 76, 
        # 77, 78, 79, 80, 81, 82, 84, 85, 86, 87, 88, 89, 90]

        # self.cat2label = {
        #   1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, 9: 8, 10: 9, 
        #   11: 10, 13: 11, 14: 12, 15: 13, 16: 14, 17: 15, 18: 16, 19: 17, 
        #   20: 18, 21: 19, 22: 20, 23: 21, 24: 22, 25: 23, 27: 24, 28: 25, 
        #   31: 26, 32: 27, 33: 28, 34: 29, 35: 30, 36: 31, 37: 32, 38: 33, 
        #   39: 34, 40: 35, 41: 36, 42: 37, 43: 38, 44: 39, 46: 40, 47: 41, 
        #   48: 42, 49: 43, 50: 44, 51: 45, 52: 46, 53: 47, 54: 48, 55: 49, 
        #   56: 50, 57: 51, 58: 52, 59: 53, 60: 54, 61: 55, 62: 56, 63: 57, 
        #   64: 58, 65: 59, 67: 60, 70: 61, 72: 62, 73: 63, 74: 64, 75: 65, 
        #   76: 66, 77: 67, 78: 68, 79: 69, 80: 70, 81: 71, 82: 72, 84: 73, 
        #   85: 74, 86: 75, 87: 76, 88: 77, 89: 78, 90: 79
        # }
        # self.img_ids = [
        #   391895, 522418, 184613, ...
        # ]
        data_infos = []
        for i in self.img_ids:
            info = self.coco.load_imgs([i])[0]
            info['filename'] = info['file_name']
            data_infos.append(info)
        # print(f'===> data_infos[:2] = {data_infos[:2]}')
        # Results:
        # data_infos[:2] = [
        #     {
        #         'license': 3, 
        #         'file_name': '000000391895.jpg', 
        #         'coco_url': 'http://images.cocodataset.org/train2017/000000391895.jpg', 
        #         'height': 360, 
        #         'width': 640, 
        #         'date_captured': '2013-11-14 11:18:45', 
        #         'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg', 
        #         'id': 391895, 
        #         'filename': '000000391895.jpg'
        #     }, 
        #     {
        #         'license': 4, 
        #         'file_name': '000000522418.jpg', 
        #         'coco_url': 'http://images.cocodataset.org/train2017/000000522418.jpg', 
        #         'height': 480, 
        #         'width': 640, 
        #         'date_captured': '2013-11-14 11:38:44', 
        #         'flickr_url': 'http://farm1.staticflickr.com/1/127244861_ab0c0381e7_z.jpg', 
        #         'id': 522418, 
        #         'filename': '000000522418.jpg'
        #     }
        # ]
        
        return data_infos

The purpose of load_annotations is to build data_infos. Each entry of data_infos carries 4 important pieces of information:

  • height
  • width
  • id
  • filename

Back in the __init__() function in custom.py:

@DATASETS.register_module()
class CustomDataset(Dataset):
	def __init__(self,
				 ann_file,
				 pipeline,
				 data_root=None,
				 img_prefix=''):
		# load annotations (and proposals)
		self.data_infos = self.load_annotations(self.ann_file)

		# filter images too small and containing no annotations
		if not test_mode:
			valid_inds = self._filter_imgs()
			self.data_infos = [self.data_infos[i] for i in valid_inds]
			# set group flag for the sampler
			self._set_group_flag()

There is a step that filters out small images and images without annotations, namely _filter_imgs(), which is defined in coco.py:

@DATASETS.register_module()
class CocoDataset(CustomDataset):

    def _filter_imgs(self, min_size=32):
        """Filter images too small or without ground truths."""
        valid_inds = []
        # obtain images that contain annotation
        ids_with_ann = set(_['image_id'] for _ in self.coco.anns.values())
        # obtain images that contain annotations of the required categories
        ids_in_cat = set()
        for i, class_id in enumerate(self.cat_ids):
            ids_in_cat |= set(self.coco.cat_img_map[class_id])
        # merge the image id sets of the two conditions and use the merged set
        # to filter out images if self.filter_empty_gt=True
        ids_in_cat &= ids_with_ann

        valid_img_ids = []
        for i, img_info in enumerate(self.data_infos):
            img_id = self.img_ids[i]
            if self.filter_empty_gt and img_id not in ids_in_cat:
                continue
            if min(img_info['width'], img_info['height']) >= min_size:
                valid_inds.append(i)
                valid_img_ids.append(img_id)
        self.img_ids = valid_img_ids
        return valid_inds

This function filters out the following three kinds of images:

  • images without annotations;
  • images whose shorter side is smaller than min_size;
  • images whose annotated categories do not belong to self.cat_ids.

After filtering there is one more step, _set_group_flag(), which is defined in custom.py:

@DATASETS.register_module()
class CustomDataset(Dataset):

    def _set_group_flag(self):
        """Set flag according to image aspect ratio.

        Images with aspect ratio greater than 1 will be set as group 1,
        otherwise group 0.
        """
        self.flag = np.zeros(len(self), dtype=np.uint8)
        for i in range(len(self)):
            img_info = self.data_infos[i]
            if img_info['width'] / img_info['height'] > 1:
                self.flag[i] = 1

Finally, back once more in the __init__() function in custom.py:

from .pipelines import Compose


@DATASETS.register_module()
class CustomDataset(Dataset):
	def __init__(self,
				 ann_file,
				 pipeline,
				 data_root=None,
				 img_prefix=''):
		# load annotations (and proposals)
		self.data_infos = self.load_annotations(self.ann_file)

		# filter images too small and containing no annotations
		if not test_mode:
			valid_inds = self._filter_imgs()
			self.data_infos = [self.data_infos[i] for i in valid_inds]
			# set group flag for the sampler
			self._set_group_flag()
		
		# processing pipeline
		self.pipeline = Compose(pipeline)

Here, pipeline is a list[dict]:

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]

The Compose class is defined in mmdetection/mmdet/datasets/pipelines/compose.py:

@PIPELINES.register_module()
class Compose(object):
    """Compose multiple transforms sequentially.

    Args:
        transforms (Sequence[dict | callable]): Sequence of transform object or
            config dict to be composed.
    """

    def __init__(self, transforms):
        assert isinstance(transforms, collections.abc.Sequence)
        self.transforms = []
        for transform in transforms:
            if isinstance(transform, dict):
                transform = build_from_cfg(transform, PIPELINES)
                self.transforms.append(transform)
            elif callable(transform):
                self.transforms.append(transform)
            else:
                raise TypeError('transform must be callable or a dict')

    def __call__(self, data):
        """Call function to apply transforms sequentially.

        Args:
            data (dict): A result dict contains the data to transform.

        Returns:
           dict: Transformed data.
        """

        for t in self.transforms:
            data = t(data)
            if data is None:
                return None
        return data

The Compose class instantiates every pipeline step specified in the config file; self.pipeline holds the resulting Compose instance.
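
To make the chaining performed by __call__() concrete, here is a toy example (the two transforms below are stand-ins, not the real mmdetection pipeline classes):

class LoadImageFromFile:
    def __call__(self, results):
        results['img'] = f"pretend we loaded {results['img_info']['filename']}"
        return results


class RandomFlip:
    def __call__(self, results):
        results['flip'] = False   # pretend we decided not to flip
        return results


transforms = [LoadImageFromFile(), RandomFlip()]

data = {'img_info': {'filename': '000000391895.jpg'}}
for t in transforms:   # this loop is essentially what Compose.__call__ does
    data = t(data)
print(data)
# {'img_info': {'filename': '000000391895.jpg'},
#  'img': 'pretend we loaded 000000391895.jpg', 'flip': False}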

At this point, all the initialization work triggered by the following call defined in mmdetection/tools/train.py

def main():
	datasets = [build_dataset(cfg.data.train)]

has been completed; what runs next is the logic defined by the __call__() method of the Compose class.

However, all of the initialization above only completes the first wrapping step, namely wrapping everything into a PyTorch dataset; the next step is to wrap the dataset into a dataloader. This involves what is defined in mmdetection/mmdet/apis/train.py:

from mmdet.datasets import build_dataloader


def train_detector():
	
	data_loaders = [
		build_dataloader(
			ds,
			cfg.data.samples_per_gpu,
			cfg.data.workers_per_gpu,
			len(cfg.gpu_ids),
			dist=distributed,
			seed=cfg.seed) for ds in dataset
	]
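
For intuition, here is a much-simplified, single-GPU, non-distributed sketch of what build_dataloader ultimately constructs (the real implementation additionally uses a GroupSampler driven by the aspect-ratio flag set earlier, a distributed sampler when dist=True, worker seeding, and further options; the samples_per_gpu / workers_per_gpu values below are illustrative):

from functools import partial

from mmcv.parallel import collate
from torch.utils.data import DataLoader

samples_per_gpu, workers_per_gpu = 2, 2   # illustrative values

data_loader = DataLoader(
    dataset,                              # the CocoDataset built earlier
    batch_size=samples_per_gpu,
    num_workers=workers_per_gpu,
    shuffle=True,
    collate_fn=partial(collate, samples_per_gpu=samples_per_gpu))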

in mmdetection/mmdet/datasets/builder.py:

def build_dataloader():
	rank, world_size = get_dist_info()

The get_dist_info() function is defined in mmcv/runner/dist_utils.py:

from torch import distributed as dist


def get_dist_info():
    if TORCH_VERSION < '1.0':
        initialized = dist._initialized
    else:
        if dist.is_available():
            initialized = dist.is_initialized()
        else:
            initialized = False
    if initialized:
        rank = dist.get_rank()
        world_size = dist.get_world_size()
    else:
        rank = 0
        world_size = 1
    return rank, world_size

To understand torch.distributed.get_rank() and torch.distributed.get_world_size(), we first need a few concepts from distributed training in PyTorch:

rank is the unique identifier of each process within a distributed process group. get_rank() returns the rank of the current process.

get_world_size() returns the number of processes in the current process group.

is_initialized() checks whether the default process group has been initialized.
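
A minimal single-process sketch (assuming the distributed package was built with the gloo backend) of what these calls return before and after initializing the default process group:

import torch.distributed as dist

print(dist.is_available(), dist.is_initialized())   # True False

dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:29500',
                        rank=0, world_size=1)
print(dist.get_rank(), dist.get_world_size())       # 0 1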

