
1. Preface

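In object detection, Faster RCNN is about as well known as a model can be. It contains a sub-network, the RPN (Region Proposal Network), which produces candidate regions on the feature map. How exactly does this network work, though? There are many explanations online, but most of them leave the reader in a fog, so it is more direct to read the code itself. That is what this post does.
First, look at how the RPN is structured in Faster RCNN. The RPN attaches directly to the feature output of the classification network through a single convolutional layer, rpn_conv/3x3. Two convolutional layers follow it, rpn_cls_score and rpn_bbox_pred, which produce the foreground/background classification and the predicted boxes respectively. The Python layer AnchorTargetLayer then generates the anchor-based classification labels and box regression targets. After that, ROI Proposal produces the candidate ROIs, and ROI Pooling normalizes them to a common size for subsequent processing. The overall structure is shown in the figure below: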
Generates a regular grid of multi-scale, multi-aspect anchor boxes.
Converts RPN outputs (per-anchor scores and bbox regression estimates) into object proposals.
Generates training targets/labels for each anchor. Classification labels are 1 (object), 0 (not object) or -1 (ignore).
Bbox regression targets are specified when the classification label is > 0.
Generates training targets/labels for each object proposal: classification labels 0 to K (background 0, or object class 1, …, K), with bbox regression targets specified only when the label is > 0.
Generate object detection proposals from an imdb using an RPN.

2. The RPN network


def setup(self, bottom, top):
    layer_params = yaml.load(self.param_str_)
    anchor_scales = layer_params.get('scales', (8, 16, 32)) # anchor scale parameters
    self._anchors = generate_anchors(scales=np.array(anchor_scales)) # generate the 9 default anchors
    self._num_anchors = self._anchors.shape[0]
    self._feat_stride = layer_params['feat_stride']

    # allow boxes to sit over the edge by a small amount
    # when set to 0, any anchor that crosses the image boundary, even slightly, is discarded
    self._allowed_border = layer_params.get('allowed_border', 0)

    height, width = bottom[0].data.shape[-2:]
    if DEBUG:
        print 'AnchorTargetLayer: height', height, 'width', width

    A = self._num_anchors
    # labels: object / not-object classification of each anchor
    top[0].reshape(1, 1, A * height, width)
    # bbox_targets
    top[1].reshape(1, A * 4, height, width)
    # bbox_inside_weights
    top[2].reshape(1, A * 4, height, width)
    # bbox_outside_weights
    top[3].reshape(1, A * 4, height, width)
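
generate_anchors itself is not shown in this post. As a rough illustration of where the 9 default anchors could come from, the sketch below rebuilds them in plain Python. It assumes a 16x16 base box and aspect ratios 0.5, 1 and 2, together with the scales (8, 16, 32) seen in setup. The helper names (`whctrs`, `mkanchor`, `base_anchors`), the base size and the ratios are assumptions for illustration, not taken from the code above.

```python
import math

def whctrs(box):
    """Return width, height and center of an inclusive (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    w = x2 - x1 + 1
    h = y2 - y1 + 1
    return w, h, x1 + 0.5 * (w - 1), y1 + 0.5 * (h - 1)

def mkanchor(w, h, x_ctr, y_ctr):
    """Build an inclusive (x1, y1, x2, y2) box from size and center."""
    return [x_ctr - 0.5 * (w - 1), y_ctr - 0.5 * (h - 1),
            x_ctr + 0.5 * (w - 1), y_ctr + 0.5 * (h - 1)]

def base_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Enumerate len(ratios) * len(scales) anchors around one base box (assumed defaults)."""
    w, h, x_ctr, y_ctr = whctrs([0, 0, base_size - 1, base_size - 1])
    area = w * h
    anchors = []
    for ratio in ratios:
        # keep the area roughly constant while changing the aspect ratio
        ws = round(math.sqrt(area / float(ratio)))
        hs = round(ws * ratio)
        rw, rh, rx, ry = whctrs(mkanchor(ws, hs, x_ctr, y_ctr))
        for scale in scales:
            anchors.append(mkanchor(rw * scale, rh * scale, rx, ry))
    return anchors

anchors = base_anchors()
```

Under these assumptions the square anchor at scale 8 covers 128x128 pixels, centered on the 16x16 base box.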


# 1. Generate proposals from bbox deltas and shifted anchors
# shifts along x: one per feature-map column (width entries)
shift_x = np.arange(0, width) * self._feat_stride
# shifts along y: one per feature-map row (height entries)
shift_y = np.arange(0, height) * self._feat_stride
# after meshgrid, shift_x and shift_y are both 2-D arrays of shape (height, width);
# elements at the same position pair up into the offset applied on the image.
# adding these offsets to the base anchors yields all anchors,
# width * height * 9 in total, stored in all_anchors
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose() # shape (width*height, 4)
# add A anchors (1, A, 4) to
# cell K shifts (K, 1, 4) to get
# shift anchors (K, A, 4)
# reshape to (K*A, 4) shifted anchors
A = self._num_anchors
K = shifts.shape[0] # K = width * height
# expand the 9 base anchors into K*A anchors, which is the total anchor count
all_anchors = (self._anchors.reshape((1, A, 4)) +
               shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
all_anchors = all_anchors.reshape((K * A, 4))
total_anchors = int(K * A) # total number of anchors
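
The broadcasting step above is easier to follow on a toy grid. The sketch below is a standalone NumPy version of the same idea, using two made-up base anchors on a 2x3 feature map; `shifted_anchors` and the toy values are hypothetical names and data chosen for illustration.

```python
import numpy as np

def shifted_anchors(base_anchors, height, width, feat_stride):
    """Place every base anchor at every feature-map cell via broadcasting."""
    shift_x = np.arange(width) * feat_stride
    shift_y = np.arange(height) * feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)      # both (height, width)
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)  # (K, 4)
    A = base_anchors.shape[0]
    K = shifts.shape[0]
    # (1, A, 4) + (K, 1, 4) broadcasts to (K, A, 4)
    all_anchors = base_anchors[np.newaxis, :, :] + shifts[:, np.newaxis, :]
    return all_anchors.reshape(K * A, 4)

# Two toy base anchors (not the real defaults), stride 16, 2 rows x 3 columns.
base = np.array([[-8, -8, 7, 7], [-16, -16, 15, 15]])
grid = shifted_anchors(base, height=2, width=3, feat_stride=16)
```

Rows are ordered cell by cell (row-major over the feature map), with the A anchors of one cell stored next to each other.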


# only keep anchors inside the image; anchors that cross the boundary are dropped
inds_inside = np.where(
    (all_anchors[:, 0] >= -self._allowed_border) &
    (all_anchors[:, 1] >= -self._allowed_border) &
    (all_anchors[:, 2] < im_info[1] + self._allowed_border) &  # width
    (all_anchors[:, 3] < im_info[0] + self._allowed_border)    # height
)[0]

# keep only inside anchors
anchors = all_anchors[inds_inside, :]

# label: 1 is positive, 0 is negative, -1 is dont care
# one object / not-object label per inside anchor
labels = np.empty((len(inds_inside), ), dtype=np.float32)
labels.fill(-1)

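With the anchors generated, the next step is to relate each anchor to the ground-truth (gt) boxes. The relation is measured by their overlap, and each anchor's object / not-object label is set from this measure.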

# overlaps between the anchors and the gt boxes
# overlaps (ex, gt): a 2-D array of shape (num_anchors, num_gt_boxes)
overlaps = bbox_overlaps(
    np.ascontiguousarray(anchors, dtype=np.float64),
    np.ascontiguousarray(gt_boxes, dtype=np.float64))
argmax_overlaps = overlaps.argmax(axis=1) # for each anchor, the gt with the largest overlap
max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps] # that largest overlap value, per anchor
gt_argmax_overlaps = overlaps.argmax(axis=0) # for each gt, the anchor with the largest overlap
gt_max_overlaps = overlaps[gt_argmax_overlaps,
                           np.arange(overlaps.shape[1])] # that largest overlap value, per gt
gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0] # every anchor that ties for some gt's maximum
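
bbox_overlaps is a compiled helper whose body is not shown here. A minimal plain-Python sketch of what such an overlap matrix contains is given below, assuming the overlap is intersection-over-union with inclusive pixel coordinates (hence the `+ 1`); `iou` and `overlap_matrix` are hypothetical names.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two inclusive (x1, y1, x2, y2) boxes."""
    iw = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]) + 1
    ih = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    inter = iw * ih
    return inter / float(area_a + area_b - inter)

def overlap_matrix(anchors, gt_boxes):
    """Rows index anchors, columns index gt boxes."""
    return [[iou(a, g) for g in gt_boxes] for a in anchors]

anchors = [[0, 0, 9, 9], [5, 0, 14, 9], [20, 20, 29, 29]]
gt_boxes = [[0, 0, 9, 9]]
overlaps = overlap_matrix(anchors, gt_boxes)

# per-anchor best gt, and per-gt best anchor (the two argmax directions)
argmax_overlaps = [row.index(max(row)) for row in overlaps]
gt_argmax = [max(range(len(anchors)), key=lambda i: overlaps[i][j])
             for j in range(len(gt_boxes))]
```

Here the first anchor coincides with the gt box, the second covers half of it, and the third does not touch it.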

# anchors whose largest overlap is below the negative threshold (0.3) are labeled 0
if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:
    # assign bg labels first so that positive labels can clobber them
    labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

# fg label: for each gt, anchor with highest overlap is labeled 1
labels[gt_argmax_overlaps] = 1

# fg label: above threshold IOU; anchors whose overlap reaches 0.7 are also labeled 1
labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1

if cfg.TRAIN.RPN_CLOBBER_POSITIVES:
    # assign bg labels last so that negative labels can clobber positives
    labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0
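
The order of these assignments matters, which a small example makes visible. The sketch below replays the rules in plain Python on a hand-written overlap matrix, assuming thresholds of 0.3 and 0.7 and the variant where background labels are assigned first; `assign_labels` is a hypothetical name.

```python
def assign_labels(overlaps, neg_thresh=0.3, pos_thresh=0.7):
    """Label anchors 1 (fg), 0 (bg) or -1 (ignore) from an anchors x gt overlap matrix."""
    n_anchors = len(overlaps)
    n_gt = len(overlaps[0])
    max_ov = [max(row) for row in overlaps]
    labels = [-1] * n_anchors
    # background first, so that later positive rules can overwrite it
    for i in range(n_anchors):
        if max_ov[i] < neg_thresh:
            labels[i] = 0
    # every gt gets at least its best anchor(s), ties included
    for j in range(n_gt):
        best = max(overlaps[i][j] for i in range(n_anchors))
        for i in range(n_anchors):
            if overlaps[i][j] == best:
                labels[i] = 1
    # anchors above the positive threshold
    for i in range(n_anchors):
        if max_ov[i] >= pos_thresh:
            labels[i] = 1
    return labels

# 4 anchors x 2 gt boxes
overlaps = [[0.8, 0.1], [0.5, 0.2], [0.1, 0.25], [0.05, 0.0]]
labels = assign_labels(overlaps)
```

The third anchor has a largest overlap of only 0.25, so it is first marked background, yet it ends up positive because it is the best anchor for the second gt box. The second anchor falls between the two thresholds and stays at -1.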

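Of the two code fragments below, the first samples 128 anchors from all foreground anchors and the second samples 128 from all background anchors. If there are fewer than 128 foreground anchors, all of them are kept and the shortfall is made up with background anchors. This is the same way Fast RCNN samples ROIs.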

# subsample positive labels if we have too many
# keep 128 of the anchors labeled 1 and set the label of the rest to -1
num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE) # sampling cap for positives
fg_inds = np.where(labels == 1)[0]
if len(fg_inds) > num_fg:
    disable_inds = npr.choice(
        fg_inds, size=(len(fg_inds) - num_fg), replace=False)
    labels[disable_inds] = -1

# subsample negative labels if we have too many
# num_bg is not fixed at 128: it is 256 minus the number of anchors labeled 1, so background anchors fill the batch when positives fall short
num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)
bg_inds = np.where(labels == 0)[0]
if len(bg_inds) > num_bg:
    disable_inds = npr.choice(
        bg_inds, size=(len(bg_inds) - num_bg), replace=False)
    labels[disable_inds] = -1
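
The effect of the two sampling steps can be checked with a small standalone sketch. It assumes a batch of 256 and a foreground fraction of 0.5 (so a cap of 128 positives), and uses the standard `random` module instead of `npr`; `subsample` is a hypothetical name.

```python
import random

def subsample(labels, batch_size=256, fg_fraction=0.5, rng=random):
    """Disable (set to -1) surplus positives, then surplus negatives."""
    labels = list(labels)
    num_fg = int(fg_fraction * batch_size)
    fg_inds = [i for i, lab in enumerate(labels) if lab == 1]
    if len(fg_inds) > num_fg:
        for i in rng.sample(fg_inds, len(fg_inds) - num_fg):
            labels[i] = -1
    # the background quota is whatever the positives left unused
    num_bg = batch_size - sum(1 for lab in labels if lab == 1)
    bg_inds = [i for i, lab in enumerate(labels) if lab == 0]
    if len(bg_inds) > num_bg:
        for i in rng.sample(bg_inds, len(bg_inds) - num_bg):
            labels[i] = -1
    return labels

# only 10 positives: background fills the remaining 246 slots
few_fg = subsample([1] * 10 + [0] * 1000 + [-1] * 50)
# 300 positives: capped at 128, background gets the other 128
many_fg = subsample([1] * 300 + [0] * 1000)
```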

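This loss function is much like the one in Fast RCNN: smooth L1 is computed separately for each coordinate, so the parameters $p_i^*$ and $N_{reg}$ have to be turned into 4-dimensional vectors rather than the single scalars that appear in the paper.
$N_{reg}$ is the normalization term, i.e. taking the mean. The mean runs over all anchors labeled 0 or 1. Because 256 anchors are selected for training, this factor is effectively $\frac{1}{256}$.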
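
For reference, the RPN multi-task loss as given in the Faster R-CNN paper is reproduced below, since the discussion here refers to its symbols. Reading $p_i^*$ in the regression term as bbox_inside_weights and the factor $\frac{1}{N_{reg}}$ as bbox_outside_weights is this post's interpretation of the code, not notation from the paper.

```latex
L(\{p_i\}, \{t_i\}) =
    \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
    + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

Here $p_i$ is the predicted objectness of anchor $i$, $p_i^*$ its label, and $t_i$, $t_i^*$ the predicted and target box offsets.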

bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32) # one regression target row per inside anchor
bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :]) # offsets from each anchor to its best-matching gt box, used for regression

bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
bbox_inside_weights[labels == 1, :] = np.array(cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
# normalize the per-example weights
if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:
    # uniform weighting of examples (given non-uniform sampling)
    num_examples = np.sum(labels >= 0)
    positive_weights = np.ones((1, 4)) * 1.0 / num_examples
    negative_weights = np.ones((1, 4)) * 1.0 / num_examples
else:
    assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
            (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
    positive_weights = (cfg.TRAIN.RPN_POSITIVE_WEIGHT /
                        np.sum(labels == 1))
    negative_weights = ((1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) /
                        np.sum(labels == 0))
bbox_outside_weights[labels == 1, :] = positive_weights
bbox_outside_weights[labels == 0, :] = negative_weights


# map up to original set of anchors
# maps data of length len(inds_inside) back to data of length total_anchors, where total_anchors = (width*height) * 9
labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)
bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)
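
The body of _unmap is not shown in this post. A minimal sketch of what such a scatter-back helper might look like is given below; the name `unmap` and its exact behavior are assumptions inferred from how it is called above.

```python
import numpy as np

def unmap(data, count, inds, fill=0):
    """Scatter rows of `data` to positions `inds` in an array of length `count`.

    Positions not listed in `inds` (anchors outside the image) get `fill`.
    """
    if data.ndim == 1:
        ret = np.full((count,), fill, dtype=np.float32)
        ret[inds] = data
    else:
        ret = np.full((count,) + data.shape[1:], fill, dtype=np.float32)
        ret[inds, :] = data
    return ret

# 5 anchors in total, of which only anchors 1 and 3 lie inside the image
inds_inside = np.array([1, 3])
labels = unmap(np.array([1, 0], dtype=np.float32), 5, inds_inside, fill=-1)
targets = unmap(np.ones((2, 4), dtype=np.float32), 5, inds_inside, fill=0)
```

Anchors outside the image therefore receive label -1 and all-zero targets and weights, so they do not contribute to either loss.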

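Note that the RPN is trained on 256 anchors, 128 positive and 128 negative. Yet anchor_target_layer does not output labels and coordinate transforms for just those 256 anchors; it outputs them for all anchors, which the _unmap function makes plain. So how does training end up using only the 256? Of this layer's 4 outputs, rpn_labels goes to the rpn_loss_cls layer and the other 3 go to rpn_loss_bbox. The label is the $p_i^*$ in the first half of the loss function (the classification loss). That is a log loss: anchors labeled -1 cannot take part in the log computation and are skipped, while the remaining 0 and 1 labels are used directly, and this is how the first half is restricted to the 256. The second half of the loss function is the bbox coordinate loss, where $p_i^*$ is bbox_inside_weights. The paper says it is "activated only for positive anchors": the coordinate loss is computed only for positive anchors, for which $p_i^*$ is 1, and it is 0 in every other case. So only those 256 anchors actually change the loss value, and all the others contribute 0.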

bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
bbox_inside_weights[labels == 1, :] = np.array(cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)



# labels
labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)
labels = labels.reshape((1, 1, A * height, width))
top[0].data[...] = labels

# bbox_targets
bbox_targets = bbox_targets \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
top[1].data[...] = bbox_targets

# bbox_inside_weights
bbox_inside_weights = bbox_inside_weights \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
assert bbox_inside_weights.shape[2] == height
assert bbox_inside_weights.shape[3] == width
top[2].data[...] = bbox_inside_weights

# bbox_outside_weights
bbox_outside_weights = bbox_outside_weights \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
assert bbox_outside_weights.shape[2] == height
assert bbox_outside_weights.shape[3] == width
top[3].data[...] = bbox_outside_weights



3. The ROI Proposal network

3.1 ProposalLayer

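This layer takes 3 inputs: the fg/bg anchor classifier output rpn_cls_prob_reshape, the corresponding bbox regression offsets $[d_x(A), d_y(A), d_w(A), d_h(A)]$ in rpn_bbox_pred, and im_info. It also takes the parameter feat_stride=16.
First, im_info. An image of arbitrary size is rescaled to a fixed $M \times N$ before it is passed into Faster RCNN, and im_info = [M, N, scale_factor] records all the information about that rescaling. After the conv layers, which include 4 poolings, the feature map has size $(M/16) \times (N/16)$, and feat_stride=16 records this. All of these values exist so that proposals can be mapped back to the original image.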
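
As a small worked example of how these values are used, the sketch below derives an approximate feature-map size from im_info and maps a box from the rescaled image back to the original one. The function names, the integer division and the sample numbers are assumptions for illustration; the exact rounding depends on the pooling layers of the backbone.

```python
def feature_map_size(im_info, feat_stride=16):
    """Approximate feature-map height and width for a rescaled M x N input."""
    # integer division is an assumption; real layers may round up instead
    return int(im_info[0]) // feat_stride, int(im_info[1]) // feat_stride

def to_original_scale(box, im_info):
    """Map (x1, y1, x2, y2) from the rescaled image back to the original image."""
    scale = im_info[2]
    return [coord / scale for coord in box]

# Hypothetical case: a 300 x 400 image rescaled by a factor of 2 to 600 x 800.
im_info = [600.0, 800.0, 2.0]
feat_h, feat_w = feature_map_size(im_info)
original_box = to_original_scale([100.0, 200.0, 300.0, 400.0], im_info)
```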

def setup(self, bottom, top):
    # parse the layer parameter string, which must be valid YAML
    layer_params = yaml.load(self.param_str_)

    self._feat_stride = layer_params['feat_stride']
    anchor_scales = layer_params.get('scales', (8, 16, 32))
    self._anchors = generate_anchors(scales=np.array(anchor_scales)) # generate the 9 default anchors
    self._num_anchors = self._anchors.shape[0]

    if DEBUG:
        print 'feat_stride: {}'.format(self._feat_stride)
        print 'anchors:'
        print self._anchors

    # rois blob: holds R regions of interest, each is a 5-tuple
    # (n, x1, y1, x2, y2) specifying an image batch index n and a
    # rectangle (x1, y1, x2, y2)
    top[0].reshape(1, 5)

    # scores blob: holds scores for R regions of interest
    if len(top) > 1:
        top[1].reshape(1, 1, 1, 1)


cfg_key = str(self.phase) # either 'TRAIN' or 'TEST'; the NMS input/output counts differ between the two phases
# Number of top scoring boxes to keep before apply NMS to RPN proposals
pre_nms_topN  = cfg[cfg_key].RPN_PRE_NMS_TOP_N # 12000 for TRAIN
# Number of top scoring boxes to keep after applying NMS to RPN proposals
post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N # 2000 for TRAIN
# NMS threshold used on RPN proposals
nms_thresh    = cfg[cfg_key].RPN_NMS_THRESH # 0.7
# Proposal height and width both need to be greater than RPN_MIN_SIZE (at orig image scale)
min_size      = cfg[cfg_key].RPN_MIN_SIZE # 16

# the first set of _num_anchors channels are bg probs
# the second set are the fg probs, which we want
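# the first 9 channels are background scores; the last 9 are foreground scores
scores = bottom[0].data[:, self._num_anchors:, :, :] # predicted foreground scores (the conv output has 18 channels)
bbox_deltas = bottom[1].data # predicted box offsets
im_info = bottom[2].data[0, :] # image information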


# 1. Generate proposals from bbox deltas and shifted anchors
height, width = scores.shape[-2:]

if DEBUG:
    print 'score map size: {}'.format(scores.shape)

# Enumerate all shifts (same as in anchor_target_layer)
shift_x = np.arange(0, width) * self._feat_stride
shift_y = np.arange(0, height) * self._feat_stride
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()

# Enumerate all shifted anchors:
# add A anchors (1, A, 4) to
# cell K shifts (K, 1, 4) to get
# shift anchors (K, A, 4)
# reshape to (K*A, 4) shifted anchors
A = self._num_anchors
K = shifts.shape[0]
anchors = self._anchors.reshape((1, A, 4)) + \
                  shifts.reshape((1, K, 4)).transpose((1, 0, 2))
anchors = anchors.reshape((K * A, 4))

# Transpose and reshape predicted bbox transformations to get them
# into the same order as the anchors:
# bbox deltas will be (1, 4 * A, H, W) format
# transpose to (1, H, W, 4 * A)
# reshape to (1 * H * W * A, 4) where rows are ordered by (h, w, a)
# in slowest to fastest order
bbox_deltas = bbox_deltas.transpose((0, 2, 3, 1)).reshape((-1, 4))

# Same story for the scores:
# scores are (1, A, H, W) format
# transpose to (1, H, W, A)
# reshape to (1 * H * W * A, 1) where rows are ordered by (h, w, a)
scores = scores.transpose((0, 2, 3, 1)).reshape((-1, 1))

# Convert anchors into proposals via bbox transformations
# apply bbox_deltas to the anchors to obtain the predicted proposal positions (see the formulas in the paper)
# x and y use a linear transform; w and h use exp
proposals = bbox_transform_inv(anchors, bbox_deltas)
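
The body of bbox_transform_inv is not shown here. The single-box sketch below illustrates the transform the comments describe: a linear shift of the center scaled by the anchor size, and an exponential rescale of width and height. `decode_box` is a hypothetical name, and the details (inclusive widths, no `- 1` when converting back to corners) are assumptions.

```python
import math

def decode_box(anchor, delta):
    """Apply (dx, dy, dw, dh) to an (x1, y1, x2, y2) anchor."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = delta
    w = x2 - x1 + 1.0
    h = y2 - y1 + 1.0
    ctr_x = x1 + 0.5 * w
    ctr_y = y1 + 0.5 * h
    # center: linear in the deltas, scaled by the anchor size
    pred_ctr_x = dx * w + ctr_x
    pred_ctr_y = dy * h + ctr_y
    # size: exponential, so the predicted size is always positive
    pred_w = math.exp(dw) * w
    pred_h = math.exp(dh) * h
    return [pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
            pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h]

anchor = [0, 0, 15, 15]
shifted = decode_box(anchor, [0.5, 0.0, 0.0, 0.0])          # move right by half a width
widened = decode_box(anchor, [0.0, 0.0, math.log(2), 0.0])  # double the width
```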


# 2. clip predicted boxes to image so that they lie inside the image boundary
proposals = clip_boxes(proposals, im_info[:2])


# 3. remove predicted boxes with either height or width < threshold
# (NOTE: convert min_size to input image scale stored in im_info[2])
# drop predicted boxes whose width or height is below 16, the stride left by the 4 poolings
keep = _filter_boxes(proposals, min_size * im_info[2])
proposals = proposals[keep, :]
scores = scores[keep]
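
clip_boxes and _filter_boxes are small helpers that are not listed in this post. A plain-Python sketch of what steps 2 and 3 amount to is shown below, assuming inclusive pixel coordinates; `clip_box` and `keep_large_enough` are hypothetical names.

```python
def clip_box(box, im_h, im_w):
    """Clamp an (x1, y1, x2, y2) box to the image rectangle."""
    x1, y1, x2, y2 = box
    return [max(min(x1, im_w - 1), 0), max(min(y1, im_h - 1), 0),
            max(min(x2, im_w - 1), 0), max(min(y2, im_h - 1), 0)]

def keep_large_enough(boxes, min_size):
    """Indices of boxes whose width and height are both at least min_size."""
    return [i for i, (x1, y1, x2, y2) in enumerate(boxes)
            if (x2 - x1 + 1) >= min_size and (y2 - y1 + 1) >= min_size]

boxes = [[-10, -5, 50, 700], [100, 100, 105, 300], [20, 20, 60, 60]]
clipped = [clip_box(b, im_h=600, im_w=800) for b in boxes]
# min_size is given at original-image scale, so it is multiplied by the scale factor
keep = keep_large_enough(clipped, min_size=16 * 1.0)
```

The second box is only 6 pixels wide, so it is dropped even though it is tall.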


# 4. sort all (proposal, score) pairs by score from highest to lowest
# 5. take top pre_nms_topN (e.g. 6000); NMS is applied afterwards, see the settings above
order = scores.ravel().argsort()[::-1]
if pre_nms_topN > 0:
    order = order[:pre_nms_topN]
proposals = proposals[order, :] # coordinates of the top pre_nms_topN boxes
scores = scores[order] # scores of the top pre_nms_topN boxes


# 6. apply nms (e.g. threshold = 0.7)
# 7. take after_nms_topN (e.g. 300)
# 8. return the top proposals (-> RoIs top); NMS is run on the predicted boxes here
keep = nms(np.hstack((proposals, scores)), nms_thresh)
if post_nms_topN > 0:
    keep = keep[:post_nms_topN]
proposals = proposals[keep, :] # keep the first post_nms_topN boxes that survive NMS
scores = scores[keep]
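
The nms call above goes to a compiled routine. For readers who have not seen the algorithm, a greedy non-maximum suppression can be sketched in plain Python as below; `iou` and `greedy_nms` are hypothetical names and this is only an illustration of the idea, not the implementation used by the layer.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two inclusive (x1, y1, x2, y2) boxes."""
    iw = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]) + 1
    ih = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    inter = iw * ih
    return inter / float(area_a + area_b - inter)

def greedy_nms(dets, thresh):
    """dets: list of (x1, y1, x2, y2, score). Returns kept indices, best score first."""
    order = sorted(range(len(dets)), key=lambda i: dets[i][4], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop every remaining box that overlaps the kept one too much
        order = [j for j in order
                 if iou(dets[best][:4], dets[j][:4]) <= thresh]
    return keep

dets = [(0, 0, 9, 9, 0.9), (0, 1, 9, 10, 0.8), (50, 50, 59, 59, 0.7)]
keep = greedy_nms(dets, 0.7)
post_nms_topN = 1
top = keep[:post_nms_topN]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed; the distant third box survives.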


# Output rois blob
# Our RPN implementation only supports a single input image, so all
# batch inds are 0
batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
blob = np.hstack((batch_inds, proposals.astype(np.float32, copy=False)))
top[0].data[...] = blob

# [Optional] output scores blob
if len(top) > 1:
    top[1].data[...] = scores

3.2 ProposalTargetLayer


def setup(self, bottom, top):
    layer_params = yaml.load(self.param_str_)
    self._num_classes = layer_params['num_classes']

    # sampled rois (0, x1, y1, x2, y2)
    top[0].reshape(1, 5)
    # labels
    top[1].reshape(1, 1)
    # bbox_targets
    top[2].reshape(1, self._num_classes * 4)
    # bbox_inside_weights
    top[3].reshape(1, self._num_classes * 4)
    # bbox_outside_weights
    top[4].reshape(1, self._num_classes * 4)


def forward(self, bottom, top):
    # Proposal ROIs (0, x1, y1, x2, y2) coming from RPN
    # (i.e., rpn.proposal_layer.ProposalLayer), or any other source
    all_rois = bottom[0].data # boxes predicted by the RPN, shape [N, 5]
    # GT boxes (x1, y1, x2, y2, label)
    # TODO(rbg): it's annoying that sometimes I have extra info before
    # and other times after box coordinates -- normalize to one format
    gt_boxes = bottom[1].data # GT info, shape [M, 5]

    # Include ground-truth boxes in the set of candidate rois
    # adding the ground-truth boxes to the boxes to be classified increases the number of positive samples
    # all_rois becomes [N+M, 5]: the boxes selected from the RPN output stacked with the ground-truth boxes
    zeros = np.zeros((gt_boxes.shape[0], 1), dtype=gt_boxes.dtype)
    all_rois = np.vstack(
        (all_rois, np.hstack((zeros, gt_boxes[:, :-1])))
    ) # prepend a 0 to each ground-truth box (so it lines up with the N RPN boxes), then append the ground-truth boxes at the end

    # Sanity check: single batch only
    assert np.all(all_rois[:, 0] == 0), \
        'Only single item batches are supported'

    num_images = 1
    rois_per_image = cfg.TRAIN.BATCH_SIZE / num_images # cfg.TRAIN.BATCH_SIZE is 128
    # cfg.TRAIN.FG_FRACTION is 0.25, so one classification training batch holds at most 32 foreground boxes
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)

    # Sample rois with classification labels and bounding box regression
    # targets
    # _sample_rois picks the boxes used for classification training and computes their class and coordinate ground truth, plus the bbox_inside_weights needed for the box regression loss
    labels, rois, bbox_targets, bbox_inside_weights = _sample_rois(
        all_rois, gt_boxes, fg_rois_per_image,
        rois_per_image, self._num_classes)

    if DEBUG:
        print 'num fg: {}'.format((labels > 0).sum())
        print 'num bg: {}'.format((labels == 0).sum())
        self._count += 1
        self._fg_num += (labels > 0).sum()
        self._bg_num += (labels == 0).sum()
        print 'num fg avg: {}'.format(self._fg_num / self._count)
        print 'num bg avg: {}'.format(self._bg_num / self._count)
        print 'ratio: {:.3f}'.format(float(self._fg_num) / float(self._bg_num))

    # sampled rois: all boxes kept after sampling
    top[0].data[...] = rois

    # classification labels of the sampled boxes
    top[1].data[...] = labels

    # bbox_targets: regression offsets between the sampled boxes and their GT
    top[2].data[...] = bbox_targets

    # bbox_inside_weights
    top[3].data[...] = bbox_inside_weights

    # bbox_outside_weights
    top[4].data[...] = np.array(bbox_inside_weights > 0).astype(np.float32)


def _sample_rois(all_rois, gt_boxes, fg_rois_per_image, rois_per_image, num_classes):
    """Generate a random sample of RoIs comprising foreground and background
    examples.
    """
    # overlaps: (rois x gt_boxes)
    # compute the overlap between every roi and every ground-truth box
    # use coordinates only: columns 1-4 of each roi (column 0 is the prepended batch index) and columns 0-3 of each gt box
    overlaps = bbox_overlaps(
        np.ascontiguousarray(all_rois[:, 1:5], dtype=np.float64),
        np.ascontiguousarray(gt_boxes[:, :4], dtype=np.float64))
    gt_assignment = overlaps.argmax(axis=1) # index of the best-matching gt box for each roi, shape [len(all_rois),]
    max_overlaps = overlaps.max(axis=1) # largest overlap with any gt box for each roi, shape [len(all_rois),]
    labels = gt_boxes[gt_assignment, 4] # class of the assigned gt box for each roi, shape [len(all_rois),]

    # Select foreground RoIs as those with >= FG_THRESH overlap
    # indices of the rois whose overlap reaches the foreground threshold
    fg_inds = np.where(max_overlaps >= cfg.TRAIN.FG_THRESH)[0]
    # Guard against the case when an image has fewer than fg_rois_per_image
    # foreground RoIs; this is the number of foreground boxes actually used
    fg_rois_per_this_image = int(min(fg_rois_per_image, fg_inds.size))
    # Sample foreground regions without replacement
    # randomly drop some foreground boxes when there are more than needed
    if fg_inds.size > 0:
        fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)

    # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
    # rois that belong to the background (overlap with the gt boxes between 0 and 0.5)
    bg_inds = np.where((max_overlaps < cfg.TRAIN.BG_THRESH_HI) &
                       (max_overlaps >= cfg.TRAIN.BG_THRESH_LO))[0]
    # Compute number of background RoIs to take from this image (guarding
    # against there being fewer than desired)
    bg_rois_per_this_image = rois_per_image - fg_rois_per_this_image # 128 - 32 when enough foreground boxes exist
    bg_rois_per_this_image = int(min(bg_rois_per_this_image, bg_inds.size)) # handled the same way as fg below
    # Sample background regions without replacement
    if bg_inds.size > 0:
        bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)

    # The indices that we're selecting (both fg and bg)
    keep_inds = np.append(fg_inds, bg_inds) # indices of the boxes finally kept
    # Select sampled values from various arrays:
    labels = labels[keep_inds]  # labels of the kept boxes
    # Clamp labels for the background RoIs to 0
    labels[fg_rois_per_this_image:] = 0 # set the class of the background boxes to 0
    rois = all_rois[keep_inds] # the kept rois

    # class ground truth and coordinate-transform ground truth of the kept boxes (the box regression offsets)
    bbox_target_data = _compute_targets(
        rois[:, 1:5], gt_boxes[gt_assignment[keep_inds], :4], labels)

    # the ground-truth box regression values and bbox_inside_weights used when computing the final loss
    bbox_targets, bbox_inside_weights = \
        _get_bbox_regression_labels(bbox_target_data, num_classes)

    return labels, rois, bbox_targets, bbox_inside_weights


def _compute_targets(ex_rois, gt_rois, labels):
    """Compute bounding-box regression targets for an image."""

    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 4

    targets = bbox_transform(ex_rois, gt_rois) # regression offsets between the boxes and their gt
    if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:
        # Optionally normalize targets by a precomputed mean and stdev
        targets = ((targets - np.array(cfg.TRAIN.BBOX_NORMALIZE_MEANS))
                / np.array(cfg.TRAIN.BBOX_NORMALIZE_STDS))
    # put the offsets after the label (horizontal stack)
    return np.hstack(
            (labels[:, np.newaxis], targets)).astype(np.float32, copy=False)
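
bbox_transform is the encoding direction of the box regression and its body is not shown in this post. The single-box sketch below illustrates the usual parameterization, followed by the optional mean/stdev normalization. `encode_box` and `normalize` are hypothetical names, and the default means and stds used here are assumptions for illustration.

```python
import math

def encode_box(ex, gt):
    """Offsets (dx, dy, dw, dh) that take box `ex` to box `gt` (inclusive coordinates)."""
    ex_w = ex[2] - ex[0] + 1.0
    ex_h = ex[3] - ex[1] + 1.0
    ex_cx = ex[0] + 0.5 * ex_w
    ex_cy = ex[1] + 0.5 * ex_h
    gt_w = gt[2] - gt[0] + 1.0
    gt_h = gt[3] - gt[1] + 1.0
    gt_cx = gt[0] + 0.5 * gt_w
    gt_cy = gt[1] + 0.5 * gt_h
    return [(gt_cx - ex_cx) / ex_w, (gt_cy - ex_cy) / ex_h,
            math.log(gt_w / ex_w), math.log(gt_h / ex_h)]

def normalize(targets, means=(0.0, 0.0, 0.0, 0.0), stds=(0.1, 0.1, 0.2, 0.2)):
    """Standardize targets with precomputed statistics (assumed values)."""
    return [(t - m) / s for t, m, s in zip(targets, means, stds)]

same = encode_box([0, 0, 9, 9], [0, 0, 9, 9])        # identical boxes: no offset
bigger = encode_box([0, 0, 9, 9], [0, 0, 19, 19])    # gt twice as wide and tall
scaled = normalize(bigger)
```

Dividing the center offsets by the box size and taking the log of the size ratio keeps the targets comparable across boxes of very different scale.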


def _get_bbox_regression_labels(bbox_target_data, num_classes):
    """Bounding-box regression targets (bbox_target_data) are stored in a
    compact form N x (class, tx, ty, tw, th)

    This function expands those targets into the 4-of-4*K representation used
    by the network (i.e. only one class has non-zero targets).

    Returns:
        bbox_target (ndarray): N x 4K blob of regression targets
        bbox_inside_weights (ndarray): N x 4K blob of loss weights
    """

    clss = bbox_target_data[:, 0]  # class of each box, obtained from its overlap with the gt boxes
    # regression targets, stored in the slot of the matching class
    bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)
    # initialize bbox_inside_weights with zeros
    bbox_inside_weights = np.zeros(bbox_targets.shape, dtype=np.float32)
    inds = np.where(clss > 0)[0] # non-background boxes
    for ind in inds:
        cls = int(clss[ind]) # cast to int so it can be used in a slice index
        start = 4 * cls # start of the 4 regression values for this class
        end = start + 4  # end of the 4 regression values for this class
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]  # write the targets into the slot of this class
        # set bbox_inside_weights in the slot of this class
        bbox_inside_weights[ind, start:end] = cfg.TRAIN.BBOX_INSIDE_WEIGHTS # (1.0, 1.0, 1.0, 1.0)
    return bbox_targets, bbox_inside_weights
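
The "4-of-4*K" layout is easiest to see on a tiny input. The plain-Python sketch below expands two compact rows for a 3-class problem; `expand_targets` is a hypothetical name and the inside weight of 1.0 per coordinate is an assumption matching the comment above.

```python
def expand_targets(bbox_target_data, num_classes):
    """Expand N x (class, tx, ty, tw, th) rows into N x 4K targets and weights."""
    n = len(bbox_target_data)
    targets = [[0.0] * (4 * num_classes) for _ in range(n)]
    inside_weights = [[0.0] * (4 * num_classes) for _ in range(n)]
    for row, (cls, tx, ty, tw, th) in enumerate(bbox_target_data):
        cls = int(cls)
        if cls > 0:  # background rows (class 0) stay all zero
            start = 4 * cls
            targets[row][start:start + 4] = [tx, ty, tw, th]
            inside_weights[row][start:start + 4] = [1.0, 1.0, 1.0, 1.0]
    return targets, inside_weights

# one box of class 2 and one background box, with K = 3 classes
data = [(2, 0.1, 0.2, 0.3, 0.4), (0, 0.0, 0.0, 0.0, 0.0)]
targets, inside_weights = expand_targets(data, num_classes=3)
```

Only the 4 columns belonging to the box's own class are filled, so the regression loss for every other class is masked out by the zero weights.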

4. ROI Pooling

An explanation of the ROI Pooling Layer

5. REF

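  1. An explanation of the remaining parts of the anchor_target_layer layer
  2. A detailed Faster R-CNN source code analysis: proposal_layer and proposal_target_layer
  3. Faster RCNN principle analysis (part 2): Region Proposal Networks in detail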