【VIS】Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

Abstract

MaskProp extends Mask R-CNN with a module that propagates instance masks across a video clip. This lets the model predict clip-level instance tracks from the instance segmentation of the clip's center frame.

Method

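MaskProp takes a video of arbitrary length L as input and outputs video-level instance segmentation tracks $M^i$, together with a category $c^i$ and a confidence score $s^i$ for each track.

The method first builds clip-level object instance tracks for clips of length 2T+1. T has to be small enough to fit in GPU memory, yet large enough to cope with some occlusion and blur.

The tracks from the L clips are then merged.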

4.1. Video Mask R-CNN

Loss:

$$L_t = L_t^{cls} + L_t^{box} + L_t^{mask} + L_t^{prop}$$

Here t denotes the center frame of a clip. The propagation loss is

$$L_t^{prop} = \sum_{t'=t-T}^{t+T} \sum_{i=1}^{\tilde{N}_t} \left( 1 - \mathrm{sIoU}\left( M^i_{t-T:t+T}(t'),\ \tilde{M}^i_{t'} \right) \right)$$

where $M^i_{t-T:t+T}(t')$ is the segmentation of instance i at frame t' predicted from the clip centered at frame t,

$\tilde{M}^i_{t'}$ is the ground-truth mask at t',

and $\tilde{N}_t$ is the number of ground-truth instances in frame t.

$$\mathrm{sIoU}(A, B) = \frac{\sum_p A(p) B(p)}{\sum_p \left( A(p) + B(p) - A(p) B(p) \right)}$$

p ranges over pixel locations. Overall this is a soft IoU loss, which works better than the usual cross-entropy loss.
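The soft IoU loss can be sketched in a few lines of NumPy. This is only an illustration: it assumes the loss sums 1 - sIoU over frames and instances, and the function names, the array layout (frames, instances, H, W) and the small `eps` term are assumptions made for the example, not part of the notes.

```python
import numpy as np

def soft_iou(a, b, eps=1e-6):
    """Soft IoU between two masks of equal shape with values in [0, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    inter = np.sum(a * b)            # soft intersection
    union = np.sum(a + b - a * b)    # soft union
    return inter / (union + eps)     # eps (assumed) avoids 0/0 for two empty masks

def prop_loss(pred, gt):
    """Sum of (1 - sIoU) over all frames and instances.

    pred, gt: arrays of shape (frames, instances, H, W).
    """
    total = 0.0
    for f in range(pred.shape[0]):
        for i in range(pred.shape[1]):
            total += 1.0 - soft_iou(pred[f, i], gt[f, i])
    return total
```

Unlike a per-pixel cross-entropy, the ratio is computed per mask, so a small object contributes as much to the loss as a large one.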

4.2. Mask Propagation Branch

Overview

The mask propagation branch is what tracks instances. Given a video clip centered at frame t, it outputs clip-level instance masks for every object instance in frame t. It works in three steps:

1) instance-specific feature computation: extract the features

2) temporal propagation of instance features: propagate the features

3) propagated instance segmentation: segment

Computing Instance Specific Features

The mask branch first predicts frame-level instance masks $M_t^i \in \mathbb{R}^{1 \times H' \times W'}$.

These frame-level instance masks are then used to compute instance-specific features for frame t: for each object i, we take the element-wise product between $M_t^i$ and the backbone feature map $f_t$. This yields new tensors $f_t^i \in \mathbb{R}^{C \times H' \times W'}$; in short, the backbone-feature pixels that do not belong to the object are masked out.
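As a minimal sketch of this masking step, the element-wise product is a single broadcast multiplication in NumPy. The function name and the tensor layouts, (C, H, W) for the backbone feature and (N, H, W) for the N instance masks, are assumptions for the example.

```python
import numpy as np

def instance_features(backbone_feat, masks):
    """Mask the backbone feature once per instance.

    backbone_feat: (C, H, W) feature map of frame t.
    masks:         (N, H, W) instance masks with values in [0, 1].
    Returns:       (N, C, H, W), one masked copy of the feature per instance.
    """
    # Broadcasting: (N, 1, H, W) * (1, C, H, W) -> (N, C, H, W)
    return masks[:, None, :, :] * backbone_feat[None, :, :, :]
```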

Temporally Propagating Instance Features

Given the frame-level features $f_t$, $f_{t+\delta}$ and the instance-specific feature $f_t^i$, the method produces a propagated instance feature tensor $g^i_{t,t+\delta}$, which represents the features of object i in frame $t+\delta$ as predicted from $f_t^i$.

Concretely, $f_t^i$ is warped using the alignment computed from $f_t$ and $f_{t+\delta}$. This is implemented with a deformable convolution: the element-wise difference of $f_t$ and $f_{t+\delta}$ is fed to a residual block, which produces motion offsets $o_{t,t+\delta} \in \mathbb{R}^{2k^2 \times H' \times W'}$. These offsets hold the (x, y) sampling location for every entry of the k×k deformable convolution kernel.

The propagation step takes as input 1) the offsets $o_{t,t+\delta}$ and 2) the instance feature $f_t^i$, and applies the deformable convolution to produce $g^i_{t,t+\delta}$.
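To make the role of the offsets concrete, here is a deliberately simplified NumPy stand-in for the warping step. It is not a deformable convolution: it assumes a single sampling location per output pixel (the k = 1 case, with no learned kernel weights) and bilinear interpolation with zeros outside the feature map. In the method described above, the offsets would come from the residual block; here they are simply passed in. The function name and the (dy, dx) offset layout are assumptions for the example.

```python
import numpy as np

def warp_with_offsets(feat, offsets):
    """Sample feat at (y + dy, x + dx) for every output pixel (y, x).

    feat:    (C, H, W) instance feature.
    offsets: (2, H, W) with offsets[0] = dy and offsets[1] = dx.
    Returns: (C, H, W) warped feature; samples outside the map count as zero.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = ys + offsets[0]
    sx = xs + offsets[1]
    y0 = np.floor(sy).astype(int)
    x0 = np.floor(sx).astype(int)
    wy = sy - y0          # fractional parts give the bilinear weights
    wx = sx - x0
    out = np.zeros((C, H, W), dtype=float)
    corners = [
        (0, 0, (1 - wy) * (1 - wx)),
        (0, 1, (1 - wy) * wx),
        (1, 0, wy * (1 - wx)),
        (1, 1, wy * wx),
    ]
    for dy, dx, w in corners:
        yy, xx = y0 + dy, x0 + dx
        valid = (yy >= 0) & (yy < H) & (xx >= 0) & (xx < W)
        yc = np.clip(yy, 0, H - 1)
        xc = np.clip(xx, 0, W - 1)
        out += feat[:, yc, xc] * (w * valid)
    return out
```

With dx = -1 everywhere, each output pixel reads from its left neighbor, so the content of the feature map moves one pixel to the right. A real deformable convolution predicts k×k such locations per pixel and combines the samples with learned weights.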

Segmenting Propagated Instances

$g^i_{t,t+\delta}$ is then used to predict the corresponding object mask in frame $t+\delta$. A new feature is built first: $\phi^i_{t,t+\delta} = g^i_{t,t+\delta} + f_{t+\delta}$.

This feature is fed to a 1x1 convolution that outputs the corresponding object mask in frame $t+\delta$, followed by a softmax nonlinearity across all $N_t$ instances.

To handle pixels that belong to no object instance, a 3x3 convolution computes an instance-agnostic attention map $A_{t+\delta}$ from $f_{t+\delta}$. $A_{t+\delta}$ is then multiplied with each of the predicted instance masks.
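The segmentation step can be sketched as follows, under stated assumptions: the propagated feature is added to the target frame's backbone feature, the 1x1 convolution is written as a per-pixel dot product with a weight vector `w` and bias `b`, and the attention map is passed in precomputed rather than produced by a 3x3 convolution. All names and shapes are illustrative.

```python
import numpy as np

def segment_propagated(g, f_next, w, b, attn):
    """Predict instance masks in the target frame from propagated features.

    g:      (N, C, H, W) propagated instance features.
    f_next: (C, H, W) backbone feature of the target frame.
    w, b:   (C,) weights and scalar bias of a 1x1 convolution (assumed).
    attn:   (H, W) instance-agnostic attention map with values in [0, 1].
    Returns (N, H, W) masks.
    """
    phi = g + f_next[None]                          # combine with frame feature
    logits = np.einsum("c,nchw->nhw", w, phi) + b   # 1x1 conv = per-pixel dot product
    logits = logits - logits.max(axis=0, keepdims=True)  # numerically stable softmax
    e = np.exp(logits)
    masks = e / e.sum(axis=0, keepdims=True)        # softmax across the N instances
    return masks * attn[None]                       # suppress background pixels
```

The softmax makes the instances compete for each pixel, but it always distributes a total mass of 1, even on background. The attention map is what lets a pixel belong to no instance at all.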

4.3. Video-Level Segmentation Instances

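Once every frame of every clip has been segmented, the clip-level results still have to be linked together. Each clip-level instance track is assigned a video-level instance ID, and tracks are linked by matching these IDs.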

Matching Clip-Level Instance Tracks

Consider a pair of tracks $M^i_{t-T:t+T}$ and $M^j_{t'-T:t'+T}$, one centered at t and the other at t'. The two may overlap in time. Denote the overlapping time interval by $\cap_{t,t'}$. To decide whether the tracks match, we compare the instance masks they predict in the overlapping frames. The following matching score is used to determine whether they are the same object instance.

$$m^{i,j}_{t,t'} = \frac{1}{|\cap_{t,t'}|} \sum_{\tilde{t} \in \cap_{t,t'}} \mathrm{sIoU}\left( M^i_{t-T:t+T}(\tilde{t}),\ M^j_{t'-T:t'+T}(\tilde{t}) \right)$$
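A possible sketch of the matching score, assuming it is the mean soft IoU over the frames the two tracks share. Representing a track as a dict from frame index to mask, and returning 0.0 when there is no overlap, are choices made for this example.

```python
import numpy as np

def soft_iou(a, b, eps=1e-6):
    """Soft IoU between two masks with values in [0, 1]."""
    inter = np.sum(a * b)
    return inter / (np.sum(a + b - a * b) + eps)

def matching_score(track_a, track_b):
    """Mean soft IoU over the frames shared by two clip-level tracks.

    Each track is a dict {frame_index: (H, W) mask}.
    Returns 0.0 when the tracks share no frame (assumed convention).
    """
    shared = sorted(set(track_a) & set(track_b))
    if not shared:
        return 0.0
    return float(np.mean([soft_iou(track_a[f], track_b[f]) for f in shared]))
```

Two clips centered one frame apart share 2T of their 2T+1 frames, so the score is averaged over many frames and is less sensitive to a single bad mask.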

Video-Level Instance ID Assignment

Let $\mathcal{Y}$ denote the set of video-level IDs. $\mathcal{Y}$ is built by matching clip-level instance tracks in order from t=1 to t=L.

At t=1 we initialize $\mathcal{Y} = \{1, \dots, N_1\}$ with $y_1^i = i$.

$y_t^i$ denotes the video-level ID assigned to the track $M^i_{t-T:t+T}$.

For t>1, an ID $y_t^i$ is assigned to $M^i_{t-T:t+T}$ by matching it against all tracks $M^j_{t'-T:t'+T}$ from earlier time steps t' < t that overlap it.

For each ID y already in $\mathcal{Y}$, we compute a score $q_t^i(y)$ that measures how well $M^i_{t-T:t+T}$ matches the tracks already assigned to that ID.

$$q_t^i(y) = \frac{\sum_{t' < t,\ \cap_{t,t'} \neq \emptyset} \sum_{j=1}^{N_{t'}} \mathbb{1}\{y_{t'}^j = y\} \cdot m^{i,j}_{t,t'}}{\sum_{t' < t,\ \cap_{t,t'} \neq \emptyset} \sum_{j=1}^{N_{t'}} \mathbb{1}\{y_{t'}^j = y\}}$$

$\mathbb{1}\{\cdot\}$ is the indicator function.

$q^* = \max_{y \in \mathcal{Y}} q_t^i(y)$ is the maximum score. If q* exceeds a threshold, $M^i_{t-T:t+T}$ is assigned the ID $y_t^i = \arg\max_{y \in \mathcal{Y}} q_t^i(y)$. Otherwise the clip track matches none of the existing IDs, so it is given a new ID and the ID set is extended.
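The assignment procedure can be sketched as a greedy loop over clips. The sketch assumes that the score for an ID y is the average matching score against earlier overlapping tracks already labeled y, and that the matching score is a mean soft IoU over shared frames. The data layout, the function names and the threshold value 0.5 are placeholders for illustration.

```python
import numpy as np

def matching_score(a, b, eps=1e-6):
    """Mean soft IoU over frames shared by two tracks; None if they do not overlap."""
    shared = set(a) & set(b)
    if not shared:
        return None
    ious = [np.sum(a[f] * b[f]) / (np.sum(a[f] + b[f] - a[f] * b[f]) + eps)
            for f in shared]
    return float(np.mean(ious))

def assign_video_ids(clips, threshold=0.5):
    """Greedy video-level ID assignment.

    clips: list over center frames; clips[t] is a list of clip-level tracks,
           each a dict {frame_index: (H, W) mask}.
    Returns ids, where ids[t][i] is the video-level ID (starting at 1)
    of track i in clip t.
    """
    ids, num_ids = [], 0
    for t, tracks in enumerate(clips):
        ids_t = []
        for track in tracks:
            # Group matching scores against earlier overlapping tracks by their ID.
            per_id = {}
            for tp in range(t):
                for j, prev in enumerate(clips[tp]):
                    m = matching_score(track, prev)
                    if m is not None:
                        per_id.setdefault(ids[tp][j], []).append(m)
            # q(y): average score over the earlier tracks assigned to y.
            q = {y: sum(v) / len(v) for y, v in per_id.items()}
            best = max(q, key=q.get) if q else None
            if best is not None and q[best] > threshold:
                ids_t.append(best)        # reuse the best-matching existing ID
            else:
                num_ids += 1              # no good match: open a new ID
                ids_t.append(num_ids)
        ids.append(ids_t)
    return ids
```

For the first clip there are no earlier tracks, so every track opens a new ID, which reproduces the initialization at t=1.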

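Finally, every clip-level track has an ID, which gives the final mask sequence for each video-level ID $y \in \mathcal{Y}$:

$$M_t^y(p) = \begin{cases} M^i_{t-T:t+T}(t)(p) & \text{if } y_t^i = y \\ 0 & \text{otherwise} \end{cases}$$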