[ICCV 2019] YOLACT Real-time Instance Segmentation


1. Author

Daniel Bolya, Chong Zhou, Fanyi Xiao, Yong Jae Lee
University of California, Davis

2. Abstract

We present a simple, fully-convolutional model for real-time instance segmentation that achieves 29.8 mAP on MS COCO at 33.5 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach.

Moreover, we obtain this result after training on only one GPU.

We accomplish this by breaking instance segmentation into two parallel subtasks:
(1) generating a set of prototype masks and
(2) predicting per-instance mask coefficients.
Then we produce instance masks by linearly combining the prototypes with the mask coefficients.

Finally, we also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty.
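To make this concrete, here is a minimal PyTorch-style sketch of the Fast NMS idea, written class-agnostically for brevity (the paper applies it per class); function names and default values are illustrative, not the released code.

```python
import torch

def box_iou(boxes_a, boxes_b):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format."""
    tl = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])  # intersection top-left
    br = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # intersection bottom-right
    inter = (br - tl).clamp(min=0).prod(dim=2)
    area_a = (boxes_a[:, 2:] - boxes_a[:, :2]).prod(dim=1)
    area_b = (boxes_b[:, 2:] - boxes_b[:, :2]).prod(dim=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def fast_nms(boxes, scores, iou_threshold=0.5, top_k=200):
    """Matrix-form NMS: a box is suppressed by ANY higher-scoring box it
    overlaps, even one that was itself suppressed. Dropping the sequential
    dependency of standard NMS trades a small amount of accuracy for speed."""
    _, order = scores.sort(descending=True)
    order = order[:top_k]
    iou = box_iou(boxes[order], boxes[order])
    iou = iou.triu(diagonal=1)        # compare each box only against higher-scoring ones
    max_overlap, _ = iou.max(dim=0)   # worst overlap from any higher-scoring box
    keep = max_overlap <= iou_threshold
    return order[keep]
```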

3. Introduction

In this work, our goal is to fill the speed gap in instance segmentation with a fast, one-stage model, in the same way that SSD and YOLO filled that gap for object detection.

One-stage object detectors like SSD and YOLO are able to speed up existing two-stage detectors like Faster R-CNN by simply removing the second stage and making up for the lost performance in other ways.

State-of-the-art two-stage instance segmentation methods depend heavily on feature localization to produce masks.
These methods “repool” features in some bounding box region (e.g., via RoIpool/align), and then feed these now localized features to their mask predictor.

One-stage methods that perform these steps in parallel, like FCIS, do exist, but they require significant amounts of post-processing after localization and thus are still far from real-time.

This approach also has several practical advantages.

  1. First and foremost, it’s fast: because of its parallel structure and extremely lightweight assembly process, YOLACT adds only a marginal amount of computational overhead to a one-stage backbone detector, making it easy to reach 30 fps even when using ResNet-101; in fact, the entire mask branch takes only ∼5 ms to evaluate.
  2. Second, masks are high-quality: since the masks use the full extent of the image space without any loss of quality from repooling, our masks for large objects are significantly higher quality than those of other methods.
  3. Finally, it’s general: the idea of generating prototypes and mask coefficients could be added to almost any modern object detector.

4. YOLACT

Our goal is to add a mask branch to an existing one-stage object detection model in the same vein as Mask R-CNN does to Faster R-CNN, but without an explicit feature localization step (e.g., feature repooling).

To do this, we break up the complex task of instance segmentation into two simpler, parallel tasks that can be assembled to form the final masks.

The first branch uses an FCN to produce a set of image-sized “prototype masks” that do not depend on any one instance.
The second adds an extra head to the object detection branch to predict a vector of “mask coefficients” for each anchor that encode an instance’s representation in the prototype space.
Finally, for each instance that survives NMS, we construct a mask for that instance by linearly combining the work of these two branches.

4.1 Rationale

Masks are spatially coherent, i.e., pixels close to each other are likely to belong to the same instance; conv layers naturally take advantage of this coherence, while fc layers do not. Thus, we break the problem into two parallel parts, making use of fc layers, which are good at producing semantic vectors, and conv layers, which are good at producing spatially coherent masks, to produce the “mask coefficients” and “prototype masks”, respectively.

Because prototypes and mask coefficients can be computed independently, the computational overhead over that of the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication.

In this way, we can maintain spatial coherence in the feature space while still being one-stage and fast.

4.2 Prototype Generation

All supervision for these prototypes comes from the final mask loss after assembly.

We note two important design choices: taking protonet from deeper backbone features produces more robust masks, and higher resolution prototypes result in both higher quality masks and better performance on smaller objects.
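For concreteness, a protonet along these lines can be sketched as a small FCN attached to a single FPN level; the layer count, channel width, and 2× upsampling below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoNet(nn.Module):
    """Sketch of a prototype branch: a few convs on one FPN level, upsampled
    and projected to k prototype channels."""
    def __init__(self, in_channels=256, k=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.out = nn.Conv2d(256, k, 1)

    def forward(self, x):
        x = self.convs(x)
        # Upsample so prototypes are higher resolution than the FPN feature,
        # which helps mask quality on smaller objects.
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        # ReLU keeps the prototypes unbounded above, allowing strong
        # positive activations.
        return F.relu(self.out(x))
```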

4.3 Mask Coefficients

Typical anchor-based object detectors have two branches in their prediction heads: one branch to predict c class confidences, and the other to predict 4 bounding box regressors. For mask coefficient prediction, we simply add a third branch in parallel that predicts k mask coefficients, one corresponding to each prototype.

We apply tanh to the k mask coefficients, which produces more stable outputs than using no nonlinearity.
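A minimal sketch of such a head follows, assuming a anchors per spatial location on a shared feature map; channel counts and the class count are illustrative.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Three parallel branches per anchor: class confidences, box regressors,
    and mask coefficients."""
    def __init__(self, in_channels=256, num_anchors=3, num_classes=81, k=32):
        super().__init__()
        a = num_anchors
        self.cls  = nn.Conv2d(in_channels, a * num_classes, 3, padding=1)  # c class confidences
        self.box  = nn.Conv2d(in_channels, a * 4, 3, padding=1)            # 4 box regressors
        self.coef = nn.Conv2d(in_channels, a * k, 3, padding=1)            # k mask coefficients

    def forward(self, x):
        # tanh bounds the coefficients to [-1, 1], which also lets the model
        # subtract prototypes from the final mask, not just add them.
        return self.cls(x), self.box(x), torch.tanh(self.coef(x))
```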

4.4 Mask Assembly

To produce instance masks, we combine the work of the prototype branch and mask coefficient branch, using a linear combination of the former with the latter as coefficients.
These operations can be implemented efficiently using a single matrix multiplication and sigmoid:
M = \sigma(P C^{T})
Using a linear combination keeps assembly simple and fast.
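In code, the assembly step above is exactly one matrix multiplication followed by a sigmoid; this sketch assumes prototypes P of shape (h, w, k) and coefficients C of shape (n, k) for the n instances surviving NMS.

```python
import torch

def assemble_masks(P, C):
    """M = sigma(P C^T): linearly combine the k prototypes with each
    instance's k coefficients, yielding one (h, w) mask per instance."""
    h, w, k = P.shape
    M = torch.sigmoid(P.reshape(h * w, k) @ C.t())  # (h*w, n)
    return M.reshape(h, w, -1)                       # (h, w, n)
```

The paper then crops each assembled mask with its predicted bounding box and thresholds it to produce the final binary mask.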

4.5 Emergent Behavior

We observe many prototypes to activate on certain “partitions” of the image. That is, they only activate on objects on one side of an implicitly learned boundary.

Increasing k is ineffective most likely because predicting coefficients is difficult.

4.6 Backbone Detector

The design of our backbone detector closely follows RetinaNet with an emphasis on speed.

We apply a smooth-L1 loss to train the box regressors and encode box regression coordinates in the same way as SSD.
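For reference, SSD-style encoding turns a matched ground-truth box into anchor-relative regression targets; this sketch assumes (cx, cy, w, h) boxes and omits SSD's variance scaling for clarity.

```python
import torch

def encode(gt, anchor):
    """Regression targets: center offsets scaled by anchor size, plus
    log-ratios of widths and heights. Shapes are (..., 4) in (cx, cy, w, h)."""
    t_x = (gt[..., 0] - anchor[..., 0]) / anchor[..., 2]
    t_y = (gt[..., 1] - anchor[..., 1]) / anchor[..., 3]
    t_w = torch.log(gt[..., 2] / anchor[..., 2])
    t_h = torch.log(gt[..., 3] / anchor[..., 3])
    return torch.stack([t_x, t_y, t_w, t_h], dim=-1)
```

The smooth-L1 loss is then applied between these targets and the predicted regressors.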

Unlike RetinaNet, we do not use focal loss, which we found not to be viable in our situation.

5. Results
