[ICCV 2019] YOLACT Real-time Instance Segmentation
1. Author
Daniel Bolya, Chong Zhou, Fanyi Xiao, Yong Jae Lee
University of California, Davis
2. Abstract
We present a simple, fully-convolutional model for real-time instance segmentation that achieves 29.8 mAP on MS COCO at 33.5 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach. Moreover, we obtain this result after training on only one GPU.
We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. Finally, we also propose Fast NMS, a drop-in replacement for standard NMS that is 12 ms faster with only a marginal performance penalty.
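The Fast NMS idea can be sketched in a few lines of NumPy (a rough illustration, not the paper's implementation; the function name and the `[x1, y1, x2, y2]` box format are our assumptions). The key relaxation is that already-suppressed detections are still allowed to suppress others, which collapses the usual sequential loop into one pairwise IoU matrix and a column-wise max:

```python
import numpy as np

def fast_nms(boxes, scores, iou_threshold=0.5):
    """Fast NMS sketch. boxes: (n, 4) as [x1, y1, x2, y2]; scores: (n,)."""
    # Sort detections by descending score.
    order = np.argsort(-scores)
    b = boxes[order]

    # Pairwise IoU matrix (n x n), computed fully vectorized.
    x1 = np.maximum(b[:, None, 0], b[None, :, 0])
    y1 = np.maximum(b[:, None, 1], b[None, :, 1])
    x2 = np.minimum(b[:, None, 2], b[None, :, 2])
    y2 = np.minimum(b[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    iou = inter / (areas[:, None] + areas[None, :] - inter)

    # Keep only the upper triangle: each box's IoU with higher-scoring boxes.
    iou = np.triu(iou, k=1)
    # A box survives if no higher-scoring box overlaps it above the threshold,
    # even if that higher-scoring box was itself suppressed.
    keep = iou.max(axis=0) <= iou_threshold
    return order[keep]
```

Because everything reduces to a matrix computation and a max, this runs entirely on the GPU in practice, which is where the reported 12 ms saving comes from.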
3. Introduction
In this work, our goal is to fill that gap with a fast, one-stage instance segmentation model in the same way that SSD and YOLO fill that gap for object detection. One-stage object detectors like SSD and YOLO are able to speed up existing two-stage detectors like Faster R-CNN by simply removing the second stage and making up for the lost performance in other ways.
State-of-the-art two-stage instance segmentation methods depend heavily on feature localization to produce masks. These methods "repool" features in some bounding box region (e.g., via RoIPool/RoIAlign) and then feed these now-localized features to their mask predictor. One-stage methods that perform these steps in parallel, like FCIS, do exist, but they require significant amounts of post-processing after localization, and thus are still far from real-time.
This approach also has several practical advantages.
- First and foremost, it's fast: because of its parallel structure and extremely lightweight assembly process, our method adds only a marginal amount of computational overhead to a one-stage backbone detector, making it easy to reach 30 fps even when using ResNet-101; in fact, the entire mask branch takes only ∼5 ms to evaluate.
- Second, masks are high-quality: since the masks use the full extent of the image space without any loss of quality from repooling, our masks for large objects are significantly higher quality than those of other methods.
- Finally, it's general: the idea of generating prototypes and mask coefficients could be added to almost any modern object detector.
4. YOLACT
Our goal is to add a mask branch to an existing one-stage object detection model in the same vein as Mask R-CNN does to Faster R-CNN, but without an explicit feature localization step (e.g., feature repooling). To do this, we break up the complex task of instance segmentation into two simpler, parallel tasks that can be assembled to form the final masks. The first branch uses an FCN to produce a set of image-sized "prototype masks" that do not depend on any one instance. The second adds an extra head to the object detection branch to predict a vector of "mask coefficients" for each anchor that encodes an instance's representation in the prototype space. Finally, for each instance that survives NMS, we construct a mask for that instance by linearly combining the work of these two branches.
4.1 Rationale
Thus, we break the problem into two parallel parts, making use of fc layers, which are good at producing semantic vectors, and conv layers, which are good at producing spatially coherent masks, to produce the "mask coefficients" and "prototype masks", respectively. Because prototypes and mask coefficients can be computed independently, the computational overhead over that of the backbone detector comes mostly from the assembly step, which can be implemented as a single matrix multiplication. In this way, we can maintain spatial coherence in the feature space while still being one-stage and fast.
4.2 Prototype Generation
All supervision for these prototypes comes from the final mask loss after assembly. We note two important design choices: taking protonet from deeper backbone features produces more robust masks, and higher-resolution prototypes result in both higher-quality masks and better performance on smaller objects.
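Both design choices can be seen in a minimal protonet sketch (a PyTorch illustration under assumptions: layer counts and channel widths are ours, not the paper's exact architecture; k = 32 prototypes is the paper's default). It is a small FCN on a deep, 256-channel FPN feature map, with one bilinear upsampling so the prototypes come out at higher resolution, and a final ReLU so prototypes are non-negative:

```python
import torch
import torch.nn as nn

k = 32  # number of prototypes (the paper's default)

# Sketch of a protonet: a few 3x3 convs on deep FPN features, one 2x
# bilinear upsample for higher-resolution prototypes, then a 1x1 conv
# down to k prototype channels.
protonet = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, k, 1),
    nn.ReLU(),  # keep prototypes non-negative
)
```

Note that there is no dedicated prototype loss here: gradients reach these layers only through the final assembled-mask loss.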
4.3 Mask Coefficients
Typical anchor-based object detectors have two branches in their prediction heads: one branch to predict c class confidences, and the other to predict 4 bounding box regressors. For mask coefficient prediction, we simply add a third branch in parallel that predicts k mask coefficients, one corresponding to each prototype.
We apply tanh to the k mask coefficients, which produces more stable outputs than applying no nonlinearity.
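The three parallel branches can be sketched as follows (a PyTorch illustration; the class name, anchor count, and kernel sizes are assumptions, not the paper's exact head). The tanh bounds the coefficients in [-1, 1], so an instance can both add and subtract prototypes:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch of an anchor-based head with a third, parallel coefficient branch."""
    def __init__(self, in_channels=256, num_classes=81, num_anchors=3, k=32):
        super().__init__()
        # Per anchor: c class confidences, 4 box regressors, k mask coefficients.
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.box = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.coef = nn.Conv2d(in_channels, num_anchors * k, 3, padding=1)

    def forward(self, x):
        # tanh keeps coefficients in [-1, 1] (stable, and allows subtraction).
        return self.cls(x), self.box(x), torch.tanh(self.coef(x))
```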
4.4 Mask Assembly
To produce instance masks, we combine the work of the prototype branch and mask coefficient branch, using a linear combination of the former with the latter as coefficients. These operations can be implemented efficiently using a single matrix multiplication and sigmoid:
M = σ(PCᵀ)
where P is an h × w × k matrix of prototype masks and C is an n × k matrix of mask coefficients for the n instances surviving NMS and score thresholding. Using a linear combination keeps the assembly simple and fast.
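The assembly step is literally one matrix multiplication followed by a sigmoid. A NumPy sketch (the function name is ours; shapes follow P and C above):

```python
import numpy as np

def assemble_masks(protos, coeffs):
    """protos: (h, w, k) prototype masks; coeffs: (n, k) mask coefficients.
    Returns (h, w, n) soft instance masks in [0, 1]."""
    # M = sigma(P C^T): one matmul combines every prototype for every instance.
    return 1.0 / (1.0 + np.exp(-(protos @ coeffs.T)))
```

In the full pipeline the resulting soft masks are then cropped with the predicted box and thresholded, but the combination itself is just this one operation, which is why the mask branch adds so little overhead.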
4.5 Emergent Behavior
We observe that many prototypes activate on certain "partitions" of the image; that is, they activate only on objects on one side of an implicitly learned boundary. Increasing k is ineffective, most likely because predicting coefficients is difficult.
4.6 Backbone Detector
The design of our backbone detector closely follows RetinaNet with an emphasis on speed.
We apply smooth-L1 loss to train box regressors and encode box regression coordinates in the same way as SSD.
Unlike RetinaNet we do not use focal loss, which we found not to be viable in our situation.
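The SSD-style box training can be illustrated with a short sketch (assumptions: the variance values 0.1/0.2 are the common SSD defaults, and the helper names are ours; boxes and anchors are in center form (cx, cy, w, h)). Centers are regressed relative to the anchor size and sizes in log space, with a smooth-L1 loss on the residuals:

```python
import numpy as np

def encode(box, anchor, variances=(0.1, 0.2)):
    """SSD-style encoding of a ground-truth box against an anchor."""
    tx = (box[0] - anchor[0]) / (variances[0] * anchor[2])
    ty = (box[1] - anchor[1]) / (variances[0] * anchor[3])
    tw = np.log(box[2] / anchor[2]) / variances[1]
    th = np.log(box[3] / anchor[3]) / variances[1]
    return np.array([tx, ty, tw, th])

def smooth_l1(pred, target):
    """Smooth-L1 loss: quadratic near zero, linear for large residuals."""
    d = np.abs(pred - target)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()
```

A sanity check of the design: an anchor that exactly matches its ground-truth box encodes to all zeros, and its loss is zero.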