CornerNet-Lite: Efficient Keypoint-Based Object Detection

As mentioned before, CornerNet's main strength is its competitive results on the MS COCO dataset. Nevertheless, it has one major drawback.
It is slow.

To overcome this issue, the authors proposed CornerNet-Lite: a combination of two efficient variants of CornerNet:

1. CornerNet-Saccade: It uses an attention mechanism to avoid exhaustively processing all pixels of the image.
2. CornerNet-Squeeze: It introduces a new compact backbone architecture.

These two methods improve the two critical features of efficient object detection: high efficiency without sacrificing accuracy, and high accuracy at real-time efficiency (Figure 1).


Figure 1: CornerNet-Saccade speeds up the original CornerNet by 6.0x with a 1% increase in AP. CornerNet-Squeeze is faster and more accurate than YOLOv3, the state-of-the-art real-time detector.
Let’s dive into both of these new methods and see what is so great about them.

CornerNet-Saccade: Overview

What does the word “saccade” mean?

Saccade refers to rapid eye movement that shifts the center of gaze from one part of the visual field to another. Saccades are mainly used for orienting gaze towards an object of interest.

The method is inspired by and derives its name from this natural phenomenon.

Figure 2 below shows an overview of CornerNet-Saccade. Let’s examine it in detail.

Figure 2: An overview of CornerNet-Saccade.

Estimating Object Locations

The network operates on two scales of the input image. At the higher scale, the longer side of the image is resized to 255 pixels; at the lower scale, to 192 pixels. The 192-pixel image is zero-padded to 255 so that both scales can be processed in parallel.
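As a rough illustration, this two-scale preprocessing can be sketched in NumPy; the nearest-neighbor resizing and the helper names here are my own, not from the paper:

```python
import numpy as np

def resize_long_side(img, target):
    """Nearest-neighbor resize so the longer side equals `target`."""
    h, w = img.shape[:2]
    scale = target / max(h, w)
    rows = (np.arange(int(round(h * scale))) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(int(round(w * scale))) / scale).astype(int).clip(0, w - 1)
    return img[rows][:, cols]

def pad_to(img, size):
    """Zero-pad an image (top-left aligned) onto a size x size canvas."""
    canvas = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    canvas[: img.shape[0], : img.shape[1]] = img
    return canvas

def two_scale_batch(img, hi=255, lo=192):
    """Build the 255- and 192-pixel versions and pad both to 255
    so they can be stacked and processed in parallel."""
    return np.stack([pad_to(resize_long_side(img, s), hi) for s in (hi, lo)])
```

For a 400×300 input image this yields a single batch of shape (2, 255, 255, 3).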

For a downsized image, CornerNet-Saccade predicts 3 attention maps: one for small objects, one for medium objects and one for large objects.

The attention maps are predicted by using feature maps at different scales, obtained from the backbone network, which is an hourglass network.
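In spirit, reading candidate object locations off the three attention maps amounts to simple thresholding. A minimal sketch (the 0.3 threshold and the dictionary layout are illustrative assumptions):

```python
import numpy as np

def locations_from_attention(att_maps, threshold=0.3):
    """att_maps: dict mapping 'small'/'medium'/'large' to a 2-D score map.
    Returns (scale_name, y, x) tuples for every score above the threshold."""
    locs = []
    for name, amap in att_maps.items():
        ys, xs = np.where(amap > threshold)
        locs.extend((name, y, x) for y, x in zip(ys, xs))
    return locs
```

Each returned location then determines both where to zoom in and (via its scale name) how much to zoom.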

Detecting Objects

The bounding boxes obtained from the downsized image may not be accurate and therefore are also examined at higher resolutions to get better bounding boxes.

At each possible object location (x, y), the original image is zoomed in by a scale factor that depends on whether the object is small, medium, or large: smaller objects get larger zoom factors.

Then CornerNet-Saccade is applied to a 255×255 window centered at the location for detecting possible bounding boxes.
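The zoom-and-crop step can be sketched as follows. The zoom factors below (4 for small, 2 for medium, 1 for large objects) and the nearest-neighbor zoom are illustrative assumptions following the smaller-objects-get-larger-zoom idea:

```python
import numpy as np

ZOOM = {"small": 4.0, "medium": 2.0, "large": 1.0}  # assumed factors

def crop_window(img, cx, cy, size_class, win=255):
    """Zoom the image by a class-dependent factor, then take a
    win x win crop centered at the (scaled) location (cx, cy)."""
    s = ZOOM[size_class]
    h, w = img.shape[:2]
    # nearest-neighbor zoom by factor s
    rows = (np.arange(int(h * s)) / s).astype(int).clip(0, h - 1)
    cols = (np.arange(int(w * s)) / s).astype(int).clip(0, w - 1)
    zoomed = img[rows][:, cols]
    cx, cy = int(cx * s), int(cy * s)
    half = win // 2
    out = np.zeros((win, win) + img.shape[2:], dtype=img.dtype)
    # clamp the crop to the zoomed image and paste it into the window
    y0, x0 = max(cy - half, 0), max(cx - half, 0)
    y1 = min(cy + half + 1, zoomed.shape[0])
    x1 = min(cx + half + 1, zoomed.shape[1])
    out[y0 - (cy - half): y0 - (cy - half) + (y1 - y0),
        x0 - (cx - half): x0 - (cx - half) + (x1 - x0)] = zoomed[y0:y1, x0:x1]
    return out
```

Each such window is then run through the detector to produce candidate bounding boxes.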

Merging Detections

Soft-NMS is applied to merge and remove the redundant bounding boxes.

Bounding boxes that are not fully covered by a region and touch the crop boundaries are also removed as they may have low overlaps with boxes of the full objects (Figure 3).

Detected bounding boxes are then ranked by their scores, and only the top k_max of them are kept.
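A compact sketch of this merging step, with Gaussian Soft-NMS score decay and removal of boxes touching the crop boundary (the sigma, threshold, and margin values here are illustrative, not the paper's):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001, k_max=100):
    """Gaussian Soft-NMS: decay overlapping scores instead of discarding boxes."""
    dets = list(zip(boxes, scores))
    keep = []
    while dets and len(keep) < k_max:
        i = max(range(len(dets)), key=lambda j: dets[j][1])
        b, s = dets.pop(i)
        keep.append((b, s))
        # decay remaining scores in proportion to their overlap with the kept box
        dets = [(bb, sc * np.exp(-iou(b, bb) ** 2 / sigma)) for bb, sc in dets]
        dets = [(bb, sc) for bb, sc in dets if sc > score_thresh]
    return keep

def drop_boundary_boxes(boxes, crop_w, crop_h, eps=1):
    """Remove boxes touching the crop border; they likely cover truncated objects."""
    return [b for b in boxes
            if b[0] > eps and b[1] > eps and b[2] < crop_w - eps and b[3] < crop_h - eps]
```

Soft-NMS keeps overlapping boxes alive with reduced scores, so two genuinely distinct but overlapping objects are not suppressed outright.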

Figure 3: Some objects may not be fully covered by a region. The detector may still generate bounding boxes (red dashed line) for those objects. These bounding boxes are removed.

New Backbone

A new 54-layer hourglass network, named Hourglass-54, is proposed as the backbone. Each of the 3 hourglass modules in the new architecture has fewer parameters and is shallower than its counterpart in Hourglass-104.

Training Details

Input size: 255×255;
Adam optimizer;
The training hyperparameters are the same as in CornerNet;
Batch size is 48 on four 1080Ti GPUs.

CornerNet-Squeeze: Overview

Let’s move on to the second method suggested by the authors, which is called CornerNet-Squeeze. It incorporates ideas from SqueezeNet and MobileNets and reduces the amount of processing by:

Using new fire modules (Figure 4);
Hourglass module modifications:
reducing the maximum feature map resolution of the hourglass modules;
downsizing the image three times before the hourglass module, whereas CornerNet downsizes the image twice;
replacing the 3×3 filters with 1×1 filters in the prediction modules of CornerNet;
replacing the nearest neighbor upsampling with 4×4 transpose convolution.
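To make the fire-module idea concrete, here is a minimal NumPy forward pass: a 1×1 squeeze layer followed by an expand layer that concatenates a 1×1 branch with a depthwise-separable 3×3 branch. Omitting biases, batch norm, and the expand-layer activation is a simplification, and the channel counts are arbitrary:

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise conv: x is (C_in, H, W), w is (C_out, C_in) -> (C_out, H, W)."""
    return np.tensordot(w, x, axes=([1], [0]))

def depthwise3x3(x, w):
    """Per-channel 3x3 conv with zero padding 1: x is (C, H, W), w is (C, 3, 3)."""
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + h, j:j + wd]
    return out

def fire_module(x, w_sq, w_e1, w_dw, w_pw):
    """Squeeze with 1x1 convs, then expand with a 1x1 branch and a
    depthwise-separable 3x3 branch, concatenated along the channel axis."""
    s = np.maximum(conv1x1(x, w_sq), 0)            # squeeze + ReLU
    e1 = conv1x1(s, w_e1)                          # 1x1 expand branch
    e3 = conv1x1(depthwise3x3(s, w_dw), w_pw)      # depthwise separable 3x3 branch
    return np.concatenate([e1, e3], axis=0)
```

The squeeze layer shrinks the channel count before the expensive 3×3 work, and the depthwise-separable branch replaces a dense 3×3 convolution, which is where the parameter savings come from.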

Figure 4: Comparison between the residual block in CornerNet and the fire module in CornerNet-Squeeze. Dwise stands for a depthwise convolution.

Training Details

Some training details mentioned in the paper:

The training hyperparameters and losses are the same as in CornerNet;
Batch size is 55 on four 1080Ti GPUs.

Results

Figure 5 shows the accuracy and efficiency trade-off curves of CornerNet-Saccade and CornerNet-Squeeze on the MS COCO validation set compared to other object detectors, including YOLOv3, RetinaNet and CornerNet:

Figure 5: Inference time and AP on MS COCO Dataset of CornerNet-Saccade and CornerNet-Squeeze compared to other state-of-the-art one-stage detectors.

CornerNet-Saccade achieves a better accuracy and efficiency trade-off (42.6% AP at 190 ms) than both RetinaNet (39.8% at 190 ms) and CornerNet (40.6% at 213 ms). CornerNet-Squeeze achieves a better trade-off (34.4% at 30 ms) than YOLOv3 (32.4% at 39 ms). Running CornerNet-Squeeze on both the original and flipped images (test-time augmentation, TTA) improves its AP to 36.5% at the cost of a 50 ms inference time, which is still a good trade-off.

Performance Analysis of Hourglass-54

Some experiments were done to investigate the performance contribution of the new Hourglass-54 architecture. We can view attention map prediction as a binary classification problem, where the object locations are positives and the rest are negatives. With that framing, the authors measure attention map accuracy by average precision, denoted APatt. Hourglass-54 achieves an APatt of 42.7%, while Hourglass-104 achieves 40.1%, suggesting that Hourglass-54 is better at predicting attention maps (Figure 6).
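Under this binary-classification view, APatt is simply the average precision of the ranked attention scores. A generic sketch of that metric (not the authors' exact evaluation code):

```python
import numpy as np

def average_precision(scores, labels):
    """AP over a ranked list: rank items by score (descending), then
    average the precision measured at each positive hit."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)                       # positives seen so far
    precisions = hits / np.arange(1, len(labels) + 1)
    return float(precisions[labels == 1].mean())
```

For example, with ranked labels [1, 0, 1, 0] the precisions at the two hits are 1/1 and 2/3, giving an AP of 5/6.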

Figure 6: CornerNet-Saccade with Hourglass-54 produces better results compared to the previous backbone Hourglass-104.

CornerNet-Squeeze-Saccade

After reviewing both methods, you might be wondering: why not merge them? The truth is that the experiments show that combining CornerNet-Squeeze with saccades does not outperform CornerNet-Squeeze alone.

On the validation set, CornerNet-Squeeze achieves an AP of 34.4%, while CornerNet-Squeeze-Saccade achieves only 32.7% (Figure 7). To see how saccades impact accuracy, the authors replace the predicted attention maps with the ground truth. That improves the AP of CornerNet-Squeeze-Saccade to 38.0%, outperforming CornerNet-Squeeze. The results suggest that saccades can help only if the attention maps are sufficiently accurate: due to its compact architecture, CornerNet-Squeeze-Saccade does not have enough capacity to detect objects and predict accurate attention maps simultaneously.


Figure 7: Comparison between CornerNet-Squeeze-Saccade and CornerNet-Squeeze on the MS COCO validation dataset.

CornerNet-Lite versus others on MS COCO

Last but not least: CornerNet-Lite results on the MS COCO test set (Figure 8). CornerNet-Squeeze is faster and more accurate than YOLOv3. CornerNet-Saccade is more accurate than multi-scale CornerNet and 6 times faster. What an achievement!
Figure 8: CornerNet-Lite versus CornerNet and YOLOv3 on MS COCO test set.

CornerNet-Lite Code

Below is the link to the repository with the publicly available code from the authors:

Code link