High Performance Visual Tracking with Siamese Region Proposal Network: Reading Notes

1,(IDEA) In the tracking task we don’t have pre-defined categories, so we need the template branch to encode the target’s appearance into the RPN feature map in order to discriminate foreground from background.

2,(RPN) RPN has many successful applications in detection because of its speed and strong performance; however, it has not been fully exploited in tracking.

3,(NETWORK) We use a modified AlexNet in which the groups from conv2 and conv4 are removed.

4,(KERNEL) The template feature maps [φ(z)]_cls and [φ(z)]_reg are used as convolution kernels that are correlated with the detection feature maps [φ(x)]_cls and [φ(x)]_reg.
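
To make item 4 concrete, here is a minimal sketch in plain Python (nested lists rather than PyTorch tensors) of sliding one template kernel over a detection feature map. The name `xcorr_valid` and the toy sizes are illustrative assumptions, not the authors' code. In the paper the kernels are 4×4 and the detection feature map is 20×20, so each response map is 17×17.

```python
def xcorr_valid(feature, kernel):
    """Cross-correlate a multi-channel feature map with one kernel.

    feature: [C][H][W] nested lists, kernel: [C][h][w] nested lists.
    Returns an [H-h+1][W-w+1] response map (no padding, stride 1).
    """
    channels = len(feature)
    H, W = len(feature[0]), len(feature[0][0])
    h, w = len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(H - h + 1):
        row = []
        for j in range(W - w + 1):
            acc = 0.0
            # Sum the element-wise products over all channels and kernel cells.
            for c in range(channels):
                for di in range(h):
                    for dj in range(w):
                        acc += feature[c][i + di][j + dj] * kernel[c][di][dj]
            row.append(acc)
        out.append(row)
    return out


# Toy example: 1 channel, 3x3 detection feature, 2x2 template kernel.
detection = [[[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]]]
template_kernel = [[[1, 0],
                    [0, 1]]]
response = xcorr_valid(detection, template_kernel)
```

Each of the template kernels yields one such response map; stacking them gives the classification and regression outputs of the RPN head.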

5,(LOSS) The classification branch is supervised with a softmax cross-entropy loss, and we adopt a smooth L1 loss with normalized coordinates for regression.
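
The two losses in item 5 can be sketched as follows. This is a plain-Python illustration for a single anchor, assuming boxes in (cx, cy, w, h) form and `sigma=1.0`; the function names are hypothetical, and a real implementation would work on batched tensors.

```python
import math


def normalized_offsets(anchor, target):
    """Regression targets for one anchor; boxes are (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    return [(tx - ax) / aw, (ty - ay) / ah,
            math.log(tw / aw), math.log(th / ah)]


def smooth_l1(x, sigma=1.0):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    s2 = sigma ** 2
    if abs(x) < 1.0 / s2:
        return 0.5 * s2 * x * x
    return abs(x) - 1.0 / (2.0 * s2)


def regression_loss(pred, anchor, target, sigma=1.0):
    """Sum of smooth L1 over the four normalized coordinates."""
    deltas = normalized_offsets(anchor, target)
    return sum(smooth_l1(p - d, sigma) for p, d in zip(pred, deltas))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one anchor (0 = background, 1 = foreground)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]
```

The total training loss combines the two terms, with a weight on the regression term to balance them.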

6,(DATA) During the training phase, sample pairs are picked from ILSVRC with a random interval and from Youtube-BB continuously. We extract image pairs from VID and Youtube-BB by choosing frames less than 100 frames apart and then performing a further crop procedure.
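
A minimal sketch of the frame-pair selection in item 6, assuming a video is addressed by frame index. The helper `sample_pair` is hypothetical and leaves out the crop procedure. It also allows the two indices to coincide, which the notes neither require nor forbid.

```python
import random


def sample_pair(num_frames, max_interval=100, rng=random):
    """Pick (template_idx, detection_idx) less than max_interval frames apart."""
    template_idx = rng.randrange(num_frames)
    # Clamp the allowed window to the bounds of the video.
    lo = max(0, template_idx - (max_interval - 1))
    hi = min(num_frames - 1, template_idx + (max_interval - 1))
    detection_idx = rng.randint(lo, hi)
    return template_idx, detection_idx
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.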

7,(TRAIN) We train Siamese-RPN end-to-end using Stochastic Gradient Descent (SGD) after the Siamese subnetwork has been pretrained on ImageNet.

8,(AUGMENTATIONS) Because the regression branch has to be trained, some data augmentations are adopted, including affine transformation.

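9,(SAMPLE) The criterion used in object detection is adopted here: we use IoU together with two thresholds, 0.6 and 0.3. Anchors whose IoU with the ground truth is above 0.6 are positive, and anchors whose IoU is below 0.3 are negative.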

10,(SAMPLE) We also limit one training pair to at most 16 positive samples and 64 samples in total.
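
Items 9 and 10 together amount to a label-then-subsample step, sketched below in plain Python. Boxes are assumed to be (x1, y1, x2, y2) corner tuples, and the label -1 for ignored anchors is a convention chosen for this sketch, not something the notes specify.

```python
import random


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt, th_hi=0.6, th_lo=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) per anchor."""
    labels = []
    for anchor in anchors:
        overlap = iou(anchor, gt)
        if overlap > th_hi:
            labels.append(1)
        elif overlap < th_lo:
            labels.append(0)
        else:
            labels.append(-1)
    return labels


def subsample(labels, max_pos=16, total=64, rng=random):
    """Keep at most max_pos positives and `total` samples overall."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    pos = rng.sample(pos, min(len(pos), max_pos))
    neg = rng.sample(neg, min(len(neg), total - len(pos)))
    keep = set(pos) | set(neg)
    # Anchors that were not drawn are marked as ignored.
    return [l if i in keep else -1 for i, l in enumerate(labels)]
```

Anchors with IoU between the two thresholds contribute to neither loss term in this sketch.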

11,(TRICK) The first proposal selection strategy discards the bounding boxes generated by anchors too far from the center: we only keep the anchors in the central 7×7 region.
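
The center-region filter in item 11 can be pictured as a boolean mask over the score map. The sketch below assumes a 17×17 score map, a size the notes do not state, and the helper name `center_mask` is hypothetical.

```python
def center_mask(map_size=17, keep=7):
    """Boolean mask keeping the central keep x keep cells of a square score map."""
    center = map_size // 2
    half = keep // 2
    return [[abs(r - center) <= half and abs(c - center) <= half
             for c in range(map_size)]
            for r in range(map_size)]


mask = center_mask()
# Number of spatial positions that survive the filter.
kept_cells = sum(cell for row in mask for cell in row)
```

With k anchors per position, this leaves 7×7×k candidate boxes instead of one set per position of the full map.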

12,(TRICK) The second proposal selection strategy uses a cosine window and a scale change penalty to re-rank the proposals’ scores and pick the best one.
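
One way to read item 12 is sketched below. The exponent form `exp(-(change - 1) * k)`, the linear blend with the window, and the values `k=0.05` and `window_weight=0.4` are assumptions for illustration, not the authors' settings. The window is passed as a flat list with one value per proposal position.

```python
import math


def hanning(n):
    """1-D Hann (cosine) window of length n, n >= 2."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def padded_scale(w, h):
    """Overall scale s with (w + p)(h + p) = s^2 and p = (w + h) / 2."""
    p = (w + h) / 2.0
    return math.sqrt((w + p) * (h + p))


def penalty(w, h, prev_w, prev_h, k):
    """Equals 1 for an unchanged box and shrinks as scale or ratio changes."""
    s, s_prev = padded_scale(w, h), padded_scale(prev_w, prev_h)
    r, r_prev = h / w, prev_h / prev_w
    change = max(r / r_prev, r_prev / r) * max(s / s_prev, s_prev / s)
    return math.exp(-(change - 1.0) * k)


def rerank(proposals, prev_size, window, k=0.05, window_weight=0.4):
    """proposals: list of (score, w, h, window_index). Returns the best index."""
    best, best_score = -1, -float("inf")
    for idx, (score, w, h, win_idx) in enumerate(proposals):
        pscore = score * penalty(w, h, prev_size[0], prev_size[1], k)
        # Blend with the cosine window to suppress large displacements.
        pscore = pscore * (1.0 - window_weight) + window[win_idx] * window_weight
        if pscore > best_score:
            best, best_score = idx, pscore
    return best
```

A proposal with a slightly lower raw score can win if it sits near the center and keeps the previous size and aspect ratio.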

13,(TRICK) After the final bounding box is selected, the target size is updated by linear interpolation so that the shape changes smoothly.
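
The size update in item 13 is a one-line blend, sketched here. The interpolation weight `lr=0.3` is a placeholder, since the notes do not give the value used.

```python
def update_size(prev_size, new_size, lr=0.3):
    """Blend the previous (w, h) with the newly selected (w, h)."""
    return tuple((1.0 - lr) * p + lr * n for p, n in zip(prev_size, new_size))


smoothed = update_size((100.0, 50.0), (120.0, 40.0))
```

A small `lr` keeps the box size stable across frames, while a value of 1 accepts the new proposal's size outright.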

14,(TRAIN) We use a modified AlexNet pretrained on ImageNet, with the parameters of the first three convolution layers fixed; only the last two convolution layers are fine-tuned in Siamese-RPN.

15,(TRAIN) Training runs for 50 epochs in total, and the learning rate is decreased in log space from 10^-2 to 10^-6.
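
A log-space decay as in item 15 means the exponent, not the rate itself, moves in equal steps. The sketch below assumes one rate per epoch; whether the authors update per epoch or per iteration is not stated in the notes.

```python
import math


def log_space_schedule(start=1e-2, end=1e-6, epochs=50):
    """Per-epoch learning rates, evenly spaced in log10 between start and end."""
    log_start, log_end = math.log10(start), math.log10(end)
    step = (log_end - log_start) / (epochs - 1)
    return [10.0 ** (log_start + step * e) for e in range(epochs)]


rates = log_space_schedule()
```

Consecutive rates differ by a constant factor, so the decay is fast in absolute terms early on and slow near the end.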

16,(PLATFORM) Our experiments are implemented using PyTorch.

17,(ACCURACY) VOT2016 EAO: 0.3441; OTB2015 AUC: 0.637.

18,(SPEED) 160 fps.
