Learning bi-grained cross-correlation siamese networks for visual tracking

Abstract

Siamese-network-based trackers measure the similarity between a target template and a search region by computing their cross-correlation. Specifically, Siamese trackers treat the target template as a spatial filter that is convolved with the search region, which emphasizes the coarse-grained semantic abstraction of the target in the spatial domain. Despite the demonstrated success of Siamese trackers, little attention has been paid to fine-grained spatial details in the cross-correlation computation, which are crucial to precise target localization. In this paper, we propose to learn point-wise cross-correlation Siamese networks for visual tracking. By sketching the contour of the target, the proposed point-wise cross-correlation module makes Siamese networks aware of the distinctive boundary between the target and the background. In conjunction with traditional depth-wise cross-correlation, the proposed Siamese network takes advantage of both coarse-grained semantic abstraction and fine-grained details to locate the target precisely. Extensive experiments demonstrate the effectiveness and efficiency of the proposed tracker, which achieves new state-of-the-art results on five visual tracking benchmarks (VOT2016, VOT2018, VOT2019, OTB100, and LaSOT) at a speed of 38 FPS. As an extra benefit, our tracker can output a segmentation mask for the target. We demonstrate the favorable performance of our tracker on video object segmentation datasets in comparison with state-of-the-art methods.
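
To make the contrast between the two granularities concrete, the following NumPy sketch compares a depth-wise cross-correlation with one common reading of a point-wise cross-correlation, in which every spatial location of the template acts as a 1x1 kernel. The abstract does not spell out the exact formulation, so the function names, tensor layouts, and the point-wise definition used here are illustrative assumptions rather than the proposed module itself.

```python
import numpy as np


def depthwise_xcorr(search, template):
    """Slide the whole template over the search region, channel by channel.

    search:   (C, Hx, Wx) features of the search region
    template: (C, Hz, Wz) features of the target template
    returns:  (C, Hx - Hz + 1, Wx - Wz + 1) response map
    """
    C, Hx, Wx = search.shape
    _, Hz, Wz = template.shape
    out = np.zeros((C, Hx - Hz + 1, Wx - Wz + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search[:, i:i + Hz, j:j + Wz]
            # Each channel of the template is matched as one spatial block.
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out


def pointwise_xcorr(search, template):
    """Assumed point-wise variant: each template location is a 1x1 kernel.

    returns: (Hz * Wz, Hx, Wx), one full-resolution response map
             per template location
    """
    C, Hx, Wx = search.shape
    kernels = template.reshape(C, -1)        # (C, Hz*Wz)
    out = kernels.T @ search.reshape(C, -1)  # (Hz*Wz, Hx*Wx)
    return out.reshape(-1, Hx, Wx)
```

In this sketch the depth-wise response shrinks spatially because the template is matched as a whole block, whereas the point-wise response keeps the full resolution of the search region, which is one way to see why it can retain finer spatial detail near the target boundary.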