Exploiting temporal coherence for self-supervised visual tracking by using vision transformer

Authors:

Highlights:

Abstract

Deep-learning-based fully supervised visual trackers require large-scale, frame-wise annotation, which entails a laborious and tedious data annotation process. To reduce the labeling effort, this work proposes a self-supervised learning framework, ETC, that exploits temporal coherence as a self-supervisory signal and uses a vision transformer to capture the relationships among unlabeled video frames. We design a cycle-consistent transformer architecture that casts self-supervised tracking as a cycle prediction problem. With carefully designed and targeted configurations for the cycle-consistent transformer, including temporal sampling strategies, tracking initialization, and data augmentation, our approach is applicable to two tracking settings, i.e., the unlabeled-sample (ULS) setting and the few-labeled-sample (FLS) setting. To learn richer and more discriminative representations, we exploit not only inter-frame correspondence but also intra-frame correspondence, which effectively models target-to-frame and long-range relationships. Extensive experiments are conducted on the popular benchmark datasets OTB2015, VOT2018, UAV123, TColor-128, NFS, and LaSOT; the results show that our approach achieves competitive results in the ULS setting and offers a trade-off between performance and annotation cost in the FLS setting.
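To make the cycle-prediction idea concrete, the sketch below illustrates a generic forward-backward cycle-consistency loss for self-supervised tracking: a target box is propagated forward through a sampled clip and then tracked back to the first frame, and the discrepancy between the cycled-back box and the initial box serves as the training signal derived from temporal coherence. The `tracker` callable and its signature are hypothetical placeholders, not the paper's actual API or architecture.

```python
import torch
import torch.nn.functional as F


def cycle_consistency_loss(tracker, frames, init_box):
    """Minimal sketch of a forward-backward cycle loss (assumptions, not the ETC implementation).

    tracker(frame_a, frame_b, box): hypothetical callable that predicts the target box
    in frame_b given its box in frame_a. `frames` is a list of frame tensors from one
    sampled clip; `init_box` is the target box in frames[0].
    """
    # Forward pass: propagate the box through the clip frame by frame.
    box = init_box
    for t in range(1, len(frames)):
        box = tracker(frames[t - 1], frames[t], box)

    # Backward pass: track the final prediction back to the first frame.
    for t in range(len(frames) - 1, 0, -1):
        box = tracker(frames[t], frames[t - 1], box)

    # Temporal coherence: the cycled-back box should coincide with the initial box,
    # so their discrepancy is the self-supervised loss (no human labels required).
    return F.smooth_l1_loss(box, init_box)
```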

Keywords: Self-supervised learning, Visual tracking, Temporal coherence, Transformer

Article history: Received 25 December 2021, Revised 26 April 2022, Accepted 21 June 2022, Available online 27 June 2022, Version of Record 4 July 2022.

DOI: https://doi.org/10.1016/j.knosys.2022.109318