4G-VOS: Video Object Segmentation using guided context embedding

Authors:

Highlights:

Abstract

Video Object Segmentation (VOS) is a fundamental task underlying many high-level, real-world computer vision applications. VOS is challenging due to background distractors and variations in object appearance. Many existing VOS approaches rely on online model updates to capture appearance variations, which incurs a high computational cost. Template-matching and propagation-based VOS methods, although cost-effective, suffer from performance degradation in challenging scenarios such as occlusion and background clutter. To tackle these challenges, we propose a network architecture, dubbed 4G-VOS, that encodes video context for improved VOS performance. To preserve long-term semantic information, we propose a guided transfer embedding module. We employ a global instance matching module to generate similarity maps from the initial image and its mask. In addition, we use a generative directional appearance module to estimate and dynamically update the foreground/background class probabilities in a spherical embedding space. Moreover, existing approaches may lose contextual information during feature refinement; we therefore propose a guided pooled decoder that exploits both global and local contextual information during feature refinement. The proposed framework is an end-to-end architecture trained entirely offline. Evaluations on three VOS benchmark datasets, DAVIS2016, DAVIS2017, and YouTube-VOS, demonstrate outstanding performance of the proposed algorithm compared with 40 existing state-of-the-art methods.
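To illustrate the matching step described in the abstract, below is a minimal sketch (not the authors' released code) of how a global instance matching module could turn first-frame features and the given mask into foreground and background similarity maps via cosine similarity. The function name `instance_matching`, the top-k aggregation, and the use of PyTorch are assumptions made for illustration; the exact formulation in 4G-VOS may differ.

```python
# Sketch only: cosine-similarity-based instance matching between a reference
# (first) frame and the current frame, producing two similarity maps.
import torch
import torch.nn.functional as F

def instance_matching(feat_ref, mask_ref, feat_cur, top_k=20):
    """feat_ref, feat_cur: (C, H, W) backbone features; mask_ref: (H, W) binary mask."""
    C, H, W = feat_ref.shape
    # L2-normalise along channels so dot products become cosine similarities.
    ref = F.normalize(feat_ref.reshape(C, -1), dim=0)   # (C, H*W)
    cur = F.normalize(feat_cur.reshape(C, -1), dim=0)   # (C, H*W)
    sim = cur.t() @ ref                                  # (H*W, H*W) pairwise similarities
    m = mask_ref.reshape(1, -1).bool()                   # reference foreground locations
    # For each current-frame location, aggregate its top-k similarities to
    # reference foreground pixels and to reference background pixels.
    fg = sim.masked_fill(~m, -1.0).topk(top_k, dim=1).values.mean(dim=1)
    bg = sim.masked_fill(m, -1.0).topk(top_k, dim=1).values.mean(dim=1)
    return fg.reshape(H, W), bg.reshape(H, W)

# Example usage with random features (shapes chosen only for illustration):
# fg_map, bg_map = instance_matching(torch.randn(256, 30, 54),
#                                    (torch.rand(30, 54) > 0.8).float(),
#                                    torch.randn(256, 30, 54))
```

In a full pipeline, such similarity maps would typically be fused with current-frame features and passed to the decoder; the specific fusion and the spherical-embedding probability update used by 4G-VOS are not detailed in this abstract.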

Keywords: Video Object Segmentation, Feature transfer and matching, Spherical embedding, Feature refinement, Channel convolutional neural networks, Encoder–decoder

Article history: Received 21 January 2021, Revised 12 July 2021, Accepted 14 August 2021, Available online 30 August 2021, Version of Record 6 September 2021.

Paper URL: https://doi.org/10.1016/j.knosys.2021.107401