Mask encoding: A general instance mask representation for object segmentation

Highlights：

• Encoding can be done with a few dictionary learning methods, including principal component analysis (PCA), sparse coding, and auto-encoders. We integrate this mask representation into Mask R-CNN framework with slight modifications to the model architecture. Our method consistently improves mask AP by 0.9% on the COCO dataset, 1.4% on the LVIS dataset, and 2.1% on the Cityscapes dataset.

• With this mask representation, a new framework is proposed for single shot instance segmentation, by extending FCOS with a mask branch for mask coefficient regression. Our mask encoding is completely independent of the mechanism of detectors, and it could be easily incorporated into other object detectors. Our method holds a significant lead in accuracy compared with other explicit contour-based one-stage frameworks.

• The proposed method is seamlessly extended for video instance segmentation across video frames by adding a vanilla track branch, achieving favourable performance on YouTube-VIS dataset.

摘要

•We propose to encode a two-dimensional binary instance mask into a compact representation vector. The compressed vector, takes advantages of the redundancy in the original mask and proves to be effective and efficient for reconstruction.•Encoding can be done with a few dictionary learning methods, including principal component analysis (PCA), sparse coding, and auto-encoders. We integrate this mask representation into Mask R-CNN framework with slight modifications to the model architecture. Our method consistently improves mask AP by 0.9% on the COCO dataset, 1.4% on the LVIS dataset, and 2.1% on the Cityscapes dataset.•With this mask representation, a new framework is proposed for single shot instance segmentation, by extending FCOS with a mask branch for mask coefficient regression. Our mask encoding is completely independent of the mechanism of detectors, and it could be easily incorporated into other object detectors. Our method holds a significant lead in accuracy compared with other explicit contour-based one-stage frameworks.•The proposed method is seamlessly extended for video instance segmentation across video frames by adding a vanilla track branch, achieving favourable performance on YouTube-VIS dataset.