UAVformer: A Composite Transformer Network for Urban Scene Segmentation of UAV Images

Highlights:

• A novel transformer-based semantic segmentation network with a composite structure backbone is proposed for urban scene segmentation of UAV images.

• Adaptive fusion modules (AFMs) are implemented to fuse the extracted multi-level features.

• An aggregation window multi-head self-attention (AWMSA) mechanism is designed in the transformer block to accurately segment objects of varying scale in UAV images.

• A V-shaped decoder that fully utilises multi-level features is proposed to preserve the boundaries of segmented objects.


Keywords: Urban scenes segmentation, UAV image, Composite backbone, Aggregation windows multi-head self-attention transformer block, V-shaped decoder

Article history: Received 1 December 2021, Revised 26 July 2022, Accepted 4 September 2022, Available online 7 September 2022, Version of Record 12 September 2022.