HU Jie, ZHAO Haitao. Weakly supervised image semantic segmentation based on masked consistency mechanism[J]. Journal of Applied Optics, 2024, 45(4): 741-750. DOI: 10.5768/JAO202445.0402003

Weakly supervised image semantic segmentation based on masked consistency mechanism

Abstract: Semantic segmentation is a computer vision technique widely used in scenarios such as autonomous driving and defect detection, but fine-grained pixel-level annotation is very costly. How to use easily obtained image-level labels for weakly supervised semantic segmentation has therefore been a long-standing research focus. Rather than relying solely on class activation maps (CAMs) to achieve pixel-level segmentation, a masked consistency mechanism (MCM) is proposed to provide additional supervision signals and narrow the gap between full and weak supervision. In fully supervised semantic segmentation, the mask prediction for every image patch receives consistent pixel-level segmentation supervision. Motivated by this, some image patches are masked out in the vision transformer (ViT), and the CAMs generated from the retained patches alone are required to be consistent with the CAMs generated from the complete image, which provides an additional self-supervised signal for network training. Experiments on PASCAL VOC 2012 and MS COCO show that the proposed method outperforms state-of-the-art methods that use the same level of supervision.
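The masked consistency idea described in the abstract can be illustrated with a small, dependency-free Python sketch. The abstract does not state the loss form, the masking ratio, or the network details, so everything below is an assumption made for illustration: `toy_cam` is a hypothetical stand-in for a ViT with a CAM head (it mixes each visible patch with the mean of the visible patches as a crude substitute for self-attention), `random_mask` uses an assumed ratio of 0.5, and `consistency_loss` uses the mean absolute difference over retained patches. It is not the authors' implementation.

```python
import random


def toy_cam(patches, class_weights, visible):
    """Hypothetical stand-in for a ViT + CAM head (an assumption, not the paper's model).

    Each visible patch is averaged with the mean of all visible patches,
    then projected to non-negative per-class activations.
    Masked patches get None.
    """
    dim = len(patches[0])
    n_vis = sum(visible)
    if n_vis == 0:
        raise ValueError("at least one patch must stay visible")
    context = [
        sum(p[d] for p, v in zip(patches, visible) if v) / n_vis
        for d in range(dim)
    ]
    cam = []
    for p, v in zip(patches, visible):
        if not v:
            cam.append(None)
            continue
        mixed = [0.5 * p[d] + 0.5 * context[d] for d in range(dim)]
        cam.append(
            [max(0.0, sum(w[d] * mixed[d] for d in range(dim))) for w in class_weights]
        )
    return cam


def random_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True means the patch is retained."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    masked = set(rng.sample(range(num_patches), int(num_patches * mask_ratio)))
    return [i not in masked for i in range(num_patches)]


def consistency_loss(cam_full, cam_masked, visible):
    """Mean absolute difference between the two CAMs on retained patches (assumed loss form)."""
    total, count = 0.0, 0
    for full, part, v in zip(cam_full, cam_masked, visible):
        if v:
            total += sum(abs(a - b) for a, b in zip(full, part))
            count += len(full)
    return total / count


rng = random.Random(0)
patches = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(16)]       # 16 patch features
class_weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # 3 classes

all_visible = [True] * len(patches)
visible = random_mask(len(patches), 0.5, rng)

cam_full = toy_cam(patches, class_weights, all_visible)   # CAM from the complete image
cam_masked = toy_cam(patches, class_weights, visible)     # CAM from retained patches only
loss = consistency_loss(cam_full, cam_masked, visible)
print(f"consistency loss on retained patches: {loss:.4f}")
```

In a real training setup this term would presumably be added to the image-level classification loss, so that the network is pushed to produce the same activations for a patch whether or not its neighbours are visible; how the two terms are weighted is not stated in the abstract.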

     
