Author Identifier (ORCID)

Seyed Mohammad Jafar Jalali: https://orcid.org/0000-0002-2169-4350

Abstract

Weakly Supervised Semantic Segmentation (WSSS) is a challenging task in computer vision, as it relies on limited supervision to generate precise object localization maps, often using Class Activation Maps (CAMs). Traditional methods struggle with balancing localization accuracy and scalability due to their reliance on fixed network architectures and handcrafted strategies. Neural Architecture Search (NAS), despite its proven success in optimizing network designs across tasks, has not yet been explored in WSSS due to the need for efficient weight sharing. To address these limitations, we propose WEViT, a novel framework that integrates NAS with transformers to optimize network architectures and generate accurate and class-specific object localization maps for WSSS. Our approach leverages the weight entanglement strategy, enabling the supernet to train multiple subnets simultaneously while ensuring high-quality weight inheritance. This eliminates the need for retraining subnets from scratch, significantly reducing computational cost. The best-performing architecture, obtained through the evolutionary algorithm, is then utilized to extract attention weights from transformer heads. These weights are further refined using a Refinement Patch Affinity strategy, effectively removing background noise and enhancing focus on relevant classes in multi-class images. We also incorporate a regularization loss function during training to enhance the generation of class-discriminative localization maps, with experiments highlighting the critical role of transformer layer selection in this process. WEViT achieves state-of-the-art performance on PASCAL VOC 2012 and MS COCO, demonstrating the efficacy of applying NAS to WSSS for the first time and paving the way for scalable, efficient, and accurate segmentation solutions.
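The weight-entanglement idea described above — a supernet whose subnets inherit weights directly rather than being retrained from scratch — can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' implementation: each candidate subnet width simply reuses the leading slice of a single shared weight matrix, so training any sampled subnet updates the same entangled supernet parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared (entangled) supernet weight matrix and bias. Every candidate
# subnet width uses the leading rows/columns of these same parameters,
# so no subnet ever needs its own weights trained from scratch.
max_in, max_out = 64, 128
W = rng.normal(scale=0.02, size=(max_out, max_in))
b = np.zeros(max_out)

def subnet_forward(x, out_features):
    """Forward pass of a sampled subnet that slices the entangled weight."""
    w = W[:out_features, : x.shape[-1]]
    return x @ w.T + b[:out_features]

x = rng.normal(size=(2, max_in))
y_small = subnet_forward(x, 32)    # narrow subnet, output shape (2, 32)
y_full = subnet_forward(x, 128)    # full supernet width, shape (2, 128)
```

Because the narrow subnet's weights are literally a slice of the full supernet's, its outputs coincide with the first 32 channels of the full forward pass — the weight inheritance is exact by construction.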

Keywords

Class activation mapping, class token, evolutionary, neural architecture search, object localization, search space, transformer, weakly supervised semantic segmentation

Document Type

Journal Article

Date of Publication

8-1-2026

Volume

200

PubMed ID

41807900

Publication Title

Neural Networks

Publisher

Elsevier

School

School of Science

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.

Comments

Saeedizadeh, N., Jalali, S. M. J., Khan, B., & Mohamed, S. (2026). WEViT: Weight-entangled vision transformers with class-specific attention for weakly supervised semantic segmentation. Neural Networks, 200, 108768. https://doi.org/10.1016/j.neunet.2026.108768


Link to publisher version (DOI)

10.1016/j.neunet.2026.108768