Edge-Guided Stepwise Dual Kernel Update Network for Panoptic Segmentation


The overall structure of EGSDK-Net is shown in Figure 2. First, we use a backbone composed of convolutional layers and the RTFormer blocks proposed in RTFormer [27] to obtain the basic feature maps. Because RTFormer [27] employs a dual-branch design from stage three onward, the top layer of the backbone outputs a low-resolution and a high-resolution feature map, denoted $R_5^{low}$ and $R_5^{high}$, respectively. We denote the output of the second stage as $R_2$. Since $R_2$ has high resolution and is rich in detail, we use it as the input to the real-time edge guidance module (RTEGM). The edge feature map extracted and supervised by the RTEGM is then applied to $R_5^{low}$ through weighting. We choose $R_5^{low}$ rather than $R_5^{high}$ because $R_5^{low}$ retains substantial semantic information while losing many detailed cues, so enhancing its discrimination of target regions with the edge feature map is more effective. The edge-guided $R_5^{low}$ is then enhanced by DAPPM [28] to improve its ability to perceive multi-scale objects. After upsampling, the output of this process is combined with $R_5^{high}$ and processed by an additional convolutional module to mitigate aliasing effects, yielding the feature map $F$. The stages following the application of the edge feature map constitute the neck of the model. The output $F$ of the neck is fed into the head, which consists of the stepwise dual kernel update module (SDKUM), to obtain the model's mask predictions and class probability predictions.
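To make the data flow concrete, the PyTorch-style sketch below outlines the neck described above. The backbone, RTEGM, and DAPPM are treated as opaque sub-modules; the exact forms of the edge weighting and of the fusion with $R_5^{high}$ (sigmoid modulation and element-wise addition here), as well as the channel widths, are illustrative assumptions rather than the precise operations of EGSDK-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EGSDKNeck(nn.Module):
    """Sketch of the edge-guided neck: R2 -> RTEGM edge map -> weight R5_low,
    then DAPPM, upsampling, fusion with R5_high, and an anti-aliasing conv."""

    def __init__(self, backbone, rtegm, dappm, channels=128):
        super().__init__()
        self.backbone = backbone   # RTFormer-based backbone [27]
        self.rtegm = rtegm         # real-time edge guidance module
        self.dappm = dappm         # multi-scale context module [28]
        # additional convolution to mitigate aliasing after fusion
        self.fuse_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # r2: detail-rich stage-2 features; r5_low / r5_high: top-stage outputs
        r2, r5_low, r5_high = self.backbone(x)

        # Edge map predicted (and supervised with the edge loss) by the RTEGM
        edge = self.rtegm(r2)                                # assumed (B, 1, H2, W2)
        edge_lr = F.interpolate(edge, size=r5_low.shape[-2:],
                                mode='bilinear', align_corners=False)

        # Assumed weighting: modulate the semantic branch with the edge map
        r5_low = r5_low * (1.0 + torch.sigmoid(edge_lr))

        # Multi-scale enhancement, upsampling, and fusion with R5_high
        ctx = self.dappm(r5_low)
        ctx = F.interpolate(ctx, size=r5_high.shape[-2:],
                            mode='bilinear', align_corners=False)
        feat = self.fuse_conv(ctx + r5_high)   # feature map F, fed to the SDKUM head
        return feat, edge
```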

Considering that the kernels convolve the features to produce mask predictions, additional supervision of the feature map $F$ is required to ensure accurate masks. Following [5], we use the auxiliary loss function

$$ L_{aux} = \omega_{rank} L_{rank} + \omega_{seg} L_{seg} + \omega_{disc} L_{disc}, $$

where $\omega_{rank} = 0.1$, $\omega_{seg} = 1.0$, and $\omega_{disc} = 1.0$ are balancing factors weighting $L_{rank}$, $L_{seg}$, and $L_{disc}$, consistent with the design in RT-K-Net. $L_{rank}$ denotes the mask-ID cross-entropy loss, $L_{seg}$ the cross-entropy loss, and $L_{disc}$ the contrastive loss introduced by RT-K-Net [5].
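A minimal sketch of this weighted sum, assuming the three terms have already been computed as scalar tensors:

```python
def auxiliary_loss(l_rank, l_seg, l_disc,
                   w_rank=0.1, w_seg=1.0, w_disc=1.0):
    """Auxiliary supervision on the feature map F, following RT-K-Net [5].

    l_rank: mask-ID cross-entropy loss
    l_seg:  cross-entropy loss
    l_disc: contrastive loss from RT-K-Net
    """
    return w_rank * l_rank + w_seg * l_seg + w_disc * l_disc
```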

The overall loss function of EGSDK-Net is formulated as

$$ L_{total} = \omega_{mask} L_{mask} + \omega_{dice} L_{dice} + \omega_{cls} L_{cls} + \omega_{edge} L_{edge} + L_{aux}, $$

where $\omega_{mask} = 1.0$, $\omega_{dice} = 4.0$, $\omega_{cls} = 2.0$, and $\omega_{edge} = 1.0$ are the balancing factors of the overall loss function. $\omega_{mask}$, $\omega_{dice}$, and $\omega_{cls}$ follow the design of RT-K-Net, while $\omega_{edge}$ is set by us. $L_{mask}$ is the binary cross-entropy loss, $L_{dice}$ the dice loss, $L_{cls}$ the focal loss, and $L_{edge}$ the balanced cross-entropy loss. $L_{mask}$, $L_{dice}$, and $L_{cls}$ are applied as in K-Net [2] and RT-K-Net [5] to improve the guidance of stuff and thing masks during training. Moreover, we adopt the same training and inference optimizations, post-processing, and instance-based cropping augmentation as RT-K-Net [5]; please refer to [5] for details of these steps.
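Analogously, the overall objective can be assembled as follows, again assuming the individual terms are precomputed scalars:

```python
def total_loss(l_mask, l_dice, l_cls, l_edge, l_aux,
               w_mask=1.0, w_dice=4.0, w_cls=2.0, w_edge=1.0):
    """Overall EGSDK-Net training objective.

    l_mask: binary cross-entropy on mask predictions
    l_dice: dice loss
    l_cls:  focal loss on class predictions
    l_edge: balanced cross-entropy on the RTEGM edge map
    l_aux:  auxiliary loss defined above
    """
    return (w_mask * l_mask + w_dice * l_dice
            + w_cls * l_cls + w_edge * l_edge + l_aux)
```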


