The overall structure of EGSDK-Net is shown in Figure 2. First, we utilize a backbone composed of the convolutional layers and RTFormer blocks proposed in RTFormer [27] to obtain the basic feature maps. Since RTFormer [27] employs a dual-branch design starting from stage three, the top layer of the backbone outputs a low-resolution and a high-resolution feature map, denoted as $F_{low}$ and $F_{high}$, respectively. We denote the output of the second stage as $F_2$. Since $F_2$ has a high resolution and is rich in detailed information, we use it as the input to the real-time edge guidance module (RTEGM). The edge feature map extracted and supervised by the RTEGM is subsequently applied to $F_{low}$ as a weighting. We choose $F_{low}$ rather than $F_{high}$ because $F_{low}$ retains a significant amount of semantic information but loses many detailed cues, so enhancing its discrimination of target regions with the edge feature map is more effective.
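To make this weighting concrete, the sketch below shows one plausible PyTorch realization, assuming the RTEGM predicts a single-channel edge map from $F_2$ that, after resizing, gates $F_{low}$ through a residual sigmoid weighting; the module structure, channel arguments, and gating form are illustrative assumptions rather than the exact RTEGM design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeGuidance(nn.Module):
    """Minimal sketch of edge extraction and weighted application.

    Hypothetical layer names and shapes; the paper only states that an
    edge map extracted from the stage-2 feature F_2 re-weights the
    low-resolution feature F_low.
    """

    def __init__(self, c2: int):
        super().__init__()
        # Lightweight edge head on the high-resolution stage-2 feature.
        self.edge_head = nn.Sequential(
            nn.Conv2d(c2, c2, 3, padding=1, bias=False),
            nn.BatchNorm2d(c2),
            nn.ReLU(inplace=True),
            nn.Conv2d(c2, 1, 1),  # single-channel edge logits
        )

    def forward(self, f2: torch.Tensor, f_low: torch.Tensor):
        edge_logits = self.edge_head(f2)                  # (B, 1, H2, W2)
        # Resize the edge map to F_low's (lower) resolution.
        edge_small = F.interpolate(
            edge_logits, size=f_low.shape[-2:],
            mode="bilinear", align_corners=False)
        # Weighted application: gate F_low with the edge probability,
        # keeping a residual path so non-edge regions are not zeroed out.
        f_low_guided = f_low * (1.0 + torch.sigmoid(edge_small))
        # edge_logits is additionally supervised against ground-truth edges.
        return f_low_guided, edge_logits
```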
The edge-guided $F_{low}$ is then enhanced through DAPPM [28] to improve its ability to perceive multi-scale objects. After upsampling, the output of this process is combined with $F_{high}$ and processed through an additional convolutional module to mitigate aliasing effects, yielding the feature map $F$. The stage that follows the application of the edge feature map represents the neck of the entire model.
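Under the same assumptions, a minimal sketch of the neck is given below; DAPPM is treated as a black-box module following [28], and channel-wise concatenation followed by a 3x3 convolution stands in for the unspecified fusion and anti-aliasing step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Neck(nn.Module):
    """Sketch of the neck: DAPPM on the edge-guided low-resolution
    feature, upsampling, fusion with F_high, and a smoothing conv.

    `dappm` stands in for the DAPPM module of [28]; channel sizes and
    the concat-based fusion are illustrative assumptions.
    """

    def __init__(self, dappm: nn.Module, c_high: int, c_out: int):
        super().__init__()
        self.dappm = dappm
        # Convolutional module applied after fusion to reduce aliasing.
        self.fuse = nn.Sequential(
            nn.Conv2d(c_out + c_high, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_low_guided: torch.Tensor, f_high: torch.Tensor):
        x = self.dappm(f_low_guided)                      # multi-scale context
        x = F.interpolate(x, size=f_high.shape[-2:],
                          mode="bilinear", align_corners=False)
        # Combine with the high-resolution branch and smooth the result,
        # yielding the feature map F fed to the head.
        return self.fuse(torch.cat([x, f_high], dim=1))
```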
The output $F$ from the neck is then fed into the head, which consists of the stepwise dual kernel update module (SDKUM), to obtain the model's mask predictions and class probability predictions. Since the kernels convolve the features to produce the mask predictions, additional supervision of the feature map $F$ is required to ensure accurate masks. Following [5], we use the auxiliary loss function as follows:
$$\mathcal{L}_{aux} = \lambda_{mask\text{-}id}\,\mathcal{L}_{mask\text{-}id} + \lambda_{ce}\,\mathcal{L}_{ce} + \lambda_{cont}\,\mathcal{L}_{cont},$$
where $\lambda_{mask\text{-}id}$, $\lambda_{ce}$, and $\lambda_{cont}$ are the balancing factors used in the auxiliary loss function to balance $\mathcal{L}_{mask\text{-}id}$, $\mathcal{L}_{ce}$, and $\mathcal{L}_{cont}$, consistent with the design in RT-K-Net [5]. $\mathcal{L}_{mask\text{-}id}$ represents the mask-ID cross-entropy loss, $\mathcal{L}_{ce}$ denotes the cross-entropy loss, and $\mathcal{L}_{cont}$ is the contrastive loss introduced by RT-K-Net [5]. The overall loss function of EGSDK-Net can be formulated as follows:
$$\mathcal{L} = \lambda_{bce}\,\mathcal{L}_{bce} + \lambda_{dice}\,\mathcal{L}_{dice} + \lambda_{focal}\,\mathcal{L}_{focal} + \lambda_{edge}\,\mathcal{L}_{edge} + \mathcal{L}_{aux},$$
where $\lambda_{bce}$, $\lambda_{dice}$, $\lambda_{focal}$, and $\lambda_{edge}$ are the balancing factors for the overall loss function. $\lambda_{bce}$, $\lambda_{dice}$, and $\lambda_{focal}$ are consistent with the design in RT-K-Net [5], while $\lambda_{edge}$ is set by us. $\mathcal{L}_{bce}$ refers to the binary cross-entropy loss, $\mathcal{L}_{dice}$ represents the Dice loss, $\mathcal{L}_{focal}$ denotes the focal loss, and $\mathcal{L}_{edge}$ uses the balanced cross-entropy loss function. The applications of $\mathcal{L}_{bce}$, $\mathcal{L}_{dice}$, and $\mathcal{L}_{focal}$ are consistent with those in K-Net [2] and RT-K-Net [5], improving the guidance of stuff masks and thing masks during training. Moreover, we adopt the same training and inference optimizations, post-processing methods, and instance-based cropping augmentation as RT-K-Net [5]; please refer to [5] for more information on these steps.
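As $\mathcal{L}_{edge}$ is the component we introduce on top of RT-K-Net's losses, the sketch below illustrates a class-balanced binary cross-entropy for edge supervision; the HED-style weighting shown is one common choice and is our assumption, since the exact balancing scheme is not detailed in this section.

```python
import torch
import torch.nn.functional as F


def balanced_bce(edge_logits: torch.Tensor, edge_gt: torch.Tensor):
    """Sketch of a class-balanced BCE for edge supervision.

    Edge pixels are rare, so positives are weighted by the fraction of
    negatives and negatives by the fraction of positives; this HED-style
    scheme is an assumption, not necessarily the authors' exact choice.
    """
    pos = edge_gt.float().sum()
    neg = edge_gt.numel() - pos
    beta = neg / (pos + neg)                        # fraction of non-edge pixels
    # Per-pixel weights: beta for edge pixels, (1 - beta) for the rest.
    weight = torch.where(edge_gt > 0.5, beta, 1.0 - beta)
    return F.binary_cross_entropy_with_logits(
        edge_logits, edge_gt.float(), weight=weight)
```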