4.1. Dataset Description and Experimental Design
1. SeaDronesSee Dataset and MOBDrone Dataset
The SeaDronesSee dataset serves as a benchmark for visual object detection and tracking, aiming to bridge the gap between land-based and sea-based vision systems. The dataset contains over 54,000 frames and approximately 400,000 instances, captured by UAVs at altitudes ranging from 5 to 260 m and camera angles from 0 to 90 degrees, and it supports multi-object and multi-scale detection. While this diversity makes the dataset rich, it also increases detection difficulty and challenges the performance of detection models. The subset used in this study includes 5630 images and 24,131 targets, with imbalanced samples re-labeled while maintaining the original dataset partition used in the competition. The training set includes 2975 images (52.9%), the validation set comprises 859 images (15.3%), and the test set includes 1796 images (31.9%), ensuring a fair comparison.
Figure 5 illustrates the quantity and size of each category. Figure 5a presents a bar chart showing the number of samples for each category (e.g., swimmer, floater, boat). Figure 5b displays a scatter density plot of the target size distribution, where the data points are concentrated in the lower-left corner, indicating that most targets in the dataset have small widths and heights typical of small object detection.
The MOBDrone dataset is a large-scale collection of aerial imagery focused on maritime environments, specifically capturing images under varying heights, camera angles, and lighting conditions. This dataset comprises 126,170 images extracted from 66 video clips recorded by a UAV flying at altitudes ranging from 10 to 60 m. The images are annotated with over 180,000 bounding boxes to label five object categories: humans, boats, lifebuoys, surfboards, and wood. Particular emphasis is placed on annotating humans in water to simulate scenarios requiring rescue operations for individuals overboard. To ensure the dataset’s diversity and representativeness, 8000 images were randomly sampled for experiments in this study. The chosen samples were divided into training, validation, and testing subsets in a ratio of 7:2:1, which is conducive to the construction and assessment of models tailored for maritime applications.
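As a concrete illustration of this sampling and splitting procedure, the following minimal Python sketch reproduces the 7:2:1 division; the file-naming scheme and random seed are hypothetical placeholders, since the paper does not specify the exact sampling code.

```python
import random

# Minimal sketch of the 8000-image sampling and 7:2:1 split described above.
# File names and the seed are hypothetical, not the authors' actual setup.
random.seed(0)

all_frames = [f"mobdrone_{i:06d}.jpg" for i in range(126_170)]  # full MOBDrone frame pool
sampled = random.sample(all_frames, 8_000)                       # randomly sampled subset

n_train = int(0.7 * len(sampled))            # 5600 training images
n_val = int(0.2 * len(sampled))              # 1600 validation images
train_set = sampled[:n_train]
val_set = sampled[n_train:n_train + n_val]
test_set = sampled[n_train + n_val:]         # remaining 800 test images
```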
2. Experimental Setup and Configuration
The CFSD-UAVNet method was implemented using the PyTorch framework. The deep learning environment ran on Ubuntu 18.04 with 15 vCPUs (AMD EPYC 7642 48-core processor), 80 GB of memory, and an RTX 3090 GPU with 24 GB of VRAM. The remaining experimental settings are listed in Table 1.
3. Evaluation Metrics
The evaluation metrics in this study include standard object detection benchmarks to assess the performance of CFSD-UAVNet. The key metrics are as follows.
Precision (P): Reflects the proportion of correctly identified positive samples out of all predicted positive samples, calculated using true positives (TPs) and false positives (FPs), as expressed in the following formula:
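$$P = \frac{TP}{TP + FP}$$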
Recall (R): Indicates the ratio of true positives to the total number of actual positive samples derived from TPs and false negatives (FNs), as expressed in the following formula:
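$$R = \frac{TP}{TP + FN}$$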
Average Precision (AP): Represents the mean detection accuracy for a specific class, calculated as the area under the Precision–Recall (P-R) curve. A higher AP value signifies superior detection performance. It is expressed by the following formula:
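$$AP = \int_{0}^{1} P(R)\,dR$$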
Mean Average Precision (mAP): Aggregates the AP across all classes, measuring the model’s overall detection accuracy. It is expressed by the following formula:
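$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where $N$ denotes the number of object classes and $AP_i$ is the average precision of the $i$-th class.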
Additional practical metrics include the following.
Floating-Point Operations (FLOPs): Quantifies a model’s computational complexity by calculating the total number of floating-point operations performed.
Parameter Count (Param): Indicates the total number of parameters in the model, reflecting its size and efficiency; a minimal sketch for measuring Param and FLOPs is given after this list.
Detection Speed: Measures the system’s ability to identify and locate targets within a given unit of time.
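The following minimal PyTorch sketch shows one common way to obtain Param and FLOPs for a detection model; the use of the third-party thop profiler and the 640 × 640 input size are assumptions for illustration, not the authors' measurement code.

```python
import torch
from thop import profile  # third-party FLOPs profiler; an assumed choice, not the authors' tool

def count_params_millions(model: torch.nn.Module) -> float:
    """Total trainable parameters, reported in millions (M)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def count_flops_giga(model: torch.nn.Module, img_size: int = 640) -> float:
    """Approximate FLOPs (in G) for one forward pass at an assumed 640x640 input."""
    dummy = torch.zeros(1, 3, img_size, img_size)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)
    return 2 * macs / 1e9  # FLOPs approximated as 2x the multiply-accumulate count
```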
4.2. Model Performance Evaluation
4.2.1. Effectiveness of Attention in the CRE
To assess the performance of the EMA attention mechanism in the proposed CRE module, several alternative attention mechanisms, including SE, CBAM [32], SimAM [33], and GAM [34], were also evaluated. The results of the experiments are presented in Table 2.
It is evident that the different attention mechanisms affect the CRE module to varying degrees. In terms of key performance indicators, EMA consistently performs well: its mAP@50 reaches 71.5% and its mAP@95 reaches 42.4%, both the highest among all tested attention mechanisms. This result not only highlights EMA’s precision in detecting small objects but also reflects its robustness in handling complex scenarios.
A further analysis of EMA’s performance reveals a precision (P) of 80.8% and a recall (R) of 63.9%. In terms of model lightweighting, EMA also performs well: its parameter count (Params) and floating-point operations (FLOPs) are relatively low. Compared with SE, CBAM, SimAM, and GAM, the advantages of EMA become even more apparent; although each of these mechanisms has its own strengths, none outperforms EMA in overall performance. Through its distinctive channel and spatial feature recalibration strategy, EMA effectively enhances feature representation, improving the model’s capacity to detect and recognize small objects. Consequently, EMA was chosen as the attention mechanism for the CRE module.
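To make the channel and spatial recalibration idea concrete, the simplified module below sketches the general pattern; it is only an illustrative stand-in and not the actual EMA module used in the CRE, whose structure follows the cited EMA work.

```python
import torch
import torch.nn as nn

class RecalibrationAttention(nn.Module):
    """Simplified channel-and-spatial recalibration block.

    A schematic stand-in to illustrate the recalibration idea discussed above;
    it is NOT the EMA module used in the CRE.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel recalibration: squeeze spatial dims, then excite channels (SE-style).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial recalibration: compress channels and produce a per-pixel gate.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # reweight channels
        x = x * self.spatial_gate(x)   # reweight spatial positions
        return x
```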
4.2.2. Effectiveness of Loss Function
To assess the performance of different loss functions, five loss functions with distinct formulations were selected for detailed evaluation: CIoU, EIoU [35], SIoU [36], DIoU, and WIoU. The primary goal of this evaluation is to determine how the loss function affects detection accuracy and precision in maritime object detection. The experimental findings are summarized in Table 3.
From the results, it is evident that the WIoU loss function outperforms the other loss functions in accuracy across each category. Among its variants, WIoUv2 delivered the largest improvement, achieving an mAP@50 of 70.6%. This indicates that the WIoUv2 loss function is well suited to the SeaDronesSee dataset, with good performance in small object detection in maritime environments. By capturing target features more accurately, the WIoUv2 loss function effectively enhances detection accuracy while reducing the risks of false positives and false negatives.
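For intuition, the sketch below implements a plain IoU loss with a monotonic focusing factor in the spirit of WIoU v2; the exponent gamma and the running-mean term are illustrative placeholders, and the exact WIoUv2 formulation follows the cited reference rather than this simplified code.

```python
import torch

def iou_loss_with_focusing(pred: torch.Tensor,
                           target: torch.Tensor,
                           gamma: float = 0.5,
                           running_mean_iou_loss: float = 0.5) -> torch.Tensor:
    """Sketch of an IoU loss with a monotonic focusing factor (WIoU v2-style idea).

    `pred` and `target` are (N, 4) boxes in (x1, y1, x2, y2) format. `gamma` and
    `running_mean_iou_loss` are hypothetical values used only for illustration.
    """
    # Intersection
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]

    # Union
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter + 1e-7

    iou_loss = 1.0 - inter / union

    # Monotonic focusing: reweight each box relative to the running mean loss.
    focus = (iou_loss.detach() / running_mean_iou_loss) ** gamma
    return (focus * iou_loss).mean()
```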
4.4. Ablation Experiment
To assess the contribution of the modules introduced in the model, ablation experiments were conducted on the SeaDronesSee dataset using YOLOv8n as the baseline algorithm. The results of the ablation experiments for the various modules are presented in Table 4, enabling a detailed comparison of each module’s contribution to the model’s performance.
According to the data in Table 4, the incorporation of the PHead and CRE modules mainly enhanced the model’s detection accuracy while decreasing the number of parameters; both precision and recall also increased. The incorporation of the CRE and CED modules decreased the parameter count, while the average precision improved by 3.5% and 1.2%, respectively. Similarly, the addition of the WIoUv2 loss function helped balance positive and negative samples, thereby enhancing the model’s robustness. These experimental results demonstrate the effectiveness of these modules in reducing the model’s parameter count while preserving the precision of maritime small object detection, enabling the maritime rescue system to process data efficiently.
Table 5 presents the ablation experiment results for the stepwise integration of modules. It is evident that adding the CRE module to the PHead effectively merges the efficient Vision Transformer architecture with a lightweight CNN [37]. This integration lowers the parameter count while increasing the model’s mAP@50 by 2.3%. Building on this, the introduction of the CED module further enhances performance, with mAP@50 increasing by 0.8% and mAP@95 by 0.5%; the recall (R) metric improves by 3.1%, while the model’s parameter count is reduced further. Finally, the inclusion of the WIoUv2 loss function improves performance without increasing computational complexity or model size, indicating that WIoUv2 helps the network detect small objects in complex environments and thus enhances the model’s generalization ability and accuracy.
In summary, the CFSD-UAVNet model achieved an mAP@50 of 80.1% and an mAP@95 of 46.3%, representing improvements of 12.1% and 7.2% over the baseline algorithm. Additionally, precision and recall improved by 5.4% and 16.7% over the baseline, indicating that the model is more accurate in classification and better at balancing false positives and false negatives, particularly in small object detection. Furthermore, the parameter count was reduced by 1.6 M compared with the baseline algorithm, allowing high detection accuracy with fewer computational resources. Therefore, the CFSD-UAVNet model improves detection performance while simultaneously lowering the number of parameters, improving its applicability in resource-constrained environments.
Figure 7 illustrates the comparison of precision, recall, mAP@50, and mAP@95 for the ablation experiments in the UAV-based maritime small object detection task. As shown in Figure 7a, the precision of CFSD-UAVNet is consistently higher than that of the other configurations throughout the training process, especially in the early stages. Figure 7b displays the recall comparison, where CFSD-UAVNet consistently outperforms the other configurations, with a gap emerging in the later stages of training. In Figure 7c, comparing mAP@50, CFSD-UAVNet initially lags behind some configurations but rapidly surpasses them and maintains a leading position thereafter. Figure 7d shows the mAP@95 comparison, where CFSD-UAVNet also performs well, surpassing the other configurations after the mid-training phase. By incorporating the PHead, CRE, and CED improvements, CFSD-UAVNet leads on all metrics, validating the proposed strategies. Compared with the baseline, CFSD-UAVNet achieves higher precision and recall earlier in training and maintains stable performance, demonstrating improved detection accuracy and robustness for UAV-based maritime small object detection.
The ablation experiment results comparing the baseline and CFSD-UAVNet models on the SeaDronesSee dataset are visualized in Figure 8. In Figure 8a, the CFSD-UAVNet algorithm performs well on the challenging task of multi-class object recognition, accurately distinguishing the various targets with enhanced recognition accuracy and robustness, whereas the baseline model exhibits missed detections. In the high-altitude images shown in Figure 8b, CFSD-UAVNet handles the dual challenges of complex sea surface textures and blurred small-target edges, precisely capturing the floater category with improved recognition accuracy, while the baseline fails to detect the floater category. In the UAV-perspective images presented in Figure 8c, the baseline incorrectly classifies the floater category as swimmer, whereas CFSD-UAVNet correctly identifies all targets without false positives or missed detections, showcasing its recognition capability for specific targets. Combining the results from Figure 8a–c, it is evident that CFSD-UAVNet excels across varying heights and image clarity conditions, demonstrating its performance and adaptability for small object detection in complex maritime environments. However, further investigation is required to evaluate its performance under varying lighting and weather conditions.
4.5. Comparison Experiment on SeaDronesSee Dataset
To evaluate the impact of the attention mechanism within the proposed CRE module, we visualized the detection results of five models, YOLOv3, YOLOv3-tiny, YOLOv5, YOLOv8, and CFSD-UAVNet, using heatmaps, as shown in Figure 9. Specifically, Figure 9a represents the ground truth, Figure 9b is the heatmap of YOLOv3-tiny, Figure 9c is the heatmap of YOLOv5, Figure 9d is the heatmap of YOLOv8, Figure 9e is the heatmap of YOLOv3, and Figure 9f is the heatmap of CFSD-UAVNet. In these heatmaps, the intensity of the color indicates the model’s level of attention to specific regions of the image during object detection. It can be observed that YOLOv3-tiny, YOLOv5, YOLOv3, and YOLOv8 pay little or no attention to the two floater-class objects within the red bounding box, whereas CFSD-UAVNet accurately focuses on these small maritime targets. Additionally, as illustrated in Figure 9b, YOLOv3-tiny erroneously directs attention to background regions of the sea (in the yellow box), whereas CFSD-UAVNet effectively mitigates such background interference. The integration of the EMA attention mechanism enhances the representation of object features, thereby improving the model’s perceptual capability for these targets.
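For reference, activation heatmaps of the kind shown in Figure 9 can be produced with a forward hook on a chosen layer, as in the generic sketch below; this is one possible visualization approach (Grad-CAM-style tools are another) and is not necessarily the exact method used to generate Figure 9.

```python
import torch
import torch.nn.functional as F

def feature_heatmap(model: torch.nn.Module,
                    image: torch.Tensor,
                    target_layer: torch.nn.Module) -> torch.Tensor:
    """Generic attention-style heatmap from the mean activation magnitude of a layer.

    `image` is a (1, 3, H, W) tensor; `target_layer` is any module inside `model`.
    """
    activations = {}

    def hook(_module, _inputs, output):
        activations["feat"] = output.detach()

    handle = target_layer.register_forward_hook(hook)
    with torch.no_grad():
        model(image)                                   # run one forward pass
    handle.remove()

    feat = activations["feat"].abs().mean(dim=1, keepdim=True)       # (1, 1, h, w)
    heat = F.interpolate(feat, size=image.shape[-2:],
                         mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-7)    # normalize to [0, 1]
    return heat[0, 0]  # (H, W) map to overlay on the input image
```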
To evaluate the detection capability of the CFSD-UAVNet algorithm, it was compared with current mainstream algorithms on the SeaDronesSee dataset, including Faster R-CNN, DETR [36], YOLOv3, YOLOv3-tiny, YOLOv5, YOLOv6, YOLOX, and YOLOv8. The evaluation metrics selected for comparison included mAP@50, mAP@95, the parameter count, the computational complexity, the recall rate, and the detection speed. For each method, the experimental settings remained largely consistent, with the same dataset and scale, ensuring fairness in the comparison. The comparative results in Table 6 show that CFSD-UAVNet delivers strong performance on this dataset, combining low computational complexity and parameter count while outperforming the other methods in accuracy and overall effectiveness.
Figure 10 presents the comparison of mAP@50 and mAP@95 during the training process for YOLOv3-tiny, YOLOv3, YOLOv5, YOLOX, YOLOv8, and the proposed CFSD-UAVNet algorithm. As shown in Figure 10a, in the mAP@50 comparison, the CFSD-UAVNet curve consistently lies above those of the other methods, especially during the early stages of training, showing its efficiency in recognizing small targets. Figure 10b shows the mAP@95 comparison, where CFSD-UAVNet maintains a leading position throughout, with the gap widening during the later stages of training, indicating its advantage in high-precision object detection. Combining both figures, it is evident that CFSD-UAVNet consistently maintains a high mAP value throughout training, excelling in both lower-threshold generalization (mAP@50) and higher-precision target recognition (mAP@95). This demonstrates the superiority of CFSD-UAVNet in UAV-based maritime small object detection tasks, where it ensures high detection accuracy while providing faster convergence and higher detection quality. Compared with traditional YOLO-based algorithms, CFSD-UAVNet shows improvements in both mAP@50 and mAP@95, reflecting that our modifications effectively enhance the model’s detection capability and stability.
4.6. Comparison Experiment on MOBDrone Dataset
To assess the proposed algorithm’s generalization ability, comparative tests were run on the MOBDrone dataset against state-of-the-art algorithms. The algorithms and metrics chosen for evaluation mirrored those employed in the SeaDronesSee dataset analysis. As shown in Table 7, the results demonstrate that the CFSD-UAVNet algorithm performed remarkably well on this dataset, surpassing the competing algorithms in small object detection accuracy and overall model capability.
Figure 11a,b illustrate the comparison curves of mAP@50 and mAP@95, respectively, during the training process for the selected algorithms. The x-axis indicates the number of training epochs, and the y-axis shows the corresponding detection metric. CFSD-UAVNet surpasses the other algorithms in terms of mAP@50. As training progresses, the performance of all algorithms steadily increases and approaches a common level; notably, CFSD-UAVNet achieves higher mAP@50 values after convergence than the other algorithms, demonstrating its superiority in accuracy. For the more stringent mAP@95 metric, CFSD-UAVNet also performs well, achieving higher final precision than its counterparts. Particularly in the later stages of training, the performance gap between CFSD-UAVNet and the other algorithms widens, highlighting its robustness in high-standard detection tasks. Overall, CFSD-UAVNet achieves strong detection accuracy on both key metrics in UAV-based maritime object detection, outperforming common YOLO-series algorithms.
4.7. Analysis of Visualization Results
The visualization results of the comparative experiments conducted on the SeaDronesSee dataset are presented in Figure 12.
Figure 12a illustrates a scenario with high brightness, complex textures, and background interference in the marine environment. The image contains one boat and two floaters. While all algorithms successfully detect the boat, Faster-RCNN, YOLOv3-tiny, YOLOv5, and YOLOv8 fail to detect the floaters, likely due to their limited ability to extract key features of small objects in this challenging background. YOLOv3 detects only one floater, highlighting its insufficient capability in local feature extraction for small objects. Although DETR detects the targets, it misclassifies the categories, which may be attributed to its limited discrimination ability for target semantics. In contrast, the CFSD-UAVNet algorithm successfully detects and classifies all expected targets, demonstrating superior detection completeness and accuracy in complex backgrounds.
Figure 12b depicts a low-brightness scenario with a darker sea surface, where blurred target boundaries and low contrast increase the detection challenge, particularly for closely positioned targets. The image includes one boat, two floaters, and two swimmers. DETR, YOLOv3-tiny, and YOLOv5 exhibit category misclassification errors for the floater and swimmer classes, likely due to their weak ability to distinguish blurred target features in low-light conditions. Faster-RCNN shows overlapping bounding boxes for the floater category, which may result from interference in boundary feature extraction by its feature extraction network. YOLOv3 and YOLOv8 detect only one floater, indicating their limitations in detecting weak-feature targets in multi-object scenarios. In contrast, CFSD-UAVNet correctly detects and classifies all targets, further validating its robustness and adaptability to blurred boundaries in low-brightness and complex backgrounds.
In summary, CFSD-UAVNet demonstrates significant advantages in handling complex maritime scenarios and varying lighting conditions, whereas the detection errors of the other algorithms reveal their limitations in feature extraction, target differentiation, and background interference suppression.
The comparative experiment visualizations on the MOBDrone dataset are illustrated in Figure 13.
Figure 13a illustrates small object detection under complex natural backgrounds at high altitudes using UAVs. In the image, the drowning person is located near coastal rocks, with textures similar to those of the rocks and ocean waves, making the detection task highly challenging. Except for the CFSD-UAVNet algorithm, none of the other algorithms successfully detected the drowning person, indicating their insufficient robustness in this scenario and their weaker ability to recognize small objects against such challenging backgrounds.
Figure 13b presents the detection of a distant small object over the open sea. The target’s small size and low visibility against the background compound the challenge of detection. Additionally, due to the wave fluctuations, the target’s edges are blurred, requiring detection algorithms to exhibit strong robustness. Notably, the detection bounding box for the target should be positioned above and around the head of the drowning person. The DETR algorithm’s bounding box, however, is positioned at the hand of the person, while YOLOv8 places the bounding box at the feet of the person, failing to fully detect the individual. This may be attributed to the incomplete feature extraction of small objects by these algorithms. Faster-RCNN, YOLOv3-tiny, and YOLOv5 all fail to detect the target, potentially due to susceptibility to wave interference during distant small object detection. Although YOLOv3 detects the target, it misclassifies the drowning person as a boat, likely due to its insufficient representation of small object features, especially against complex backgrounds where targets and similarly textured regions are easily confused, resulting in misclassification. Only CFSD-UAVNet successfully detects the drowning person, demonstrating its exceptional capabilities in feature extraction and object recognition in complex scenarios.