Entropy, Vol. 27, Pages 392: GBsim: A Robust GCN-BERT Approach for Cross-Architecture Binary Code Similarity Analysis


Entropy doi: 10.3390/e27040392

Authors:
Jiang Du
Qiang Wei
Yisen Wang
Xingyu Bai

Recent advances in graph neural networks (GNNs) have transformed structural pattern learning in domains ranging from social network analysis to biomolecular modeling. Nevertheless, practical deployments in mission-critical scenarios such as binary code similarity detection face two fundamental obstacles: first, graph construction is inherently noisy, as exemplified by incomplete control-flow edges produced during binary function recovery; second, instruction set variations across architectures cause substantial distribution discrepancies. Conventional GNN architectures degrade severely under such low signal-to-noise conditions and cross-domain operating environments, particularly in security-sensitive vulnerability identification tasks, where feature instability or domain shift can trigger critical misjudgments. To address these challenges, we propose GBsim, a novel approach that combines graph neural networks with natural language processing. GBsim uses a cross-architecture language model to transform binary functions into semantic graphs, applies a multilayer GCN for structural feature extraction, and uses a Transformer layer to integrate semantic information, generating robust cross-architecture embeddings that maintain high performance despite significant distribution shifts. Extensive experiments on a large-scale cross-architecture dataset show that GBsim achieves an MRR of 0.901 and a Recall@1 of 0.831, outperforming state-of-the-art methods. In real-world vulnerability detection, GBsim achieves an average recall rate of 81.3% on a 1-day vulnerability dataset, outperforming existing methods by 2.1% and demonstrating its practical effectiveness in identifying security threats. This performance advantage stems from GBsim’s ability to maximize information preservation across architectural boundaries, which enhances model robustness in the presence of noise and distribution shifts.
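For readers unfamiliar with the two retrieval metrics reported above, the following is a minimal sketch of how MRR and Recall@K are conventionally computed for function-similarity search. It is not the authors' evaluation code: the helper names (`rank_of_match`, `mrr`, `recall_at_k`), the toy similarity scores, and the optimistic tie-breaking rule are all assumptions made for illustration.

```python
from typing import List, Sequence


def rank_of_match(scores: Sequence[float], true_index: int) -> int:
    """Return the 1-based rank of the true candidate by descending similarity.

    Assumption: ties are resolved in favor of the true match (optimistic).
    """
    true_score = scores[true_index]
    # Every candidate scoring strictly higher pushes the true match down one rank.
    return 1 + sum(1 for s in scores if s > true_score)


def mrr(ranks: Sequence[int]) -> float:
    """Mean reciprocal rank: the average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: Sequence[int], k: int = 1) -> float:
    """Fraction of queries whose true match appears within the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy data: each query is (similarity scores over the candidate
# pool, index of the true matching function in that pool).
queries = [
    ([0.9, 0.2, 0.1], 0),
    ([0.3, 0.8, 0.5], 2),
    ([0.6, 0.7, 0.1, 0.9], 0),
    ([0.1, 0.95, 0.4], 1),
]

ranks: List[int] = [rank_of_match(scores, idx) for scores, idx in queries]

print("ranks:", ranks)
print("MRR:", round(mrr(ranks), 4))
print("Recall@1:", recall_at_k(ranks, k=1))
```

Under these definitions, an MRR of 0.901 together with a Recall@1 of 0.831 would indicate that the true match is ranked first for most queries and is usually close to the top when it is not.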


