Sensors, Vol. 25, Pages 6163: VLA-MP: A Vision-Language-Action Framework for Multimodal Perception and Physics-Constrained Action Generation in Autonomous Driving
Sensors doi: 10.3390/s25196163
Authors:
Maoning Ge
Kento Ohtani
Yingjie Niu
Yuxiao Zhang
Kazuya Takeda
Autonomous driving in complex real-world environments requires robust perception, reasoning, and physically feasible planning, which remain challenging for current end-to-end approaches. This paper introduces VLA-MP, a unified vision-language-action framework that integrates multimodal Bird’s-Eye View (BEV) perception, vision-language alignment, and a GRU-bicycle dynamics cascade adapter for physics-informed action generation. The system constructs structured environmental representations from RGB images and LiDAR, aligns scene features with natural language instructions through a cross-modal projector and a large language model, and converts high-level semantic hidden-state outputs into executable, physically consistent trajectories. Experiments on the LMDrive dataset and the CARLA simulator demonstrate that VLA-MP achieves high performance across the LangAuto benchmark series, with best driving scores of 44.3, 63.5, and 78.4 on LangAuto, LangAuto-Short, and LangAuto-Tiny, respectively, while maintaining high infraction scores of 0.89–0.95, outperforming recent VLA methods such as LMDrive and AD-H. Visualization and video results further validate the framework’s ability to follow complex language-conditioned instructions, adapt to dynamic environments, and prioritize safety. These findings highlight the potential of combining multimodal perception, language reasoning, and physics-aware adapters for robust and interpretable autonomous driving.
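The abstract does not detail the GRU-bicycle dynamics cascade adapter, but the general idea of decoding language-model hidden states into controls and integrating them through a kinematic bicycle model can be illustrated. The sketch below is an assumption-laden interpretation, not the paper's implementation: all dimensions, the (acceleration, steering) control parameterization, the planning horizon, and the wheelbase value are hypothetical.

```python
import torch
import torch.nn as nn

class GRUBicycleAdapter(nn.Module):
    """Illustrative cascade adapter (not the paper's code): a GRU decodes
    LLM hidden states into per-step controls, and a kinematic bicycle model
    integrates them into a physically consistent trajectory."""

    def __init__(self, hidden_dim=512, gru_dim=256, horizon=10,
                 wheelbase=2.8, dt=0.5):
        super().__init__()
        self.horizon = horizon      # number of predicted waypoints (assumed)
        self.wheelbase = wheelbase  # vehicle wheelbase in meters (assumed)
        self.dt = dt                # integration step in seconds (assumed)
        self.gru = nn.GRU(hidden_dim, gru_dim, batch_first=True)
        # Head predicts (acceleration, steering angle) at each step.
        self.control_head = nn.Linear(gru_dim, 2)

    def forward(self, llm_hidden):           # (B, T, hidden_dim), T >= horizon
        feats, _ = self.gru(llm_hidden)       # (B, T, gru_dim)
        controls = self.control_head(feats[:, -self.horizon:, :])  # (B, H, 2)
        accel = controls[..., 0]
        steer = 0.5 * torch.tanh(controls[..., 1])  # bound steering (rad)

        # Roll out the kinematic bicycle model from the ego frame origin,
        # so every waypoint obeys the vehicle's motion constraints.
        B = llm_hidden.size(0)
        x = torch.zeros(B, device=llm_hidden.device)
        y, yaw, v = torch.zeros_like(x), torch.zeros_like(x), torch.zeros_like(x)
        waypoints = []
        for t in range(self.horizon):
            v = v + accel[:, t] * self.dt
            x = x + v * torch.cos(yaw) * self.dt
            y = y + v * torch.sin(yaw) * self.dt
            yaw = yaw + (v / self.wheelbase) * torch.tan(steer[:, t]) * self.dt
            waypoints.append(torch.stack([x, y], dim=-1))
        return torch.stack(waypoints, dim=1)  # (B, H, 2) ego-frame trajectory

# Usage sketch: hidden states from a hypothetical LLM backbone.
adapter = GRUBicycleAdapter()
trajectory = adapter(torch.randn(2, 16, 512))  # -> shape (2, 10, 2)
```

The design point this is meant to convey is that the network only outputs bounded controls, while the bicycle-model rollout guarantees the resulting waypoints are kinematically feasible, which matches the paper's stated goal of physics-consistent action generation.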