JCM, Vol. 15, Pages 117: AI Decision-Making Performance in Maternal–Fetal Medicine: Comparison of ChatGPT-4, Gemini, and Human Specialists in a Cross-Sectional Case-Based Study


Journal of Clinical Medicine doi: 10.3390/jcm15010117

Authors:
Matan Friedman
Amit Slouk
Noa Gonen
Laura Guzy
Yael Ganor Paz
Kira Nahum Sacks
Amihai Rottenstreich
Eran Weiner
Ohad Gluck
Ilia Kleiner

Background/Objectives: Large Language Models (LLMs), including ChatGPT-4 and Gemini, are increasingly incorporated into clinical care; however, their reliability within maternal–fetal medicine (MFM), a high-risk field in which diagnostic and management errors may affect both the pregnant patient and the fetus, remains uncertain. The objective was to evaluate the alignment of AI-generated case-management recommendations with those of MFM specialists, with emphasis on accuracy, agreement, and clinical relevance.

Study Design and Setting: Cross-sectional study with blinded online evaluation (November–December 2024); evaluators were blinded to responder identity (AI vs. human), and case order and response labels were randomized for each evaluator using a computer-generated sequence to reduce order and identification bias.

Methods: Twenty hypothetical MFM cases were constructed, allowing standardized presentation of complex scenarios without patient-identifiable data and enabling consistent comparison of AI-generated and human-specialist recommendations. Responses were generated by ChatGPT-4, Gemini, and three MFM specialists, then assessed by 22 blinded board-certified MFM evaluators using a 10-point Likert scale. Agreement was measured with Spearman's rho (ρ) and Cohen's kappa (κ); accuracy differences were tested with Wilcoxon signed-rank tests.

Results: ChatGPT-4 exhibited moderate alignment (mean 6.6 ± 2.95; ρ = 0.408; κ = 0.232, p < 0.001), performing well in routine, guideline-driven scenarios (e.g., term oligohydramnios, well-controlled gestational hypertension, GDMA1). Gemini scored 7.0 ± 2.64 but showed essentially no consistent inter-rater agreement (κ = −0.024, p = 0.352): although its mean scores were slightly higher, evaluators varied widely in how they judged individual Gemini responses. No significant difference was found between ChatGPT-4 and clinicians in median accuracy scores (Wilcoxon p = 0.18), whereas Gemini showed significantly lower accuracy (p < 0.01). Model performance varied primarily with case complexity: agreement was higher in straightforward, guideline-based scenarios and more variable in complex cases, and no consistent pattern was observed by gestational age or specific clinical domain across the 20 cases.

Conclusions: AI shows promise in routine MFM decision-making but remains constrained in complex cases, where models sometimes under-prioritize maternal–fetal risk trade-offs or incompletely address alternative management pathways, warranting cautious integration into clinical practice. Generalizability is limited by the small number of simulated cases and by the use of hypothetical vignettes rather than real-world clinical encounters.
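As a rough illustration of the agreement statistics named in the Methods, the sketch below computes Spearman's ρ, Cohen's κ, and a Wilcoxon signed-rank test using standard Python libraries. The score arrays are hypothetical placeholders, not data from the study, and treating the raw 10-point Likert values directly as categories for κ is an assumption; the abstract does not state the paper's exact binning.

```python
# Minimal sketch of the agreement analysis described in the abstract.
# The arrays below are illustrative placeholders, NOT data from the study.
import numpy as np
from scipy.stats import spearmanr, wilcoxon
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical 10-point Likert ratings for the same 20 cases:
# one rating per case for an AI responder and a human specialist.
ai_scores = rng.integers(1, 11, size=20)      # e.g., ChatGPT-4 ratings
human_scores = rng.integers(1, 11, size=20)   # e.g., specialist ratings

# Rank correlation between AI and specialist ratings (Spearman's rho).
rho, rho_p = spearmanr(ai_scores, human_scores)

# Chance-corrected categorical agreement (Cohen's kappa); here the
# discrete Likert values are treated directly as categories, which is
# an assumption about the study's analysis.
kappa = cohen_kappa_score(ai_scores, human_scores)

# Paired, non-parametric comparison of per-case accuracy scores
# (Wilcoxon signed-rank test on the case-wise differences).
w_stat, w_p = wilcoxon(ai_scores, human_scores)

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Cohen's kappa = {kappa:.3f}")
print(f"Wilcoxon W = {w_stat:.1f} (p = {w_p:.3f})")
```

In the study itself, ratings came from 22 evaluators across multiple responders, so the published κ values presumably aggregate over rater pairs; the two-array case above is only the simplest form of each statistic.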



Source: www.mdpi.com