Diagnostics, Vol. 16, Pages 313: Can GPT-5.0 Interpret Thyroid Ultrasound Images? A Comparative TI-RADS Analysis with an Expert Radiologist
Diagnostics doi: 10.3390/diagnostics16020313
Authors:
Yunus Yasar
Sevde Nur Emir
Muhammet Rasit Er
Mustafa Demir
Background/Objectives: Multimodal large language models (LLMs) may directly interpret medical images, including thyroid ultrasound (US) images. Whether these models can reliably assess thyroid nodules, where subtle echogenic and morphological details are critical, remains uncertain. The American College of Radiology (ACR) TI-RADS system provides a structured framework for benchmarking artificial intelligence. This study evaluates GPT-5.0’s ability to interpret thyroid US images according to TI-RADS criteria and contextualizes its performance relative to expert radiologist assessment, using fine-needle aspiration (FNA) cytology as the reference standard.
Methods: This retrospective study included 100 patients (mean age 49.8 ± 12.6 years; 72 women) with cytology-confirmed diagnoses: Bethesda II (benign) or Bethesda V–VI (malignant). Each nodule had longitudinal and transverse US images acquired with high-frequency linear probes. A board-certified radiologist (>10 years’ experience) and GPT-5.0 independently assessed TI-RADS features (composition, echogenicity, shape, margin, echogenic foci) and assigned final categories. Agreement was analyzed using Cohen’s κ, and diagnostic performance was calculated using TR4–TR5 as positive for malignancy.
Results: Agreement was substantial for composition (κ = 0.62), shape (κ = 0.70), and margin (κ = 0.68); moderate for echogenicity (κ = 0.48); and poor for echogenic foci (κ = 0.12). GPT-5.0 demonstrated a systematic, risk-averse tendency to up-classify nodules, leading to increased TR4–TR5 assignments. Overall TI-RADS agreement was 58% (κ = 0.31). The radiologist showed superior diagnostic performance (sensitivity 89%, specificity 85%) compared with GPT-5.0 (sensitivity 67%, specificity 49%), a gap largely driven by false-positive TR4 classifications among benign nodules.
Conclusions: GPT-5.0 recognizes several high-level TI-RADS features but struggles with microcalcifications and tends to overestimate malignancy risk within a risk-stratification framework, limiting its standalone clinical use. Ultrasound-specific training and domain adaptation may enable meaningful adjunctive roles in thyroid nodule assessment.
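The two statistics named in the Methods, Cohen’s κ for inter-rater agreement and sensitivity/specificity with TR4–TR5 counted as positive, can be illustrated with a minimal Python sketch. This is not the authors’ analysis code, and the function names and toy ratings below are hypothetical, invented purely for illustration; they are not the study data and do not reproduce the reported values.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def sensitivity_specificity(tirads, malignant, threshold=4):
    """Treat TI-RADS >= threshold (TR4-TR5) as a positive call."""
    tp = fn = tn = fp = 0
    for category, is_malignant in zip(tirads, malignant):
        positive = category >= threshold
        if is_malignant and positive:
            tp += 1
        elif is_malignant:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one malignant and one benign case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy ratings (TI-RADS categories 2-5), NOT the study data.
radiologist = [2, 3, 4, 5, 3, 4, 5, 2]
model = [3, 4, 4, 5, 4, 4, 4, 2]
malignant = [False, False, True, True, False, True, True, False]

kappa = cohen_kappa(radiologist, model)
model_sens, model_spec = sensitivity_specificity(model, malignant)
reader_sens, reader_spec = sensitivity_specificity(radiologist, malignant)
```

In this toy set the model up-classifies two benign nodules to TR4, which lowers its specificity while leaving sensitivity unchanged. That mirrors the false-positive mechanism the abstract describes, although the numbers themselves are arbitrary.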
