Diagnostics, Vol. 16, Pages 113: Concordance Between the Multidisciplinary Team and ChatGPT-4o Decisions: A Blinded, Cross-Sectional Concordance Study in Systemic Autoimmune Rheumatic Diseases


Diagnostics doi: 10.3390/diagnostics16010113

Authors:
Firdevs Ulutaş
Göksel Altınışık
Gülay Güngör
Vefa Çakmak
Nilüfer Yiğit
Duygu Herek
Murat Yiğit
Uğur Karasu
Veli Çobankara

Background/Objective: In recent years, artificial intelligence (AI) has gained increasing prominence in diagnostic decision-making in medicine. The aim of this study was to compare multidisciplinary team (MDT: rheumatology, pulmonology, thoracic radiology) decisions with single-session plans generated by ChatGPT-4o. Methods: This cross-sectional concordance study included adults (≥18 years) with confirmed systemic autoimmune rheumatic disease (SARD) who had an MDT decision within the last 6 months. Diagnostic, treatment, and monitoring decisions were documented by recording answers to six essential questions: (1) What is the most likely clinical diagnosis? (2) What is the most likely radiological diagnosis? (3) Is there a need for anti-inflammatory treatment? (4) Is there a need for antifibrotic treatment? (5) Is drug-free follow-up appropriate? and (6) Are additional investigations required? All evaluations were performed with ChatGPT-4o in a single-session format using a standardized single-prompt template, with the system blinded to MDT decisions. All data analyses were conducted using the R programming language (version 4.3.2). Agreement between AI-generated and MDT decisions was assessed using Cohen's kappa (κ) statistic, with κ values interpreted as follows: ≤0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial, >0.80 = almost perfect agreement. These agreement analyses were performed using the irr and psych packages in R. Statistical significance of the models was evaluated through p-values, while overall model fit was assessed using the Likelihood Ratio Test. Results: A total of 47 patients were included in this study, with a predominance of female patients (61.70%, n = 29). The mean age was 61.74 ± 10.40 years. The most frequently observed diagnosis was rheumatoid arthritis (RA), accounting for 31.91% of cases (n = 15).
This was followed by anti-neutrophil cytoplasmic antibody (ANCA)-associated vasculitis, interstitial pneumonia with autoimmune features (IPAF), and sarcoidosis. Agreement was statistically significant across all decision types. For clinical diagnosis decisions, agreement was moderate (κ = 0.52), suggesting that the AI system can reach partially consistent conclusions in diagnostic processes. Decisions on the need for immunosuppressive treatment and on drug-free follow-up showed higher concordance, falling in the substantial range (κ = 0.64 and κ = 0.67, respectively). For antifibrotic treatment decisions, agreement was moderate (κ = 0.49), and radiological diagnosis decisions also fell within the moderate range (κ = 0.55). The lowest agreement, though still moderate, was observed for decisions on whether further investigations were required (κ = 0.45). Conclusions: In patients with SARDs with pulmonary involvement, particularly in complex cases, moderate to substantial concordance (κ = 0.45–0.67) was observed between MDT decisions and AI-generated recommendations regarding prioritization of clinical and radiologic diagnoses, treatment selection, suitability for drug-free follow-up, and the need for further diagnostic investigations.
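
To make the agreement statistic concrete, the following is a minimal Python sketch of Cohen's kappa together with the interpretation bands quoted in the Methods. It is not the authors' code: the study reports using the irr and psych packages in R, and the function names (`cohens_kappa`, `agreement_band`) and the twelve yes/no decisions below are hypothetical, invented purely for illustration. Kappa is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of matching decisions and p_e is the agreement expected by chance from each rater's marginal frequencies.

```python
from collections import Counter
import math


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if math.isclose(expected, 1.0):
        # Both raters used one identical label throughout; kappa is undefined (0/0).
        return float("nan")
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map kappa to the bands listed in the abstract (values at or below 0.20 are 'slight')."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Invented example data: one yes/no decision per patient, NOT taken from the study.
mdt_decisions = ["yes"] * 6 + ["no"] * 6
ai_decisions = ["yes"] * 5 + ["no"] * 6 + ["yes"]

kappa = cohens_kappa(mdt_decisions, ai_decisions)
print(f"kappa = {kappa:.2f} ({agreement_band(kappa)})")
```

In this invented example the two raters agree on 10 of 12 cases, and because each says "yes" half the time the chance agreement is 0.5, so raw agreement of about 83% shrinks to a kappa of about 0.67. This is why the reported κ values are more conservative than simple percent agreement would be.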


