Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study
 
Authors (7)
Article Type Original Article (full article published in journals indexed in SSCI, AHCI, SCI, or SCI-Expanded)
Journal Name MEDICINA-LITHUANIA (Q1)
Journal ISSN 1010-660X WoS Journal Scopus Journal
Journal Indexed In SCI-Expanded
Article Language English Publication Date 07-2025
Volume / Issue / Pages 61 / 8 / 1–28 DOI 10.3390/medicina61081342
Article Link https://doi.org/10.3390/medicina61081342
UAK Research Areas
Anatomy
Abstract
Background and Objectives: General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. It examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs, and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning capabilities for AIS assessment. Materials and Methods: A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views/patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin’s CCC), inter-rater reliability (Cohen’s κ), and measurement agreement (Bland–Altman LoA). Results: The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0–14.8]), while Claude 2 generated 78.3% false positives. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: −21.45° to +42.92°), exceeding tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ −0.106). Inter-rater reliability fell below random chance (ChatGPT κ = −0.039 …
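As a reading aid for the statistics quoted in the abstract, the sketch below re-derives three of the reported figures from the numbers given there. It rests on assumptions the abstract does not confirm: that the specificity interval is an exact (Clopper-Pearson) binomial interval, that the Bland–Altman limits are bias ± 1.96 SD, and that the 78.3% false-positive rate corresponds to 18 of the 23 non-AIS cases. The function names are illustrative only.

```python
import math

def exact_ci_zero_successes(n, alpha=0.05):
    """Clopper-Pearson interval for 0 successes out of n trials.

    With zero successes the lower bound is 0 and the upper bound has the
    closed form 1 - (alpha/2)**(1/n).
    """
    upper = 1.0 - (alpha / 2.0) ** (1.0 / n)
    return 0.0, upper

def limits_to_bias_sd(lower, upper, z=1.96):
    """Recover bias and SD of differences from limits of agreement,
    assuming LoA = bias +/- z * SD (assumption, not stated in the abstract)."""
    bias = (lower + upper) / 2.0
    sd = (upper - lower) / (2.0 * z)
    return bias, sd

# 23 non-AIS adolescents, none classified correctly by ChatGPT (from the abstract).
n_non_ais = 23
ci_low, ci_high = exact_ci_zero_successes(n_non_ais)
print(f"Specificity 0/{n_non_ais}: 95% CI {100 * ci_low:.1f} to {100 * ci_high:.1f} %")

# Thoracic Cobb angle limits of agreement for ChatGPT (from the abstract).
bias, sd = limits_to_bias_sd(-21.45, 42.92)
print(f"Implied bias {bias:.3f} deg, implied SD of differences {sd:.2f} deg")

# Inferred count: 18 of 23 is the only integer count that rounds to 78.3 %.
false_positive_rate = 18 / n_non_ais
print(f"False-positive rate {100 * false_positive_rate:.1f} %")
```

Under these assumptions the upper confidence bound and the midpoint of the limits of agreement agree with the reported 14.8% and +10.74° to within rounding, which suggests, but does not prove, that the authors used these conventions.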
Keywords
adolescent | scoliosis | artificial intelligence | neural networks | diagnostic errors | clinical competence | photography
UN Sustainable Development Goals
Citation Counts
Web of Science 2
Google Scholar 4