- Title
Large language models approach expert-level clinical knowledge and reasoning in ophthalmology: A head-to-head cross-sectional study.
- Authors
Thirunavukarasu, Arun James; Mahmood, Shathar; Malem, Andrew; Foster, William Paul; Sanghera, Rohan; Hassan, Refaat; Zhou, Sean; Wong, Shiao Wei; Wong, Yee Ling; Chong, Yu Jeat; Shakeel, Abdullah; Chang, Yin-Hsi; Tan, Benjamin Kye Jyn; Jain, Nikhil; Tan, Ting Fang; Rauz, Saaeha; Ting, Daniel Shu Wei; Ting, Darren Shu Jeng
- Abstract
Large language models (LLMs) underlie remarkable recent advances in natural language processing, and they are beginning to be applied in clinical contexts. We aimed to evaluate the clinical potential of state-of-the-art LLMs in ophthalmology using a more robust benchmark than raw examination scores. We trialled GPT-3.5 and GPT-4 on 347 ophthalmology questions before GPT-3.5, GPT-4, PaLM 2, LLaMA, expert ophthalmologists, and doctors in training were trialled on a mock examination of 87 questions. Performance was analysed with respect to question subject and type (first order recall and higher order reasoning). Masked ophthalmologists graded the accuracy, relevance, and overall preference of GPT-3.5 and GPT-4 responses to the same questions. The performance of GPT-4 (69%) was superior to that of GPT-3.5 (48%), LLaMA (32%), and PaLM 2 (56%). GPT-4 compared favourably with expert ophthalmologists (median 76%, range 64–90%), ophthalmology trainees (median 59%, range 57–63%), and unspecialised junior doctors (median 43%, range 41–44%). Low agreement between LLMs and doctors reflected idiosyncratic differences in knowledge and reasoning, with overall consistency across subjects and types (p>0.05). All ophthalmologists preferred GPT-4 responses over GPT-3.5 and rated the accuracy and relevance of GPT-4 as higher (p<0.05). LLMs are approaching expert-level knowledge and reasoning skills in ophthalmology. In view of their comparable or superior performance to trainee-grade ophthalmologists and unspecialised junior doctors, state-of-the-art LLMs such as GPT-4 may provide useful medical advice and assistance where access to expert ophthalmologists is limited. Clinical benchmarks provide useful assays of LLM capabilities in healthcare before clinical trials can be designed and conducted.

Author summary: Large language models (LLMs) are the most sophisticated form of language-based artificial intelligence. LLMs have the potential to improve healthcare, and experiments and trials are ongoing to explore potential avenues for LLMs to improve patient care. Here, we test state-of-the-art LLMs on challenging questions used to assess the aptitude of eye doctors (ophthalmologists) in the United Kingdom before they can be deemed fully qualified. We compare the performance of these LLMs with that of fully trained ophthalmologists as well as doctors in training to gauge the aptitude of the LLMs for providing advice to patients about eye health. One of the LLMs, GPT-4, exhibits favourable performance when compared with fully qualified and training ophthalmologists, and comparisons with its predecessor model, GPT-3.5, indicate that this superior performance is due to improved accuracy and relevance of model responses. LLMs are approaching expert-level ophthalmological knowledge and reasoning, and may be useful for providing eye-related advice where access to healthcare professionals is limited. Further research is required to explore potential avenues of clinical deployment.
- Subjects
EDUCATION of physicians; MEDICAL logic; CROSS-sectional method; PEARSON correlation (Statistics); OPHTHALMOLOGISTS; T-test (Statistics); RESEARCH funding; DATA analysis; BENCHMARKING (Management); RESEARCH evaluation; HEALTH; FISHER exact test; ARTIFICIAL intelligence; NATURAL language processing; INFORMATION resources; PROFESSIONAL licensure examinations; CHI-squared test; PHYSICIANS' attitudes; OPHTHALMOLOGY; PROFESSIONS; HOSPITAL medical staff; STATISTICS; MEMORY; DATA analysis software; USER interfaces
- Publication
PLOS Digital Health, 2024, Vol. 3, Issue 4, p. 1
- ISSN
2767-3170
- Publication type
Article
- DOI
10.1371/journal.pdig.0000341