- Title
Closing the gap between open source and commercial large language models for medical evidence summarization.
- Authors
Zhang, Gongbo; Jin, Qiao; Zhou, Yiliang; Wang, Song; Idnay, Betina; Luo, Yiming; Park, Elizabeth; Nestor, Jordan G.; Spotnitz, Matthew E.; Soroush, Ali; Campion Jr., Thomas R.; Lu, Zhiyong; Weng, Chunhua; Peng, Yifan
- Abstract
Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short of proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance. Using a benchmark dataset, MedReview, consisting of 8161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs: PRIMERA, LongT5, and Llama-2. Overall, the performance of all open-source models improved after fine-tuning. The performance of fine-tuned LongT5 is close to that of GPT-3.5 in zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. These trends of improvement were manifested in both a human evaluation and a larger-scale GPT-4-simulated evaluation.
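The abstract does not name the automatic metric used to compare fine-tuned and zero-shot summaries, but ROUGE-style n-gram overlap is the standard choice for summarization benchmarks such as MedReview. A minimal sketch of a ROUGE-1-style unigram F1 score, written here purely for illustration (not the paper's actual evaluation code):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style F1: unigram overlap between a reference
    summary and a model-generated candidate summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Multiset intersection counts each shared token at most
    # min(ref_count, cand_count) times.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

In practice, published studies use library implementations (with stemming and multi-sentence handling) rather than a bare token-overlap score, and the paper also reports human and GPT-4-simulated judgments that such lexical metrics cannot capture.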
- Subjects
COMPUTER simulation; PEARSON correlation (Statistics); RESEARCH funding; T-test (Statistics); TASK performance; PROBABILITY theory; EVALUATION of human services programs; NATURAL language processing; DESCRIPTIVE statistics; MEDICAL databases; ARTIFICIAL neural networks; DEEP learning; COMPUTER networks; EVIDENCE-based medicine; CONFIDENCE intervals; COMPUTER assisted instruction; COMPARATIVE studies
- Publication
NPJ Digital Medicine, 2024, Vol 7, Issue 1, p1
- ISSN
2398-6352
- Publication type
Article
- DOI
10.1038/s41746-024-01239-w