- Title
Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model.
- Authors
Soong, David; Sridhar, Sriram; Si, Han; Wagner, Jan-Samuel; Sá, Ana Caroline Costa; Yu, Christina Y.; Karagoz, Kubra; Guan, Meijian; Kumar, Sanyam; Hamadeh, Hisham; Higgs, Brandon W.
- Abstract
Large language models (LLMs) have made a significant impact on the field of general artificial intelligence. General-purpose LLMs exhibit strong logic and reasoning skills and broad world knowledge but can sometimes generate misleading results when prompted on specific subject areas. LLMs trained with domain-specific knowledge can reduce the generation of misleading information (i.e. hallucinations) and enhance the precision of LLMs in specialized contexts. Training new LLMs on specific corpora, however, can be resource intensive. Here we explored the use of a retrieval-augmented generation (RAG) model, which we tested on literature specific to a biomedical research area. OpenAI's GPT-3.5, GPT-4, Microsoft's Prometheus, and a custom RAG model were used to answer 19 questions pertaining to diffuse large B-cell lymphoma (DLBCL) disease biology and treatment. Eight independent reviewers assessed LLM responses for accuracy, relevance, and readability, rating each response on a 3-point scale per category. These scores were then used to compare LLM performance. Performance varied across scoring categories. On accuracy and relevance, the RAG model outperformed the other models, with higher average scores and the most top scores across questions. GPT-4 was closer to the RAG model on relevance than on accuracy. By the same measures, GPT-4 and GPT-3.5 had the highest readability scores among the LLMs compared. GPT-4 and GPT-3.5 also produced more answers with hallucinations than the other LLMs, owing to non-existent references and inaccurate responses to clinical questions. Our findings suggest that an oncology research-focused RAG model may outperform general-purpose LLMs in accuracy and relevance when answering subject-related questions. This framework can be tailored to Q&A in other subject areas. Further research will help clarify the impact of LLM architectures, RAG methodologies, and prompting techniques on question answering across different subject areas.
Author summary: Large language models (LLMs) have recently made a significant impact on the field of general artificial intelligence and are beginning to be adopted for a variety of tasks across industries. Their utility in generating precise information for specific subject areas is actively being explored. Here we present an application of a retrieval-augmented generation (RAG) LLM, which draws on literature specific to cancer research, and compare its performance to three general-purpose LLMs (e.g. GPT-4) in answering questions specific to cancer research. We found that the RAG model produced generally more accurate and relevant answers to questions about the treatment and biology of a specific blood cancer, while the general-purpose LLMs GPT-4 and GPT-3.5 gave generally more readable answers but with more instances of incorrect information (i.e. hallucinations). This work showcases a practical application of LLMs in cancer research and attempts to evaluate how augmenting LLMs with credible source information can improve their utility in a research setting.
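The abstract describes the RAG approach only at a high level: domain passages are retrieved and prepended to the question before generation. The study's actual pipeline is not reproduced here, so the following is a minimal illustrative sketch under assumptions of my own: a TF-IDF retriever over a few hypothetical DLBCL passages, with invented helpers (retrieve, build_prompt) and the LLM call itself left as a placeholder rather than the authors' method.

```python
# Minimal retrieval-augmented prompting sketch (illustrative; not the study's pipeline).
# Assumes a tiny hypothetical DLBCL corpus and leaves the downstream LLM call out.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "DLBCL is the most common aggressive non-Hodgkin lymphoma subtype ...",
    "R-CHOP remains a standard first-line regimen for DLBCL ...",
    "Cell-of-origin subtypes (GCB vs. ABC) influence DLBCL prognosis ...",
]

vectorizer = TfidfVectorizer().fit(corpus)
doc_vectors = vectorizer.transform(corpus)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question (TF-IDF cosine similarity)."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors).ravel()
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(question: str) -> str:
    """Prepend retrieved context so the model is asked to answer from source passages."""
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# The augmented prompt would then be sent to a generative LLM (e.g. GPT-4).
print(build_prompt("What is a standard first-line treatment for DLBCL?"))
```

In the study itself the retriever draws on a curated oncology literature corpus and the answers were scored by eight reviewers; the sketch above only shows the retrieve-then-prompt pattern that such a pipeline follows.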
- Subjects
LANGUAGE & languages; ARTIFICIAL intelligence; DIGITAL health; DESCRIPTIVE statistics; MATHEMATICAL models; INFORMATION retrieval; THEORY; DATA analysis software; RELIABILITY (Personality trait)
- Publication
PLoS Digital Health, 2024, Vol 3, Issue 8, p1
- ISSN
2767-3170
- Publication type
Article
- DOI
10.1371/journal.pdig.0000568