How ChatGPT struggles with field-specific relevance

March 28, 2024

In a recent study published in The Annals of Family Medicine, researchers evaluated how well Chat Generative Pretrained Transformer (ChatGPT) summarizes medical abstracts. The aim was to help physicians, who face rapidly expanding clinical knowledge and limited review time, by providing concise, accurate, and unbiased summaries.

Background
In 2020, nearly a million new journal articles were indexed by PubMed, at a time when global medical knowledge is said to double every 73 days. This growth, coupled with clinical models that prioritize productivity, leaves physicians little time to keep up with the literature, even in their own specialties. Artificial Intelligence (AI) and natural language processing offer promising tools to address this challenge. Large Language Models (LLMs) like ChatGPT, which can generate text, summarize, and predict, have gained attention for their potential to help physicians review medical literature efficiently. However, LLMs can “hallucinate,” producing misleading or non-factual text, and may reflect biases from their training data, raising concerns about their responsible use in healthcare.

About the study
In the present study, researchers selected 10 articles from each of 14 journals, covering a broad range of medical topics, article structures, and journal impact factors. They aimed to include diverse study types while excluding non-research materials. All selected articles were published in 2022. Because ChatGPT had been trained on data available until 2021, this ensured that the model had no prior exposure to the content.

The researchers then tasked ChatGPT with summarizing these articles, self-assessing the summaries for quality, accuracy, and bias, and evaluating their relevance across ten medical fields. They limited summaries to 125 words and collected data on the model’s performance in a structured database.

Physician reviewers independently evaluated the ChatGPT-generated summaries, assessing them for quality, accuracy, bias, and relevance with a standardized scoring system. Their review process was carefully structured to ensure impartiality and a comprehensive understanding of the summaries’ utility and reliability.

The study conducted detailed statistical and qualitative analyses to compare the performance of ChatGPT summaries against human assessments. This included examining the alignment between ChatGPT’s article relevance ratings and those assigned by physicians, both at the journal and article levels.

Study results
The study used ChatGPT to condense 140 medical abstracts from 14 diverse journals, most of which used structured formats. The abstracts contained 2,438 characters on average, which ChatGPT reduced by 70% to 739 characters. Physicians rated these summaries highly for quality and accuracy and found minimal bias, a result mirrored in ChatGPT’s self-assessment. Notably, the ratings did not differ significantly across journals or between structured and unstructured abstract formats.
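As a quick check, the figures above can be reproduced with a few lines of Python. This is only a sketch using the numbers reported in this article; the variable names are illustrative and do not come from the study.

```python
# Figures as reported in the article (variable names are illustrative).
journals = 14
articles_per_journal = 10
mean_abstract_chars = 2438
mean_summary_chars = 739

# 10 articles from each of 14 journals.
total_abstracts = journals * articles_per_journal

# Fraction of the original length removed by summarization.
reduction = 1 - mean_summary_chars / mean_abstract_chars

print(f"Abstracts summarized: {total_abstracts}")
print(f"Mean length reduction: {reduction:.1%}")
```

The computed reduction is just under 70%, consistent with the rounded figure quoted above.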

Despite the high ratings, the team identified serious inaccuracies and hallucinations in a small fraction of the summaries. These errors ranged from omitted critical data to misinterpretations of study designs, either of which could alter how the research findings are interpreted. Minor inaccuracies were also noted. These typically involved subtle details that did not drastically change the abstract’s original meaning but could introduce ambiguity or oversimplify complex outcomes.

A key component of the study was examining ChatGPT’s capability to recognize the relevance of articles to specific medical disciplines. The expectation was that ChatGPT could accurately identify the topical focus of journals, aligning with predefined assumptions about their relevance to various medical fields. This hypothesis held true at the journal level, with a significant alignment between the relevance scores assigned by ChatGPT and those by physicians, indicating ChatGPT’s strong ability to grasp the overall thematic orientation of different journals.

However, when evaluating the relevance of individual articles to specific medical specialties, ChatGPT’s performance was less impressive, showing only a modest correlation with human-assigned relevance scores. This discrepancy highlighted a limitation: ChatGPT was generally reliable at the journal level but could not accurately judge how relevant an individual article was to a given specialty.

Further analyses, including sensitivity and quality assessments, revealed a consistent distribution of quality, accuracy, and bias scores across individual and collective human reviews as well as those conducted by ChatGPT. This consistency suggested effective standardization among human reviewers and aligned closely with ChatGPT’s assessments, indicating a broad agreement on the summarization performance despite the challenges identified.

Conclusions
To summarize, the study’s findings indicated that ChatGPT produced concise, accurate, and low-bias summaries, suggesting it could help clinicians screen articles quickly. However, ChatGPT struggled to determine the relevance of articles to specific medical fields, limiting its potential as a digital agent for literature surveillance. The authors acknowledged limitations, such as the study’s focus on high-impact journals and structured abstracts, and highlighted the need for further research. They suggested that future iterations of language models may improve summarization quality and relevance classification, and they advocated responsible AI use in medical research and practice.

News Medical

Medical Buyer
