A recent study has explored the capabilities of GPT-4 with vision (GPT-4V), the multimodal version of OpenAI’s generative pretrained transformer, in handling radiology-related tasks. As large language models (LLMs) like ChatGPT become more integrated into professional fields, understanding their baseline performance in specialized areas, such as radiology, is crucial.
The study aimed to evaluate GPT-4V’s ability to answer questions from the American College of Radiology’s Diagnostic Radiology In-Training Examinations. Researchers tested the model using 386 retired exam questions, which included both image-based and text-only questions, to gauge how well it performs on tasks that radiologists in training would face.
The results were mixed. Overall, GPT-4V correctly answered 65.3% of the 377 unique questions. However, its performance varied significantly between text-based and image-based questions. The model excelled at answering text-only questions, achieving an accuracy rate of 81.5%. In contrast, its accuracy dropped to 47.8% on image-based questions, indicating that while GPT-4V can process and analyze text effectively, it struggles with interpreting radiologic images.
Further analysis examined how different prompting styles affected GPT-4V’s performance. For text-based questions, the study found that chain-of-thought prompting—a method that guides the model through a step-by-step reasoning process before it commits to an answer—outperformed the other prompting styles tested, improving accuracy by up to 8.9% over the original, unmodified prompt. The choice of prompt had little effect on the model’s performance with image-based questions, however, underscoring the challenges GPT-4V faces in visual interpretation.
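To make the distinction concrete, the sketch below contrasts a plain multiple-choice prompt with a chain-of-thought variant. The prompt wording and the helper function are hypothetical illustrations of the general technique, not the study’s actual prompts:

```python
# Illustrative sketch of two prompting styles: a plain multiple-choice
# prompt versus a chain-of-thought prompt that asks the model to reason
# step by step first. Wording is hypothetical, not taken from the study.

def build_prompt(question: str, choices: list[str], chain_of_thought: bool = False) -> str:
    """Assemble a multiple-choice prompt, optionally adding a
    chain-of-thought instruction before asking for the answer."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    if chain_of_thought:
        instruction = ("Think through the problem step by step, explaining "
                       "your reasoning, then state the single best answer.")
    else:
        instruction = "State the single best answer."
    return f"{question}\n{lettered}\n\n{instruction}"

question = "Which imaging modality is typically used first for suspected acute stroke?"
choices = ["CT", "MRI", "Ultrasound", "PET"]

plain = build_prompt(question, choices)
cot = build_prompt(question, choices, chain_of_thought=True)
```

Both strings present the same question and lettered options; only the final instruction differs, which is what lets a study attribute any accuracy gain to the reasoning scaffold rather than to the question content.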
The findings suggest that while GPT-4V shows promise in handling certain radiology-related tasks, particularly those that are text-heavy, it is less reliable when it comes to interpreting medical images—a critical component of radiology. These results highlight the current limitations of AI in this field and indicate that while LLMs can be useful tools, they are not yet ready to replace human expertise in radiologic diagnostics.
