Expert-Level Accuracy of GPT-4V in Medicine Conceals Hidden Flaws


A recent study has revealed that OpenAI’s Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians on medical challenge tasks, specifically in multiple-choice question accuracy. However, this impressive achievement comes with important caveats.

While GPT-4V achieved 81.6% accuracy on NEJM Image Challenges, compared with 77.8% for human physicians, a closer examination revealed that the AI’s reasoning often fell short. Unlike previous studies that focused solely on answer accuracy, this one scrutinized GPT-4V’s ability to comprehend images, recall medical knowledge, and perform step-by-step reasoning.

Key findings include:

  1. Accuracy on Multiple-Choice Questions: GPT-4V achieved an accuracy rate of 81.6%, slightly higher than human physicians’ 77.8%. It also performed well on cases where physicians failed, answering over 78% of such questions correctly.
  2. Flawed Rationales: Despite its high accuracy, GPT-4V frequently provided flawed rationales, especially in image comprehension. Among correctly answered questions, 35.5% had deficient rationales, with errors in image comprehension alone accounting for 27.2%.
  3. Reliability in Medical Knowledge Recall: The model was most reliable in recalling medical knowledge, with lower error rates ranging from 11.6% to 13.0%.

The study underscores the need for thorough evaluations of AI rationales before integrating such models into clinical workflows. While GPT-4V shows great promise, particularly in decision support roles, its current limitations in rationalizing decisions based on visual data highlight the necessity for cautious and incremental adoption in clinical settings. Further research and development are crucial to ensure these tools can reliably augment human expertise without compromising patient care.
