AgentClinic: a multimodal agent benchmark
Paper Reading: Evaluation of AI in simulated clinical environments
This paper introduces a multimodal benchmark for evaluating LLMs on their ability to operate as agents in simulated clinical environments. Two open medical agent environments are introduced:
AgentClinic-NEJM: a multimodal environment that pairs images with dialogue
AgentClinic-MedQA: a dialogue-only environment
A key takeaway is that the LLM used for the patient agent is an important factor in performance on the AgentClinic benchmark. Both too few and too many interactions reduce the doctor agent's diagnostic accuracy.
This matters because clinical work is a multiplexed, sequential decision-making task: the doctor must handle uncertainty with limited information and finite resources while compassionately caring for patients and eliciting relevant information from them.
To build the pipeline, the authors created four language agents, each serving a different role: a patient agent, a doctor agent, a measurement agent, and a moderator agent.
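A minimal sketch of how such a four-agent loop could be wired together. The `query_llm` helper, the prompts, and the control tokens (`REQUEST TEST`, `DIAGNOSIS READY`) are assumptions for illustration, not the paper's actual implementation:

```python
# Minimal sketch of a four-agent clinical dialogue loop.
# query_llm() is a hypothetical helper that sends a system prompt plus the
# dialogue history to some chat LLM and returns its reply as a string.

def query_llm(system_prompt: str, history: list[str]) -> str:
    """Placeholder for a real chat-completion call (e.g., an OpenAI client)."""
    raise NotImplementedError

def run_case(case: dict, max_turns: int = 20) -> bool:
    """Simulate one doctor-patient encounter; return whether the diagnosis was correct."""
    doctor_sys = "You are a doctor. Ask questions, request tests, then diagnose."
    patient_sys = (f"You are a patient experiencing: {case['hidden_condition']}. "
                   "Answer truthfully but never name your own diagnosis.")
    measurement_sys = f"Return test readings from this record: {case['test_results']}"

    history: list[str] = []
    for _ in range(max_turns):
        doctor_msg = query_llm(doctor_sys, history)
        history.append(f"Doctor: {doctor_msg}")

        if "REQUEST TEST" in doctor_msg:
            # The measurement agent answers test requests instead of the patient.
            reading = query_llm(measurement_sys, history)
            history.append(f"Measurement: {reading}")
        elif "DIAGNOSIS READY" in doctor_msg:
            break
        else:
            patient_msg = query_llm(patient_sys, history)
            history.append(f"Patient: {patient_msg}")

    # The moderator agent grades the doctor's final diagnosis against the answer.
    moderator_sys = (f"The correct diagnosis is {case['diagnosis']}. Reply 'yes' "
                     "if the doctor's final diagnosis matches, otherwise 'no'.")
    verdict = query_llm(moderator_sys, history)
    return verdict.strip().lower().startswith("yes")
```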
Two classes of bias are studied: cognitive bias (e.g., recency bias) and implicit bias. The researchers inject bias prompts into both the patient and doctor agents. The experiments show that GPT-4 is fairly robust to these biases (the drop in accuracy is relatively small), while Mixtral-8x7B degrades more and is therefore better suited for bias studies.
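One way such bias injection could look in code. The prompt wording below is illustrative, not quoted from the paper:

```python
# Hypothetical bias injection: prepend a bias instruction to an agent's
# system prompt before running the benchmark, then compare accuracy
# against the unbiased baseline.

BIAS_PROMPTS = {
    "recency": "You recently saw several patients with a similar presentation "
               "and are inclined to assume this case is the same condition.",
    "implicit": "You hold an unconscious negative assumption about this "
                "patient's demographic, which subtly colors your judgment.",
}

def biased_system_prompt(base_prompt: str, bias: str | None) -> str:
    """Return the agent's system prompt, optionally prefixed with a bias."""
    if bias is None:
        return base_prompt
    return BIAS_PROMPTS[bias] + "\n\n" + base_prompt
```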
As for how the choice of patient language model affects accuracy, the conclusion is that GPT-4 patients yield the highest diagnostic accuracy, since they volunteer additional symptomatic information.
Between the patient and doctor agents, language models show a self-preference bias: a doctor agent tends to perform better when the patient agent runs on the same LLM, which suggests pairing the two.
In the experiment varying the number of patient interactions N, diagnostic accuracy peaks at N=20 (52%).
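Locating that peak amounts to sweeping the interaction budget, roughly as follows (reusing the hypothetical `run_case` from the earlier sketch):

```python
# Sweep the maximum number of patient interactions N and measure accuracy,
# reusing the hypothetical run_case() from the four-agent sketch above.

def accuracy_vs_interactions(cases: list[dict], budgets=(10, 15, 20, 25, 30)):
    results = {}
    for n in budgets:
        solved = sum(run_case(case, max_turns=n) for case in cases)
        results[n] = solved / len(cases)
    return results  # e.g. accuracy per N; the paper reports a peak at N=20
```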
Human clinical experts gave the following feedback on the agents:
Patient Agent:
Overly verbose and unnecessarily repeating the question back to the doctor agent
Doctor Agent:
Providing a bad opening statement
Making basic errors
Overly focusing on a particular diagnosis or not being diligent enough
Flat, neutral tone; doesn't open the dialogue with an inviting question
Measurement Agent:
Occasionally does not return all of the necessary values for a test
As for diagnostic accuracy in the multimodal environment, there are two ways to give image input to the doctor agent (see the sketch after this list):
Providing the image to the doctor agent at the start of the dialogue:
GPT-4o: 47%
GPT-4-turbo & GPT-4-vision-preview: 27%
Providing the image only upon request, via the measurement agent:
GPT-4o: 27%
GPT-4-turbo: 20%
GPT-4-vision-preview: 13%
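The two modes differ only in when the image enters the dialogue. A sketch using the OpenAI chat-completions image format; the model name, prompts, and `REQUEST IMAGE` token are illustrative assumptions, not the paper's setup:

```python
# Sketch of the two image-delivery modes for a multimodal doctor agent,
# using the OpenAI chat-completions image_url content format.
from openai import OpenAI

client = OpenAI()

def doctor_turn(messages):
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

def initial_image_dialogue(image_url: str, case_intro: str):
    """Mode 1: the doctor sees the image from the very first turn."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": case_intro},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]
    return doctor_turn(messages)

def on_request_image_dialogue(image_url: str, case_intro: str):
    """Mode 2: the image is only attached after the doctor asks for it."""
    messages = [{"role": "user", "content": case_intro}]
    reply = doctor_turn(messages)
    messages.append({"role": "assistant", "content": reply})
    if "REQUEST IMAGE" in reply:  # illustrative request token
        messages.append({
            "role": "user",
            "content": [{"type": "image_url", "image_url": {"url": image_url}}],
        })
        reply = doctor_turn(messages)
    return reply
```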
Finally, the paper notes that the current work presents only a simplified clinical environment, with agents representing a patient, a doctor, measurements, and a moderator. Future work may include additional critical actors, such as nurses, patients' relatives, administrators, and insurance contacts.