MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework
Project Overview
The document explores the integration of generative AI in education, particularly medical training, by introducing MedQA-CS, a framework that evaluates large language models (LLMs) on clinical skills through a novel AI-SCE approach inspired by Objective Structured Clinical Examinations (OSCEs). The framework addresses the limitations of traditional assessment methods, notably multiple-choice questions, by combining quantitative and qualitative measures to benchmark LLM performance on tasks such as information gathering and differential diagnosis. Applying AI tools to the assessment of medical students is shown to improve accuracy and provide detailed feedback, enhancing the overall educational experience. The document also highlights significant challenges, including data privacy concerns, the need for careful implementation, and the potential for algorithmic bias. It further reviews how various language models perform at generating clinical questions and conducting medical examinations: while models such as GPT-4o and Claude Opus demonstrate strong capabilities, many smaller models struggle with repetition and limited contextual understanding, which reduces their effectiveness and originality in question generation and evaluation. Overall, the findings indicate that generative AI holds promise for improving medical education assessments, but its limitations and challenges must be weighed carefully for successful implementation.
Key Applications
AI Evaluation and Training Tools
Context: AI tools, particularly large language models (LLMs), are used to evaluate clinical skills and enhance the training of medical students through simulated patient encounters and clinical exam responses. These tools assess responses, provide feedback, and simulate interactions to improve diagnostic skills.
Implementation: AI frameworks benchmark LLMs on clinical skills tasks drawn from real-world scenarios. Student responses are evaluated against predefined rubrics covering exam coverage, relevance, and accuracy, and LLMs are also used to simulate a medical student's inquiries during clinical assessments, enhancing engagement and interaction. A minimal sketch of this rubric-based grading appears after the challenges list below.
Outcomes:
- Improved assessment accuracy and detailed feedback for students.
- Enhanced engagement in clinical scenarios and high correlation with expert evaluations.
- A more rigorous assessment of clinical capabilities supports the integration of AI in medical education.
Challenges:
- LLMs score lower on clinical skills tasks than on traditional knowledge assessments, raising reliability concerns.
- Data privacy concerns and potential biases in AI evaluation.
- Repetition in generated questions and limited contextual understanding, with performance varying across models.
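The rubric-based evaluation described under Implementation can be sketched as an LLM-as-judge loop. The Python sketch below is illustrative rather than the authors' released code: the rubric dimensions (coverage, relevance, accuracy) come from the description above, while the grade_response helper, the judge model name, and the JSON output schema are assumptions.

```python
# Illustrative LLM-as-judge rubric grading; not the MedQA-CS reference code.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = {
    "coverage": "Does the response cover the required exam components?",
    "relevance": "Is the content relevant to the patient scenario?",
    "accuracy": "Are the clinical statements factually correct?",
}

def grade_response(case: str, student_answer: str) -> dict:
    """Score one clinical-skills answer against the rubric (0-5 per dimension)."""
    criteria = "\n".join(f"- {k}: {v}" for k, v in RUBRIC.items())
    prompt = (
        f"Clinical case:\n{case}\n\n"
        f"Student response:\n{student_answer}\n\n"
        f"Score each criterion from 0 to 5 and justify briefly:\n{criteria}\n\n"
        'Reply as JSON: {"coverage": int, "relevance": int, '
        '"accuracy": int, "justification": str}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed; any capable judge model could serve here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

In practice, judge scores like these would be validated against expert graders, in line with the correlation with expert evaluations noted in the outcomes above.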
Implementation Barriers
Technical
Existing benchmarks primarily measure clinical knowledge rather than practical skills, limiting their usefulness for real-world evaluation. The integration of AI tools into educational settings may also face technical hurdles, such as compatibility with existing systems and infrastructure. In addition, limited context length and weaker instruction understanding in smaller models lead to repetitive question generation.
Proposed Solutions: Develop AI-SCE frameworks that incorporate practical assessments reflecting real clinical scenarios. Invest in robust IT support and training so educators can use AI tools effectively in their assessments. Prefer larger models that handle extended context and follow instructions more reliably; a simple context-length guard is sketched below.
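One concrete mitigation for the context-length barrier is to measure prompt size before dispatch and route long prompts to a larger-context model. The sketch below is a minimal illustration, assuming the tiktoken tokenizer library; the model names, context limits, and pick_model helper are hypothetical and not part of MedQA-CS.

```python
# Minimal context-length guard; model names and limits are illustrative.
import tiktoken

# Hypothetical routing table: model -> context window (tokens).
CONTEXT_LIMITS = {"small-model": 8_192, "large-model": 128_000}

def pick_model(prompt: str, reserve_for_output: int = 1_024) -> str:
    """Route the prompt to the smallest model whose window still fits it."""
    enc = tiktoken.get_encoding("o200k_base")  # tokenizer choice is an assumption
    n_tokens = len(enc.encode(prompt))
    for model, limit in sorted(CONTEXT_LIMITS.items(), key=lambda kv: kv[1]):
        if n_tokens + reserve_for_output <= limit:
            return model
    raise ValueError(f"Prompt of {n_tokens} tokens exceeds all context windows")
```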
Data Quality
The quality of LLM-generated outputs does not consistently align with expert evaluations.
Proposed Solutions: Use expert-designed evaluation criteria and prompts to guide LLM performance; one way to measure the resulting alignment with experts is sketched below.
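Whether expert-designed criteria and prompts actually bring LLM outputs in line with expert judgment can be checked by correlating the two sets of scores on a shared sample. A minimal sketch, assuming scipy; the paired score lists are invented placeholders, and this illustrates the check rather than the paper's exact protocol.

```python
# Check LLM-judge vs. expert agreement on the same graded answers.
# scipy is assumed; the score lists are hypothetical placeholders.
from scipy.stats import spearmanr

expert_scores = [4, 5, 3, 2, 5, 4, 3]   # expert rubric totals per answer
llm_scores    = [4, 4, 3, 2, 5, 5, 3]   # LLM-judge totals for the same answers

rho, p_value = spearmanr(expert_scores, llm_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A low correlation signals that the rubric or prompt needs expert refinement.
```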
Ethical
Using AI in student evaluations raises data privacy concerns and broader ethical questions.
Proposed Solutions: Adopt data protection regulations and ethical guidelines for AI use in educational contexts.
Bias
Biases in AI algorithms may affect the fairness of evaluations.
Proposed Solutions: Regularly audit and update AI algorithms to minimize bias and ensure equitable assessments; a simple subgroup audit is sketched below.
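One lightweight form of such an audit compares the LLM judge's score distributions across student subgroups. The sketch below is purely illustrative: the subgroup labels, scores, and threshold policy are invented, not an audit reported in the paper.

```python
# Toy subgroup score audit; data and group labels are invented for illustration.
from collections import defaultdict
from statistics import mean

# (subgroup, rubric score) pairs for answers graded by the LLM judge.
graded = [("A", 4), ("A", 5), ("A", 3), ("B", 3), ("B", 2), ("B", 3)]

by_group = defaultdict(list)
for group, score in graded:
    by_group[group].append(score)

means = {g: mean(scores) for g, scores in by_group.items()}
gap = max(means.values()) - min(means.values())
print(f"Mean score per subgroup: {means}; max gap = {gap:.2f}")
# A gap above a pre-registered threshold would trigger a rubric/prompt review.
```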
Evaluation Barrier
The subjective nature of scoring and variability in expert judgments create inconsistencies.
Proposed Solutions: Standardize evaluation criteria and provide clearer guidelines for expert reviewers; inter-rater agreement can then be tracked as sketched below.
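Once criteria are standardized, reviewer consistency can be tracked quantitatively, for example with Cohen's kappa. A minimal sketch, assuming scikit-learn; the two reviewers' scores are invented placeholders.

```python
# Inter-rater agreement between two expert reviewers; labels are invented.
from sklearn.metrics import cohen_kappa_score

reviewer_1 = [3, 4, 4, 2, 5, 3, 4]  # rubric scores from reviewer 1
reviewer_2 = [3, 4, 3, 2, 5, 4, 4]  # rubric scores from reviewer 2

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa = {kappa:.2f}")
# Low kappa suggests the scoring guidelines still leave room for interpretation.
```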
Project Team
Zonghai Yao
Researcher
Zihao Zhang
Researcher
Chaolong Tang
Researcher
Xingyu Bian
Researcher
Youxia Zhao
Researcher
Zhichao Yang
Researcher
Junda Wang
Researcher
Huixue Zhou
Researcher
Won Seok Jang
Researcher
Feiyun Ouyang
Researcher
Hong Yu
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI