It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education
Project Overview
This document evaluates the role of large language models (LLMs) in medical education, particularly their performance on multiple-choice questions (MCQs) versus free-response questions. It identifies the limitations of MCQs in assessing the medical knowledge and reasoning of LLMs, indicating that existing MCQ benchmarks may inflate apparent capability. To address this, the authors developed FreeMedQA, a benchmark that pairs each multiple-choice question with a free-response counterpart so that LLM performance can be compared across the two formats. The findings reveal a substantial decline in LLM performance on the more demanding free-response questions, underscoring the need for evaluation methods in medical education that prioritize genuine comprehension over pattern recognition. The study carries implications for integrating generative AI into educational assessment: while AI can assist learning, relying solely on traditional multiple-choice formats may not adequately measure true understanding or skill mastery.
Key Applications
FreeMedQA - a benchmark of paired free-response and multiple-choice questions
Context: Medical education for medical students, and the evaluation of AI models intended for clinical settings.
Implementation: Developed a benchmark of paired MCQ and free-response versions of the same medical questions, enabling direct comparison of LLM performance across the two formats (see the scoring sketch after this list).
Outcomes: LLM performance dropped by an average of 39.43% when moving from MCQs to free-response questions, suggesting that existing MCQ benchmarks may overestimate LLM capabilities.
Challenges: Current MCQ benchmarks do not accurately reflect LLM understanding; performance degrades substantially in free-response settings.
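The comparison described above can be made concrete with a small scoring sketch. The Python snippet below is illustrative only, not the authors' code: the exact-match MCQ scoring, the [0, 1] free-response grades, and the use of a relative-drop formula are all assumptions about one reasonable way to implement such an evaluation.

```python
# Minimal sketch of scoring a paired MCQ / free-response evaluation.
# Not the authors' code: the scoring rules and drop formula are assumptions.

def mcq_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Exact-match accuracy for multiple-choice letter answers (e.g. 'B')."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

def free_response_score(grades: list[float]) -> float:
    """Mean of per-item grades in [0, 1], e.g. produced by an LLM grader."""
    return sum(grades) / len(grades)

def relative_drop(mcq_acc: float, free_acc: float) -> float:
    """Relative performance drop (%) going from MCQ to free response."""
    return (mcq_acc - free_acc) / mcq_acc * 100.0

# Illustrative numbers only: 90% MCQ accuracy falling to 55% on the
# free-response versions of the same questions is a ~38.9% relative drop.
print(relative_drop(0.90, 0.55))
```

Whether the paper's reported 39.43% figure is an absolute or relative drop is not stated in this summary; the relative formula above is just one illustrative choice.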
Implementation Barriers
Assessment Limitations
Multiple-choice questions (MCQs) may not accurately assess the true medical knowledge and reasoning capabilities of LLMs.
Proposed Solutions: Develop and implement free-response question assessments that can more accurately gauge understanding; a sketch of one way such grading could be automated follows below.
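Free-response grading is difficult to scale by hand, and a common workaround is to have a strong LLM grade answers against a reference. The sketch below shows how that could look with the OpenAI chat-completions API. It is an assumption-laden illustration: the rubric prompt, the 0-10 scale, and the choice of gpt-4o-mini-2024-07-18 (the model named elsewhere in this summary) are not taken from the paper's grading protocol.

```python
# Hypothetical LLM-judge grading of a free-response medical answer.
# The prompt, scale, and model choice are assumptions, not the paper's method.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_free_response(question: str, reference: str, answer: str) -> float:
    """Ask an LLM to grade a free-text answer against a reference answer.

    Returns a score in [0, 1]. The rubric and 0-10 integer scale are
    illustrative choices only.
    """
    prompt = (
        "You are grading a medical exam answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Student answer: {answer}\n"
        "Reply with only an integer score from 0 (wrong) to 10 (fully correct)."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",  # the paper's grader may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Production code should validate the reply before parsing it.
    return int(response.choices[0].message.content.strip()) / 10.0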
Project Team
Shrutika Singh
Researcher
Anton Alyakin
Researcher
Daniel Alexander Alber
Researcher
Jaden Stryker
Researcher
Ai Phuong S Tong
Researcher
Karl Sangwon
Researcher
Nicolas Goff
Researcher
Mathew de la Paz
Researcher
Miguel Hernandez-Rovira
Researcher
Ki Yun Park
Researcher
Eric Claude Leuthardt
Researcher
Eric Karl Oermann
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI