
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

Project Overview

This project evaluates the role of large language models (LLMs) in medical education, particularly their performance on multiple-choice questions (MCQs) versus free-response questions. It identifies the limitations of MCQs for assessing the medical knowledge and reasoning of LLMs, indicating that existing benchmarks may inflate their apparent capabilities. To address this, the authors developed FreeMedQA, a benchmark that pairs each MCQ with a free-response counterpart so that LLM performance can be compared across the two formats. The findings reveal a significant decline in LLM performance on the more demanding free-response questions, underscoring the need for evaluation methods in medical education that prioritize genuine comprehension over pattern recognition. The study carries implications for integrating generative AI into educational assessment: while AI can assist learning, relying solely on traditional multiple-choice formats may not adequately measure true understanding or skill mastery.

Key Applications

FreeMedQA - a benchmark of paired free-response and multiple-choice questions

Context: Medical education for medical students and AI models in clinical settings

Implementation: Developed a new benchmark for evaluating LLMs' performance on medical questions, comparing MCQs and free-response formats.

Outcomes: Average performance drop of 39.43% for LLMs when transitioning from MCQs to free-response questions, suggesting that existing MCQ benchmarks may overestimate LLM capabilities.

Challenges: Current MCQ benchmarks do not accurately reflect LLM understanding; performance drops significantly in free-response contexts.
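The paired-format comparison above can be sketched in a few lines. Everything below is a hypothetical illustration, not the benchmark's actual grading pipeline: the toy items, the exact-letter MCQ scoring, and the keyword-containment stand-in for free-response grading are all assumptions (a real benchmark would likely use model- or rubric-based grading of free text).

```python
# Illustrative sketch of scoring paired MCQ / free-response items and
# computing the relative performance drop. All data and grading rules
# are hypothetical stand-ins, not the FreeMedQA methodology.

def mcq_score(predictions, answer_key):
    """Fraction of MCQ items where the chosen letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

def free_response_score(predictions, keywords):
    """Fraction of free-response answers containing the key concept.
    (A crude stand-in for the graded free-text evaluation a real
    benchmark would use.)"""
    correct = sum(k.lower() in p.lower() for p, k in zip(predictions, keywords))
    return correct / len(keywords)

def relative_drop(mcq, free):
    """Relative performance drop (%) from MCQ to free-response format."""
    return (mcq - free) / mcq * 100

# Toy paired items: same question stems, two response formats.
mcq_preds   = ["B", "C", "A", "D", "B"]
mcq_key     = ["B", "C", "A", "A", "B"]
fr_preds    = ["likely bacterial meningitis", "unsure",
               "aspirin overdose", "viral cause", "asthma"]
fr_keywords = ["meningitis", "pneumonia", "aspirin", "stroke", "asthma"]

mcq = mcq_score(mcq_preds, mcq_key)
free = free_response_score(fr_preds, fr_keywords)
print(f"MCQ accuracy: {mcq:.0%}, free-response: {free:.0%}, "
      f"relative drop: {relative_drop(mcq, free):.1f}%")
```

On this toy data the model scores 80% on MCQs but only 60% on the paired free-response items, a 25% relative drop; the paper's reported 39.43% average drop is the same calculation applied at benchmark scale.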

Implementation Barriers

Assessment Limitations

Multiple-choice questions (MCQs) may not accurately assess the true medical knowledge and reasoning capabilities of LLMs.

Proposed Solutions: Develop and implement free-response question assessments that can more accurately gauge understanding.

Project Team

Shrutika Singh

Researcher

Anton Alyakin

Researcher

Daniel Alexander Alber

Researcher

Jaden Stryker

Researcher

Ai Phuong S Tong

Researcher

Karl Sangwon

Researcher

Nicolas Goff

Researcher

Mathew de la Paz

Researcher

Miguel Hernandez-Rovira

Researcher

Ki Yun Park

Researcher

Eric Claude Leuthardt

Researcher

Eric Karl Oermann

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
