Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors
Project Overview
The document examines the role of generative AI, specifically large language model (LLM)-powered AI tutors, in enhancing education, particularly in mathematics. It introduces a comprehensive evaluation taxonomy consisting of eight dimensions designed to assess the pedagogical effectiveness of AI tutors in educational dialogues. A key focus is the ability of these systems to identify and remediate student mistakes, which is crucial for effective learning. The study introduces MRBench, a benchmark for evaluating the performance of AI tutors, and summarizes findings that reveal both strengths and weaknesses of different LLMs in educational applications. Overall, the document underscores the potential of generative AI to improve educational outcomes through tailored tutoring while also highlighting areas needing further development to maximize the effectiveness of these tutors.
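The eight dimensions themselves are not enumerated in the overview above. The sketch below encodes them as a simple Python structure for reference: the dimension names follow the taxonomy described in the paper (mistake identification, mistake location, revealing of the answer, providing guidance, actionability, coherence, tutor tone, and human-likeness), while the annotation container and the label scale shown are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass, field

# Illustrative encoding of the paper's eight-dimension taxonomy.
# Dimension names follow the taxonomy described in the paper; the comments and
# the label scale are paraphrased assumptions and should be checked against
# the original publication.
TAXONOMY_DIMENSIONS = [
    "mistake_identification",   # Does the tutor recognize that the student made a mistake?
    "mistake_location",         # Does the tutor point to where the mistake is?
    "revealing_of_the_answer",  # Does the tutor give away the answer prematurely?
    "providing_guidance",       # Does the tutor offer a useful hint or explanation?
    "actionability",            # Is it clear what the student should do next?
    "coherence",                # Is the reply consistent with the dialogue so far?
    "tutor_tone",               # Is the tone encouraging and appropriate?
    "humanlikeness",            # Does the reply sound like a human tutor?
]

# Assumed annotation scale for each dimension.
LABELS = ("Yes", "To some extent", "No")

@dataclass
class TutorResponseAnnotation:
    """One annotated tutor response, labeled along each taxonomy dimension."""
    tutor_id: str
    response: str
    labels: dict = field(default_factory=dict)  # dimension name -> label from LABELS
```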
Key Applications
LLM-powered AI tutors for mistake remediation in mathematics
Context: Educational dialogues focused on student mistakes in mathematics, targeting middle school students.
Implementation: AI tutors were evaluated based on their ability to respond to student mistakes using a unified evaluation taxonomy; a minimal sketch of what such an evaluation record might look like follows this list.
Outcomes: The taxonomy aids in assessing the effectiveness of AI tutors in providing pedagogical support, revealing strengths and weaknesses of different LLMs.
Challenges: LLMs often fail to provide sufficient pedagogical support and can reveal answers too quickly, which diminishes their effectiveness as tutors.
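The sketch below illustrates how a single mistake-remediation exchange and its taxonomy annotations might be represented and aggregated across tutors. The record format, field names, and label values are hypothetical illustrations, not MRBench's actual schema.

```python
# Hypothetical, MRBench-style evaluation record for one mistake-remediation
# dialogue: the student's erroneous turn, one candidate tutor response, and
# its taxonomy annotations. All field names and values are illustrative.
example_record = {
    "conversation_history": [
        {"role": "tutor", "text": "What is 3/4 + 1/8?"},
        {"role": "student", "text": "3/4 + 1/8 = 4/12."},  # adds numerators and denominators
    ],
    "tutor_responses": {
        "tutor_model_A": {
            "response": "Close! Before adding, can you rewrite 3/4 with a denominator of 8?",
            "annotations": {
                "mistake_identification": "Yes",
                "revealing_of_the_answer": "No",
                "providing_guidance": "Yes",
                "actionability": "Yes",
            },
        },
    },
}

def share_of_desired_labels(records, dimension, desired="Yes"):
    """Fraction of annotated tutor responses carrying the desired label on one dimension."""
    labels = [
        resp["annotations"][dimension]
        for rec in records
        for resp in rec["tutor_responses"].values()
        if dimension in resp["annotations"]
    ]
    return sum(label == desired for label in labels) / len(labels) if labels else 0.0

print(share_of_desired_labels([example_record], "providing_guidance"))        # -> 1.0
print(share_of_desired_labels([example_record], "revealing_of_the_answer", "No"))  # -> 1.0
```

Note that the desirable label depends on the dimension: for "revealing_of_the_answer", not revealing the answer is the preferred behavior.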
Implementation Barriers
Technical
Existing evaluation metrics for AI tutors do not adequately capture pedagogical values and require ground truth references that are often unavailable.
Proposed Solutions: Development of a unified evaluation taxonomy that aligns with learning sciences principles and enhances the evaluation of AI tutors.
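As a rough illustration of why a taxonomy-aligned approach sidesteps the need for ground-truth references, the sketch below assembles a rubric-style judging prompt for a single dimension: the judge (human or LLM) labels the tutor's reply directly against a pedagogical criterion rather than comparing it with a gold-standard response. The rubric wording, label set, and helper function are assumptions for illustration only.

```python
# Minimal sketch of a reference-free, taxonomy-aligned judging prompt.
# No gold-standard tutor reply is required; the judge labels the response
# against a pedagogical rubric. Wording and labels are illustrative.
LABELS = ("Yes", "To some extent", "No")

def build_judge_prompt(dialogue: str, tutor_response: str, dimension: str, question: str) -> str:
    return (
        "You are evaluating an AI tutor's reply to a student's mistake.\n"
        f"Dialogue so far:\n{dialogue}\n\n"
        f"Tutor reply:\n{tutor_response}\n\n"
        f"Dimension: {dimension}\n"
        f"Question: {question}\n"
        f"Answer with one of: {', '.join(LABELS)}."
    )

prompt = build_judge_prompt(
    dialogue="Student: 3/4 + 1/8 = 4/12.",
    tutor_response="Before adding, can you rewrite 3/4 with a denominator of 8?",
    dimension="revealing_of_the_answer",
    question="Does the reply give away the final answer prematurely?",
)
print(prompt)
```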
Pedagogical
LLMs struggle with understanding complex pedagogical concepts, leading to unreliable evaluations of their tutoring capabilities.
Proposed Solutions: Research and training on pedagogically rich datasets to better align LLMs with human tutoring values.
Project Team
Kaushal Kumar Maurya
Researcher
KV Aditya Srivatsa
Researcher
Kseniia Petukhova
Researcher
Ekaterina Kochmar
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, Ekaterina Kochmar
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI