
Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

Project Overview

This project examines the role of generative AI, specifically large language model (LLM)-powered AI tutors, in education, with a focus on mathematics. It introduces a comprehensive evaluation taxonomy of eight dimensions for assessing the pedagogical effectiveness of AI tutors in educational dialogues. A key focus is the ability of these systems to identify and remediate student mistakes, which is crucial for effective learning. The study also introduces MRBench, a benchmark for evaluating AI tutor performance, and summarizes findings that reveal both strengths and weaknesses of different LLMs in educational applications. Overall, the work underscores the potential of generative AI to improve educational outcomes through tailored tutoring, while highlighting areas that need further development before that potential is fully realized.

Key Applications

LLM-powered AI tutors for mistake remediation in mathematics

Context: Educational dialogues focused on student mistakes in mathematics, targeting middle school students.

Implementation: AI tutors were evaluated based on their ability to respond to student mistakes using a unified evaluation taxonomy.

Outcomes: The taxonomy aids in assessing the effectiveness of AI tutors in providing pedagogical support, revealing strengths and weaknesses of different LLMs.

Challenges: LLMs often fail to provide sufficient pedagogical support and can reveal answers too quickly, which diminishes their effectiveness as tutors.
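An evaluation of this kind typically aggregates per-dimension human (or LLM-judge) ratings of tutor responses. The following is a minimal sketch of such an aggregation; the dimension names and the 0–2 rating scale are illustrative assumptions, not the taxonomy's exact labels or scheme.

```python
from statistics import mean

# Hypothetical dimension names, for illustration only.
DIMENSIONS = ["mistake_identification", "guidance", "answer_withheld"]

def dimension_means(annotations):
    """annotations: one {dimension: rating} dict per annotated tutor response,
    with ratings on an assumed 0-2 scale (0 = poor, 2 = good).
    Returns the mean rating per dimension, a simple per-tutor summary."""
    return {d: mean(a[d] for a in annotations) for d in DIMENSIONS}

# Two annotated responses from one hypothetical LLM tutor.
ratings = [
    {"mistake_identification": 2, "guidance": 1, "answer_withheld": 0},
    {"mistake_identification": 2, "guidance": 2, "answer_withheld": 2},
]
summary = dimension_means(ratings)
```

Comparing such summaries across tutors is what surfaces patterns like the one noted above: a tutor can score well on mistake identification yet poorly on withholding the answer.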

Implementation Barriers

Technical

Existing evaluation metrics for AI tutors do not adequately capture pedagogical values and require ground truth references that are often unavailable.

Proposed Solutions: Development of a unified evaluation taxonomy that aligns with learning sciences principles and enhances the evaluation of AI tutors.

Pedagogical

LLMs struggle with understanding complex pedagogical concepts, leading to unreliable evaluations of their tutoring capabilities.

Proposed Solutions: Research and training on pedagogically rich datasets to better align LLMs with human tutoring values.

Project Team

Kaushal Kumar Maurya

Researcher

KV Aditya Srivatsa

Researcher

Kseniia Petukhova

Researcher

Ekaterina Kochmar

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, Ekaterina Kochmar

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
