
MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors

Project Overview

The document explores the application of generative AI in education through the MSA-MathEval system, which evaluates AI tutor responses along four instructional dimensions: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. The system fine-tunes an instruction-tuned large language model within a single unified training pipeline, so one model handles every dimension, and a disagreement-aware ensemble inference strategy further improves the reliability of its predictions. This approach achieved strong performance across all four tracks, highlighting the potential of instruction-tuned LLMs to recognize student errors, offer tailored guidance, and suggest actionable next steps. Overall, the document underscores the role such evaluation systems can play in building more effective intelligent tutoring and supporting personalized learning.

Key Applications

MSA-MathEval

Context: Evaluating AI tutor responses to students in mathematics education.

Implementation: Utilizes a unified training pipeline fine-tuning the Mathstral-7B-v0.1 model with Low-Rank Adaptation (LoRA) for instruction-tuning without task-specific architecture changes.
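The core LoRA idea behind this pipeline, freezing the base weights and training only a low-rank additive update, can be sketched in a few lines. The class below is a minimal illustrative sketch in NumPy, not the authors' actual training code (which fine-tunes Mathstral-7B-v0.1, typically via a library such as Hugging Face PEFT); the class name, rank, and scaling hyperparameters are assumptions.

```python
import numpy as np

class LoRALinear:
    """Hypothetical sketch of a LoRA-adapted linear layer.

    The frozen base weight W is augmented with a low-rank update
    (alpha / r) * B @ A, so only the small matrices A and B are trained.
    """
    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01               # trainable
        self.B = np.zeros((d_out, r))                                # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B is zero-initialized, the adapted layer starts out identical to the frozen base layer, which is what makes LoRA a safe drop-in for instruction tuning without task-specific architecture changes.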

Outcomes: Achieved top-tier performance across instructional dimensions, ranking 1st in Providing Guidance and consistently within the top 5 in other tracks.

Challenges: Over-specialization of the model to mathematical reasoning, increased inference cost due to the ensemble strategy, and limited evaluation granularity.

Implementation Barriers

Technical Barrier

The specialization of the Mathstral-7B-v0.1 model to mathematical reasoning may hinder its generalization to non-mathematical domains.

Proposed Solutions: Future work will explore cross-domain generalization and dynamic calibration strategies to enhance robustness.

Operational Barrier

The ensemble disagreement strategy increases inference cost, and its benefit diminishes if the base models produce correlated predictions.

Proposed Solutions: Ensemble strategies should be designed to ensure model predictions are diverse and independent.
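The disagreement-aware idea can be sketched as a vote over multiple model predictions in which low agreement triggers a fallback rather than trusting a weak plurality. The function below is a hypothetical illustration, not the paper's exact rule; the agreement threshold and fallback label are assumptions.

```python
from collections import Counter

def ensemble_predict(predictions, min_agreement=0.6, fallback="Yes"):
    """Hypothetical sketch of a disagreement-aware ensemble vote.

    `predictions` holds one label per base model for the same input.
    If the plurality label's share of the votes falls below
    `min_agreement`, the models disagree too much and a calibrated
    fallback label is returned instead.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(predictions)
    if agreement < min_agreement:
        # High disagreement: defer to the fallback label.
        return fallback, agreement
    return label, agreement
```

A scheme like this only helps when the base models err somewhat independently, which is exactly why correlated predictions undermine it.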

Evaluation Barrier

Macro-averaged F1, the primary evaluation metric, weights all classes equally and therefore cannot penalize pedagogically critical mistakes more heavily than benign ones.

Proposed Solutions: Implement more nuanced evaluation metrics that can better capture instructional severity of errors.
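The granularity limitation is visible in how macro-F1 is computed: per-class F1 scores are simply averaged, so every class contributes equally regardless of its instructional severity. The function below is an illustrative re-implementation of the standard metric (shared tasks typically use library scorers such as scikit-learn's `f1_score`), not code from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Each class counts equally in the average, so a miss on a
    pedagogically critical label costs no more than a miss on a
    benign one.
    """
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

A severity-aware alternative would weight each class's F1 (or each error) by its instructional cost before averaging, which is the kind of nuance the proposed solution calls for.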

Project Team

Baraa Hikal

Researcher

Mohamed Basem

Researcher

Islam Oshallah

Researcher

Ali Hamdi

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Baraa Hikal, Mohamed Basem, Islam Oshallah, Ali Hamdi

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
