NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
Project Overview
The document examines the application of generative AI in education through a system built for the BEA 2025 Shared Task on mistake identification in AI tutoring. The system uses machine learning and large language models (LLMs) to assess how well AI tutors recognize errors in students' mathematical reasoning. A key finding is that the most successful strategy combines retrieval-augmented prompting with an LLM judge, improving both the accuracy of mistake identification and the quality of the pedagogical feedback offered to students. The system's effectiveness is tempered, however, by limited diversity in retrieved examples, the absence of multi-turn dialogue modeling, and scalability and cost constraints. Overall, the document highlights the potential of generative AI to improve educational outcomes while acknowledging the obstacles that must be overcome for broader deployment.
Key Applications
Retrieval-Augmented Few-Shot Classification with LLM-as-a-Judge
Context: Educational context focusing on mistake identification in AI-powered math tutoring systems, targeting both students and educators.
Implementation: The approach uses a modular pipeline that retrieves semantically similar examples from a database and prompts a large language model (GPT-4o) to assess mistake identification in tutor responses.
Outcomes: Achieved best performance in evaluating mistake identification, with improved accuracy and nuanced pedagogical feedback.
Challenges: Limited diversity in retrieved examples, lack of multi-turn dialogue context, simplified output format, scalability and cost constraints.
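The pipeline described above can be sketched in a few lines: retrieve the labelled examples most similar to the tutor response being judged, then assemble them into a few-shot prompt for an LLM judge. This is a minimal illustration, not the paper's implementation; token-overlap (Jaccard) similarity stands in for the system's semantic retrieval, and the example pool, prompt wording, and label set shown here are assumptions.

```python
# Hypothetical labelled pool of (tutor_response, label) pairs. In the real
# system, examples would come from the shared-task training data.
EXAMPLE_POOL = [
    ("The tutor points out the sign error in step 2.", "Yes"),
    ("The tutor praises the answer without checking the arithmetic.", "No"),
    ("The tutor hints that something is off but names no step.", "To some extent"),
]

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a cheap stand-in for semantic embeddings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_examples(query: str, pool: list, k: int = 2) -> list:
    """Return the k pool entries most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, examples: list) -> str:
    """Assemble a few-shot prompt for the LLM-as-a-judge step."""
    shots = "\n\n".join(
        f"Tutor response: {t}\nMistake identified: {l}" for t, l in examples
    )
    return (
        "Judge whether the tutor response identifies the student's mistake.\n"
        "Answer Yes, To some extent, or No.\n\n"
        f"{shots}\n\nTutor response: {query}\nMistake identified:"
    )
```

The resulting prompt string would then be sent to the judge model (GPT-4o in the paper); the API call is omitted here to keep the sketch self-contained.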
Implementation Barriers
Technical
Limited diversity in retrieved examples may hinder performance on out-of-distribution dialogues, and lack of multi-turn dialogue modeling restricts the system's ability to track learning progression.
Proposed Solutions: Exploring more adaptive example selection and improving the example pool's coverage, as well as incorporating dialogue state tracking or memory-based retrieval for improved context handling.
Technical
Simplified output format restricts the model to a single label selection, missing nuances.
Proposed Solutions: Extending the output to include rationales or confidence scores for more informative evaluations.
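One way to realize the richer output proposed above is to have the judge emit a structured judgment rather than a bare label. The schema below is a hypothetical sketch of such a format, not one taken from the paper; the field names and label set are assumptions.

```python
import json
from dataclasses import dataclass

VALID_LABELS = {"Yes", "To some extent", "No"}

@dataclass
class Judgment:
    label: str          # one of VALID_LABELS
    rationale: str      # short explanation of the decision
    confidence: float   # model-reported confidence in [0, 1]

def parse_judgment(raw: str) -> Judgment:
    """Parse and validate a JSON judgment emitted by the LLM judge."""
    data = json.loads(raw)
    label = data["label"]
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    conf = float(data["confidence"])
    if not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence out of range: {conf}")
    return Judgment(label=label, rationale=data["rationale"], confidence=conf)
```

Validating the label and confidence at parse time keeps malformed model outputs from silently entering downstream evaluation.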
Operational
Scalability and cost constraints due to computational intensity of using LLMs like GPT-4o.
Proposed Solutions: Investigating more efficient models or methods that reduce dependency on high-cost APIs.
Project Team
Numaan Naeem
Researcher
Sarfraz Ahmad
Researcher
Momina Ahsan
Researcher
Hasan Iqbal
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI