Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors
Project Overview
This project examines the use of generative AI, particularly large language models (LLMs) such as GPT-3.5-Turbo and GPT-4, to assess how well tutors respond to students' math errors. It highlights the potential of generative AI to provide real-time feedback, which can be especially beneficial for low-efficacy students when delivered as indirect guidance. The findings indicate that while these models evaluate some aspects of tutor responses competently, they struggle to detect accurately whether a student has actually made an error. The document suggests that further research, such as analyzing larger datasets and investigating additional tutoring skills, is needed to improve the models' accuracy and generalizability. Overall, generative AI shows promise for enhancing tutoring effectiveness and student support.
Key Applications
Using LLMs to assess tutor performance in reacting to student errors
Context: Online tutoring sessions with middle school students (grades 6-8) who struggle with math
Implementation: Evaluation of real-life tutoring dialogues using LLMs to assess tutor responses according to specified criteria.
Outcomes: LLMs demonstrate proficiency in assessing certain criteria (e.g., immediate and accurate responses) but have limitations in recognizing student errors.
Challenges: LLMs struggle to accurately identify instances of student errors, frequently misclassifying expressions of student uncertainty as errors.
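The assessment setup described above can be sketched as a simple grade-and-parse pipeline. The rubric fields and prompt wording below are illustrative assumptions, not the authors' actual prompt; the mocked reply stands in for a real API call to GPT-3.5-Turbo or GPT-4.

```python
import json

# Hypothetical criteria loosely based on the rubric described in the summary:
# did the tutor respond promptly, and did they address the error accurately?
CRITERIA = [
    "responds_immediately",
    "addresses_error_accurately",
]

def build_assessment_prompt(dialogue: str) -> str:
    """Assemble a grading prompt for an LLM (wording here is illustrative)."""
    return (
        "You are grading a math tutor's reaction to a student error.\n"
        f"Dialogue:\n{dialogue}\n\n"
        "Answer in JSON with boolean fields: " + ", ".join(CRITERIA)
    )

def parse_assessment(llm_output: str) -> dict:
    """Parse the model's JSON reply into a criterion -> bool mapping."""
    scores = json.loads(llm_output)
    return {c: bool(scores.get(c, False)) for c in CRITERIA}

# Mocked model reply; in practice this string would come from the chat API.
reply = '{"responds_immediately": true, "addresses_error_accurately": false}'
print(parse_assessment(reply))
```

Keeping the reply constrained to a small JSON schema makes the per-criterion scores easy to aggregate across many dialogues.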
Implementation Barriers
Technical Limitations
LLMs have difficulty accurately identifying when a student has made an error, with a tendency to misinterpret uncertainty as an error.
Proposed Solutions: Future work will involve improving prompt engineering and analyzing larger datasets to enhance model performance.
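One hedged sketch of what such prompt engineering might look like: few-shot examples that explicitly separate genuine errors from mere uncertainty, targeting the misclassification noted above. The student utterances and labels below are invented for illustration.

```python
# Invented few-shot examples contrasting uncertainty with actual errors.
FEW_SHOT = '''\
Student: "I think the answer is 12?"  (uncertain, but 12 is correct)
Label: no_error

Student: "7 x 8 = 54."
Label: error

Student: "I'm not sure how to start."
Label: no_error
'''

def build_error_detection_prompt(utterance: str) -> str:
    """Prepend few-shot examples before asking for an error/no_error label."""
    return (
        "Decide whether the student's statement contains a math error.\n"
        "Uncertainty alone is NOT an error.\n\n"
        f"{FEW_SHOT}\n"
        f'Student: "{utterance}"\nLabel:'
    )

print(build_error_detection_prompt("Is the slope 2/3?"))
```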
Cost
GPT-4 is significantly more expensive to use than GPT-3.5-Turbo, raising questions about the cost-benefit ratio of switching models.
Proposed Solutions: Focus on the cost-effective use of GPT-3.5-Turbo while enhancing prompts to improve accuracy.
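The cost trade-off can be made concrete with back-of-the-envelope arithmetic. The per-token prices below are assumed, roughly 2023-era list prices for illustration only; consult current OpenAI pricing before relying on them.

```python
# Assumed input-side prices in USD per 1K tokens (illustrative, not current).
PRICE_PER_1K_INPUT = {"gpt-3.5-turbo": 0.0015, "gpt-4": 0.03}

def estimated_cost(model: str, input_tokens: int, n_dialogues: int) -> float:
    """Rough input-side cost of grading n dialogues of a given token length."""
    return PRICE_PER_1K_INPUT[model] * (input_tokens / 1000) * n_dialogues

# E.g., 100 dialogues of roughly 1,500 input tokens each:
for model in PRICE_PER_1K_INPUT:
    print(model, estimated_cost(model, 1500, 100))
```

At these assumed rates the GPT-4 run costs about 20x the GPT-3.5-Turbo run, which is why the project favors improving GPT-3.5-Turbo prompts over switching models outright.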
Data Limitations
The dataset is small, limiting the generalizability of findings.
Proposed Solutions: Increase the number of dialogues analyzed to enhance the robustness of the study's conclusions.
Project Team
Sanjit Kakarla
Researcher
Danielle Thomas
Researcher
Jionghao Lin
Researcher
Shivang Gupta
Researcher
Kenneth R. Koedinger
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Sanjit Kakarla, Danielle Thomas, Jionghao Lin, Shivang Gupta, Kenneth R. Koedinger
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI