
Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors

Project Overview

This project examines the application of generative AI, particularly large language models (LLMs) such as GPT-3.5-Turbo and GPT-4, to assessing tutor performance in addressing students' math errors. It highlights the potential of generative AI to provide real-time feedback, which can be especially beneficial for low-efficacy students by offering indirect guidance. The findings indicate that while these models competently evaluate certain aspects of tutor responses, they struggle to detect accurately when a student has made an error. Further research is needed to improve the models' capabilities and generalizability, for example by analyzing larger datasets and investigating additional tutoring skills. Overall, the results highlight the promise of generative AI for enhancing tutoring effectiveness and student support.

Key Applications

Using LLMs to assess tutor performance in reacting to student errors

Context: Online tutoring sessions with middle school students (grades 6-8) who struggle with math

Implementation: Evaluation of real-life tutoring dialogues using LLMs to assess tutor responses according to specified criteria.

Outcomes: LLMs demonstrate proficiency in assessing certain criteria (e.g., immediate and accurate responses) but have limitations in recognizing student errors.

Challenges: LLMs struggle to accurately identify instances of errors made by students, often misclassifying expressions of student uncertainty as errors.
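The assessment workflow described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the criteria wording, function names, and prompt structure are assumptions based on the summary, and the model call is shown only for context.

```python
# Sketch of using an LLM to grade a tutor's reaction to a student error.
# Hypothetical helper names and criteria wording; not the authors' code.

ASSESSMENT_CRITERIA = [
    "Does the tutor respond to the student's error promptly?",
    "Does the tutor correctly identify whether the student made an error?",
    "Does the tutor guide the student indirectly rather than giving the answer?",
]

def build_assessment_prompt(dialogue: str) -> str:
    """Assemble a grading prompt for a single tutoring dialogue."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(ASSESSMENT_CRITERIA))
    return (
        "You are grading a math tutor's response to a middle-school student.\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        "Answer yes or no, with a brief justification, for each criterion:\n"
        f"{criteria}\n"
    )

# To score a dialogue, the prompt would be sent to a chat model, e.g.:
#   from openai import OpenAI
#   client = OpenAI()
#   reply = client.chat.completions.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user",
#                  "content": build_assessment_prompt(dialogue)}],
#   )
```

In the study, each dialogue was evaluated against criteria like these, and model judgments were compared with human annotations to measure how reliably the LLM recognized student errors.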

Implementation Barriers

Technical Limitations

LLMs have difficulty accurately identifying when a student has made an error, with a tendency to misinterpret uncertainty as an error.

Proposed Solutions: Future work will involve improving prompt engineering and analyzing larger datasets to enhance model performance.

Cost

GPT-4 is significantly more expensive to use compared to GPT-3.5-Turbo, raising questions about the cost-benefit ratio of switching models.

Proposed Solutions: Focus on the cost-effective use of GPT-3.5-Turbo while enhancing prompts to improve accuracy.

Data Limitations

The dataset is small, limiting the generalizability of findings.

Proposed Solutions: Increase the number of dialogues analyzed to enhance the robustness of the study's conclusions.

Project Team

Sanjit Kakarla

Researcher

Danielle Thomas

Researcher

Jionghao Lin

Researcher

Shivang Gupta

Researcher

Kenneth R. Koedinger

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Sanjit Kakarla, Danielle Thomas, Jionghao Lin, Shivang Gupta, Kenneth R. Koedinger

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
