
TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students

Project Overview

TutorGym is a framework for evaluating how well AI agents, including large language models (LLMs) and reinforcement-learning agents, perform as tutors and as simulated students within Intelligent Tutoring Systems (ITS). By plugging agents into existing ITS environments, TutorGym supports rigorous testing in authentic educational contexts. Initial findings show that while LLMs can mimic human learning patterns convincingly, their current performance as tutoring agents is inadequate, particularly in delivering accurate feedback to learners. Despite these shortcomings, LLMs remain promising educational tools: with further development, they could enhance personalized learning experiences. Realizing that potential, however, involves significant technical challenges and costs. Overall, the findings suggest that generative AI has a substantial role to play in education, but ongoing research and refinement are needed to unlock its full capabilities in supporting teaching and learning.

Key Applications

TutorGym

Context: Testing AI agents in classroom environments with existing Intelligent Tutoring Systems.

Implementation: AI agents interact with TutorGym's interface, which integrates with existing ITS platforms, allowing the framework both to evaluate agents' tutoring capabilities and to simulate student learning.

Outcomes: Initial testing revealed that LLMs performed poorly in tutoring roles, achieving only 52–70% accuracy when providing next-step demonstrations.

Challenges: LLMs often generate incorrect outputs (hallucinations) and require significant computational resources, leading to high inference costs.
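The next-step accuracy measurement described above can be sketched as follows. This is a hypothetical illustration of the evaluation idea, not TutorGym's actual API: the function names, state representation, and toy agent are all assumptions.

```python
# Hypothetical sketch of next-step accuracy measurement: compare an
# agent's suggested next step against the ITS ground-truth step for
# each problem state. Names and data shapes are illustrative only.

def next_step_accuracy(problem_states, ground_truth_steps, suggest_step):
    """Fraction of states where the agent's suggested next step
    matches the ITS ground-truth step."""
    correct = 0
    for state, truth in zip(problem_states, ground_truth_steps):
        if suggest_step(state) == truth:
            correct += 1
    return correct / len(problem_states)

# Toy example: an "agent" that always suggests the first listed option.
states = [{"options": ["add", "subtract"]}, {"options": ["carry", "add"]}]
truths = ["add", "add"]
agent = lambda s: s["options"][0]
print(next_step_accuracy(states, truths, agent))  # 0.5
```

In a real evaluation the `suggest_step` callable would wrap an LLM prompt, and the ground-truth steps would come from the ITS's own solver.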

Implementation Barriers

Performance Barrier

LLMs show poor accuracy in grading student actions, failing to exceed chance levels when identifying incorrect actions.

Proposed Solutions: Future improvements in AI training methods and better integration of LLMs with traditional ITS designs may enhance performance.
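The "chance levels" claim above can be made concrete with a simple statistical check: if grading decisions are binary (correct vs. incorrect), an exact binomial test tells us whether an observed accuracy could plausibly arise from coin-flipping. The numbers below are illustrative, not figures from the paper.

```python
import math

# Hypothetical sketch: test whether a grader's accuracy on binary
# correct/incorrect labels exceeds chance (p = 0.5). The counts used
# in the example are illustrative assumptions.

def binom_p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the probability of getting
    k or more labels right by chance alone."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# e.g. 54 correct out of 100 binary grading decisions:
p_value = binom_p_at_least(54, 100)
print(p_value > 0.05)  # True: cannot reject chance-level performance
```

A grader at 54/100 is statistically indistinguishable from guessing, which is the sense in which "failing to exceed chance levels" is a meaningful negative result.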

Cost Barrier

The cost of LLM API usage can exceed $730 for evaluations, making frequent testing economically infeasible.

Proposed Solutions: Developing more efficient local models or using open-source LLMs could mitigate costs.
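How evaluation costs of this magnitude accumulate can be sketched with a back-of-the-envelope estimate: calls × tokens × per-token price. The call counts, token counts, and per-million-token rates below are illustrative assumptions, not figures from the paper or any provider's actual price list.

```python
# Back-of-the-envelope API cost estimate for an evaluation run.
# All numbers in the example are hypothetical.

def estimate_cost(n_calls, in_tokens, out_tokens,
                  price_in_per_m, price_out_per_m):
    """Total API cost in dollars for n_calls requests, given average
    input/output token counts and per-million-token prices."""
    per_call = (in_tokens * price_in_per_m +
                out_tokens * price_out_per_m) / 1_000_000
    return n_calls * per_call

# e.g. 10,000 grading calls at 1,500 input / 300 output tokens each,
# priced at $2.50 / $10.00 per million tokens (hypothetical rates):
print(estimate_cost(10_000, 1_500, 300, 2.50, 10.00))  # 67.5
```

Because total cost scales linearly with the number of calls, large-scale or repeated evaluations quickly reach hundreds of dollars, which is why cheaper local or open-source models are attractive.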

Project Team

Daniel Weitekamp

Researcher

Momin N. Siddiqui

Researcher

Christopher J. MacLellan

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Daniel Weitekamp, Momin N. Siddiqui, Christopher J. MacLellan

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
