
Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues

Project Overview

This project explores the potential of generative AI, particularly GPT-4, to improve educational practice by providing constructive feedback to tutors. Because effective feedback plays a critical role in tutoring, the study evaluates GPT-4's ability to assess the quality of praise that tutors give students, benchmarking its performance against human evaluators. The findings show that GPT-4 is good at recognizing specific and immediate praise but struggles with more nuanced qualities such as sincerity. The study also examines prompt engineering techniques for refining AI evaluations, underscoring the need to tailor AI responses to educational contexts. Overall, the work illustrates both the capabilities and the limitations of generative AI in supporting educational feedback, and it argues for continued exploration and refinement of AI tools to improve learning outcomes.

Key Applications

Using GPT-4 to provide feedback to tutors on their praise to students

Context: Tutoring settings, targeting both tutors and students

Implementation: GPT-4 generated synthetic tutor-student dialogues, and its feedback on the tutors' praise in those dialogues was then assessed against criteria for effective praise (a sketch of such a grading call follows this list).

Outcomes: GPT-4 identified elements of effective praise with moderate accuracy; it performed well on immediate and specific praise but struggled with sincerity.

Challenges: Limited accuracy in identifying sincere praise, and difficulty with the more nuanced aspects of evaluating tutor feedback.
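To make the implementation concrete, below is a minimal sketch of how such a grading call might look using the OpenAI Python client. The rubric wording, the example dialogue, and the model choice are illustrative assumptions, not the authors' actual prompts or pipeline.

```python
# Hypothetical sketch of grading a tutor's praise against effective-praise criteria.
# The rubric text and dialogue below are invented for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are grading the praise a tutor gives a student. "
    "For each criterion, answer yes or no with a one-sentence justification:\n"
    "1. Specific: does the praise name what the student did well?\n"
    "2. Immediate: does the praise follow directly after the student's action?\n"
    "3. Sincere: does the praise read as genuine rather than formulaic?"
)

dialogue = (
    "Student: I factored the quadratic into (x + 2)(x + 3).\n"
    "Tutor: Great job factoring that quadratic correctly on your first try!"
)

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # deterministic output makes agreement with human graders easier to measure
    messages=[
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": dialogue},
    ],
)
print(response.choices[0].message.content)
```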

Implementation Barriers

Research Limitation

The lack of real-life tutor-student dialogues limits the applicability of the findings.

Proposed Solutions: Incorporate real-life dialogues in future studies and increase the volume of chat logs used.

Sample Size Limitation

The small number of synthetic dialogues may limit the generalizability of the findings.

Proposed Solutions: Increase the number of synthetic dialogues to improve reliability and robustness (a generation sketch follows below).
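One way to act on this proposed solution is to script the dialogue generation so the volume can be scaled up easily. The sketch below is a hypothetical illustration; the generation prompt, temperature, and sample size are assumptions, not the study's actual setup.

```python
# Hypothetical sketch of scaling up synthetic tutor-student dialogue generation.
from openai import OpenAI

client = OpenAI()

GENERATION_PROMPT = (
    "Write a short tutoring dialogue (4-6 turns) in which a student works on "
    "a math problem and the tutor responds with praise. Vary the topic, the "
    "student's level, and the quality of the tutor's praise."
)

def generate_dialogues(n: int) -> list[str]:
    """Request n independent synthetic dialogues from the model."""
    dialogues = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=1.0,  # higher temperature encourages varied dialogues
            messages=[{"role": "user", "content": GENERATION_PROMPT}],
        )
        dialogues.append(response.choices[0].message.content)
    return dialogues

sample = generate_dialogues(30)  # e.g., 30 dialogues rather than a handful
```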

Prompt Limitations

The few-shot prompts were too simple and lacked variety in their examples.

Proposed Solutions: Integrate a wider range of nuanced examples into the prompts to enhance AI performance (see the sketch below).
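To illustrate what a more varied few-shot prompt might look like, the sketch below pairs contrasting labeled examples for the hardest criterion, sincerity. The examples and labels are invented for illustration and are not drawn from the paper.

```python
# Hypothetical few-shot message list with contrasting examples of sincere
# and insincere praise; all examples and labels are invented.
few_shot_messages = [
    {"role": "system",
     "content": "Decide whether the tutor's praise is sincere. Answer yes or no."},
    # Example 1: specific, personal praise -> sincere
    {"role": "user",
     "content": "Tutor: I can tell you worked hard on that proof; your induction step was especially clear."},
    {"role": "assistant", "content": "yes"},
    # Example 2: generic, formulaic praise -> not sincere
    {"role": "user", "content": "Tutor: Good job. Good job. Good job."},
    {"role": "assistant", "content": "no"},
    # Example 3: effort-focused praise after a mistake -> sincere
    {"role": "user",
     "content": "Tutor: You didn't get it this time, but I really appreciate how you kept trying new approaches."},
    {"role": "assistant", "content": "yes"},
]
# Append the dialogue to be graded as a final user message, then pass
# few_shot_messages to the same chat-completions call sketched earlier.
```

Pairing positive and negative examples of the same criterion gives the model an explicit decision boundary, which is one plausible way to address the "too simple" prompts noted above.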

Project Team

Dollaya Hirunyasiri

Researcher

Danielle R. Thomas

Researcher

Jionghao Lin

Researcher

Kenneth R. Koedinger

Researcher

Vincent Aleven

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Dollaya Hirunyasiri, Danielle R. Thomas, Jionghao Lin, Kenneth R. Koedinger, Vincent Aleven

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
