
Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings

Project Overview

The project examines the role of generative AI, particularly GPT-4, in enhancing educational assessment in higher education, focusing on written tasks in macroeconomics. The findings indicate that GPT-4 exhibits high interrater reliability and effectively differentiates between content quality and stylistic elements when evaluating student work, positioning AI as a promising tool for improving learning outcomes and delivering timely feedback. However, the study also points out significant challenges concerning the reliability, transparency, and interpretability of AI-generated evaluations, and concludes that further research is needed to fully leverage AI's advantages in the educational sector. Addressing these challenges is crucial for the effective integration of generative AI into assessment and feedback practices.

Key Applications

GPT-4 for Automated Writing Evaluation

Context: Higher Education, specifically for macroeconomics courses

Implementation: GPT-4 was employed to assess student responses to macroeconomic questions, using a systematic prompt framework to ensure consistency in feedback across multiple iterations.
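The framework described above can be pictured as a fixed rubric prompt that is filled per student answer and submitted over several identical iterations, so that the resulting scores can later be checked for consistency. The sketch below is a hypothetical illustration, not the authors' actual prompts or scales: the rubric text, the 0-10 scales, and the `rate` callable (which a real implementation would back with a GPT-4 API wrapper) are all assumptions.

```python
# Hypothetical sketch of a systematic prompt framework for repeated AI rating.
# The rubric wording and scales below are illustrative, not the study's own.

RUBRIC_TEMPLATE = (
    "You are grading a macroeconomics answer.\n"
    "Rate content quality and style separately on a 0-10 scale.\n"
    "Question: {question}\n"
    "Student answer: {answer}\n"
    "Respond as 'content=<int> style=<int>'."
)

def build_prompt(question: str, answer: str) -> str:
    """Fill the fixed rubric template so every iteration sees an identical prompt."""
    return RUBRIC_TEMPLATE.format(question=question, answer=answer)

def collect_ratings(rate, question, answer, iterations=5):
    """Call a rating function repeatedly with the identical prompt and
    collect the scores; consistency across iterations is measured afterwards."""
    prompt = build_prompt(question, answer)
    return [rate(prompt) for _ in range(iterations)]

# Stand-in rater for demonstration; a real implementation would call the model.
ratings = collect_ratings(
    lambda prompt: {"content": 8, "style": 7},
    "Explain crowding out.",
    "Higher government deficits can raise interest rates...",
    iterations=3,
)
```

Holding the prompt fixed across iterations is what makes the later reliability analysis meaningful: any variation in the collected scores is then attributable to the model, not to the input.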

Outcomes: High interrater reliability with ICC scores between 0.94 and 0.99; immediate and targeted feedback on content and style, enhancing the learning experience.
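Intraclass correlation for a matrix of repeated ratings can be computed with a short routine. This is a minimal sketch of the ICC(2,1) formula (two-way random effects, single rater), assuming rows are student answers and columns are rating iterations; the paper's exact ICC variant and data are not reproduced here.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random-effects, single-rater intraclass correlation
    for an n_subjects x n_raters matrix of scores."""
    n = len(ratings)       # subjects (e.g. student answers)
    k = len(ratings[0])    # raters (e.g. rating iterations)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Partition total variability into subject, rater, and residual parts.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((ratings[i][j] - grand) ** 2
                   for i in range(n) for j in range(k))
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Perfect agreement across three iterations yields an ICC of 1.0.
perfect = icc2_1([[4, 4, 4], [7, 7, 7], [9, 9, 9]])
# Small disagreements push the ICC below 1 while agreement stays high.
noisy = icc2_1([[4, 5, 4], [7, 7, 8], [9, 8, 9]])
```

An ICC near the reported 0.94-0.99 range means nearly all score variance stems from genuine differences between answers rather than from iteration-to-iteration noise.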

Challenges: Variability in consistency across different question complexities; potential for AI to make errors and the 'black box' problem limiting transparency.

Implementation Barriers

Technical

AI models may produce inconsistent ratings or outright errors, and users' trust in AI grading diminishes when they perceive such inaccuracies, undermining confidence in the assessment process.

Proposed Solutions: Implement rigorous testing and evaluation protocols to assess the consistency and reliability of AI outputs. Enhance transparency and explainability of AI models to build user confidence in AI-assisted evaluations.

Operational

Difficulty in adapting AI models to various educational contexts and ensuring they meet diverse learning needs.

Proposed Solutions: Further research into tailoring AI feedback mechanisms to specific educational scenarios and student needs.

Project Team

Veronika Hackl

Researcher

Alexandra Elena Müller

Researcher

Michael Granitzer

Researcher

Maximilian Sailer

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, Maximilian Sailer

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
