
Comparing Human and AI Rater Effects Using the Many-Facet Rasch Model

Project Overview

The paper examines the use of large language models (LLMs) in education, specifically their role in automating the scoring of writing assessments. It compares the scoring performance of several LLMs, including ChatGPT 4o, Claude 3.5, and Gemini 1.5 Pro, against human raters, finding that these models achieve high accuracy and improved consistency in scoring. Significant challenges persist, however, including rater biases and the open question of whether these findings generalize across subject areas. Overall, LLMs show potential for improving the efficiency and reliability of educational assessments, but further investigation is needed before they can be implemented effectively in varied educational contexts.
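For context, the Many-Facet Rasch Model (MFRM) named in the title is commonly written, following Linacre's rating scale formulation, as a logit-linear model in which examinee ability, item difficulty, and rater severity jointly determine the probability of each rating category. The notation below is the standard textbook form, not reproduced from the paper itself.

```latex
% Standard MFRM (rating scale form), after Linacre (1989); notation is
% generic, not taken from the paper under discussion.
\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \delta_i - \alpha_j - \tau_k
\]
% theta_n : ability of examinee n
% delta_i : difficulty of item or task i
% alpha_j : severity of rater j (human or LLM)
% tau_k   : threshold for moving from category k-1 to category k
```

Because each rater, human or LLM, receives its own severity parameter, the model places rater effects for both kinds of raters on a common scale.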

Key Applications

Automated scoring using LLMs

Context: Educational assessment for college students in Mandarin Chinese courses

Implementation: Ten LLMs were compared to human raters in scoring essays and constructed-response items (a minimal scoring sketch follows this list).

Outcomes: High scoring accuracy and improved efficiency in scoring with reduced costs; specific LLMs showed better performance in both holistic and analytic scoring.

Challenges: Variability in scoring accuracy among LLMs; potential rater effects; limited generalizability of findings.
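As an illustration only (the paper's actual scoring pipeline is not reproduced here), a holistic scoring call to an LLM might look like the sketch below. The model name, rubric wording, and prompt are assumptions for demonstration, not the study's materials.

```python
# Minimal sketch of LLM-based holistic essay scoring.
# Assumptions: the OpenAI Python SDK (>=1.0) is installed, OPENAI_API_KEY
# is set, and the rubric/prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the essay holistically from 1 (lowest) to 6 (highest),
considering task completion, coherence, and language control."""

def score_essay(essay_text: str, model: str = "gpt-4o-mini") -> int:
    """Ask the model for a single integer score for one essay."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation in scores
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Essay:\n{essay_text}\n\nReply with the score only."},
        ],
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    print(score_essay("My summer vacation was ..."))
```

In a study like this one, the same loop would be run once per LLM and the resulting score vectors compared against the human ratings.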

Implementation Barriers

Technical Barrier

Developing feature-based or SLM-based automated scoring systems requires foundational skills in supervised machine learning, NLP, and deep learning, which puts such systems out of reach for non-technical users. Emerging LLMs, by contrast, are becoming user-friendly enough for non-technical users to apply them to scoring.

Validity Barrier

Concerns remain about the validity of scores assigned by AI compared to human raters, particularly with respect to rater effects. Implementing comprehensive training for LLMs and using human expert ratings to generate benchmark data can help address these concerns.
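One common way to carry out such benchmarking is to compare LLM scores with human expert scores using an agreement statistic such as quadratic weighted kappa. The sketch below uses scikit-learn and invented ratings purely for illustration; the rating vectors are not data from the study.

```python
# Sketch: comparing LLM scores with human benchmark ratings.
# Assumption: scikit-learn is installed; the ratings are invented examples.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 4, 2, 5, 4, 3, 1, 4]   # human expert ratings
llm_scores   = [3, 4, 3, 5, 4, 2, 1, 4]   # ratings from one LLM

# Quadratic weights penalize large disagreements more than small ones,
# which suits ordinal rating scales.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```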

Generalization Barrier

The findings may not generalize across different content domains beyond the AP Chinese exam. Further exploration of LLM effectiveness in various subjects and assessment types is needed.

Project Team

Hong Jiao, Researcher

Dan Song, Researcher

Won-Chan Lee, Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Hong Jiao, Dan Song, Won-Chan Lee

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
