Comparing Human and AI Rater Effects Using the Many-Facet Rasch Model
Project Overview
The document examines the use of large language models (LLMs) in education, specifically their role in automating the scoring of writing assessments. It analyzes the scoring performance of several LLMs, including ChatGPT 4o, Claude 3.5, and Gemini 1.5 Pro, against human raters, finding that these models score with high accuracy and improved consistency. Significant challenges persist, however, including rater biases and the need for further research to establish whether the findings generalize across subject areas. While LLMs show potential to improve the efficiency and reliability of educational assessments, further investigation is required before they can be implemented effectively in varied educational contexts.
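The title refers to the many-facet Rasch model (MFRM). For orientation, a common three-facet rating-scale formulation (the exact parameterization used in the paper is an assumption here) is

\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k

where \theta_n is the ability of examinee n, \delta_i the difficulty of item i, \alpha_j the severity of rater j (human or LLM), and \tau_k the threshold between adjacent score categories k-1 and k. Estimating the \alpha_j parameters of human and AI raters on a common scale is what makes rater effects such as severity and leniency directly comparable.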
Key Applications
Automated scoring using LLMs
Context: Educational assessment for college students in Mandarin Chinese courses
Implementation: Ten LLMs were compared with human raters in scoring essays and constructed-response items.
Outcomes: High scoring accuracy, improved efficiency, and reduced costs; certain LLMs performed better in both holistic and analytic scoring (a sketch of the agreement statistics typically reported follows this list).
Challenges: Variability in scoring accuracy among LLMs; potential rater effects; limited generalizability of findings.
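A minimal sketch of how LLM scores are typically compared with human ratings in studies of this kind, using exact agreement, quadratic weighted kappa, and Pearson correlation. The score vectors and the choice of scikit-learn are illustrative assumptions, not the paper's data or code.

import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores for ten essays on a 1-5 rubric
human = np.array([3, 4, 2, 5, 3, 4, 1, 3, 4, 2])  # human rater
llm = np.array([3, 4, 3, 5, 3, 3, 1, 3, 4, 2])    # LLM rater, same essays

exact = np.mean(human == llm)                            # proportion of identical scores
qwk = cohen_kappa_score(human, llm, weights="quadratic") # quadratic weighted kappa
r = np.corrcoef(human, llm)[0, 1]                        # Pearson correlation

print(f"Exact agreement: {exact:.2f}")
print(f"QWK:             {qwk:.2f}")
print(f"Correlation:     {r:.2f}")

Quadratic weighted kappa is the usual headline statistic here because it penalizes large score discrepancies more heavily than small ones.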
Implementation Barriers
Technical Barrier
The development of feature-based or SLM-based automated scoring systems requires foundational skills in supervised machine learning, natural language processing (NLP), and deep learning, which limits accessibility for non-technical users. Emerging LLMs are more user-friendly, enabling non-technical users to apply the technology to scoring.
Validity Barrier
Scores assigned by AI raise validity concerns relative to human raters, particularly with respect to rater effects such as severity and leniency. Comprehensive training for LLMs and benchmark data generated from human expert ratings can help address these concerns; a rough diagnostic sketch follows.
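As an illustrative first check in this spirit (a crude proxy for the MFRM severity parameter above, not the paper's method), each rater's mean deviation from the panel consensus on the same responses can flag severity or leniency. All data below are hypothetical.

import numpy as np

# scores[r, e] = score given by rater r to essay e
# (hypothetical: two human raters and one LLM, five essays)
scores = np.array([
    [3, 4, 2, 5, 3],  # human rater 1
    [4, 4, 3, 5, 3],  # human rater 2
    [3, 5, 3, 5, 4],  # LLM rater
])

consensus = scores.mean(axis=0)                # per-essay mean across raters
deviation = (scores - consensus).mean(axis=1)  # positive = lenient, negative = severe

for name, d in zip(["human rater 1", "human rater 2", "LLM rater"], deviation):
    print(f"{name}: mean deviation {d:+.2f}")

A full MFRM analysis would instead estimate severity jointly with examinee ability and item difficulty, but this kind of descriptive check is a common first look at rater effects.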
Generalization Barrier
The findings may not generalize to content domains beyond the AP Chinese exam; the effectiveness of LLMs in other subjects and assessment types needs further exploration.
Project Team
Hong Jiao
Researcher
Dan Song
Researcher
Won-Chan Lee
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Hong Jiao, Dan Song, Won-Chan Lee
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI