
Large Language Models in Student Assessment: Comparing ChatGPT and Human Graders

Project Overview

The study explores the application of generative AI, particularly large language models such as GPT-4, to grading master's-level political science essays. It finds that GPT-4's mean scores align with those of human graders, but that the model grades conservatively, which results in low interrater reliability when compared with human evaluators. This points to a need to refine AI grading tools so that they adapt better to specific assessment criteria. The study also notes that AI could reduce educators' workload, while raising concerns about how AI-driven grading affects the educational experience, particularly teacher-student dynamics. Overall, the findings suggest that generative AI can be a valuable asset in education, provided its advantages are weighed carefully against these drawbacks.

Key Applications

AI-assisted grading and feedback generation using GPT-4.

Context: Higher education, specifically master's-level political science courses and English as a New Language (ENL) settings.

Implementation: GPT-4 was utilized to assess and provide feedback on student essays through grading prompts and AI-generated feedback mechanisms. This included grading a sample of essays and assessing the impact of AI-generated feedback compared to traditional human tutor feedback across different student populations.

Outcomes: GPT-4's grading aligned with human graders in mean scores but exhibited a risk-averse grading pattern and low interrater reliability. The studies indicated no significant difference in learning outcomes or preferences between AI-generated and human feedback.
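A small sketch may help show how matching mean scores can coexist with low interrater reliability. The grades below are invented for illustration and are not data from the study. Cohen's kappa is used as one common agreement statistic, which is an assumption: this summary does not say which reliability measure the paper reports.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical grades."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must grade the same non-empty set of essays.")
    n = len(rater_a)
    # Share of essays on which the two raters give the identical grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's grade distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    grades = set(rater_a) | set(rater_b)
    expected = sum(counts_a[g] * counts_b[g] for g in grades) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical grade throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical grades on a 1-5 scale (illustrative only, not from the paper).
human_grades = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
# A risk-averse grader that always picks the middle of the scale.
model_grades = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

human_mean = sum(human_grades) / len(human_grades)
model_mean = sum(model_grades) / len(model_grades)
kappa = cohen_kappa(human_grades, model_grades)

print(f"human mean={human_mean}, model mean={model_mean}, kappa={kappa}")
```

With these made-up grades both means are 3.0 while kappa is 0.0, meaning agreement is no better than chance. A grader that clusters around the middle of the scale can reproduce the average without tracking essay-level differences.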

Challenges: Key challenges included low interrater reliability, difficulty in adapting to nuanced grading criteria, limited knowledge among users about AI tools, and the necessity for specific training to implement AI feedback effectively.

Implementation Barriers

Technical Barrier

Low interrater reliability between AI grading and human grading.

Proposed Solutions: Further technological development is required to improve AI's adaptability and sensitivity to specific grading criteria.

Knowledge Barrier

Limited knowledge among educators on how to effectively implement AI tools.

Proposed Solutions: Providing specific training for educators to enhance their understanding and use of AI in educational settings.

Project Team

Magnus Lundgren

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Magnus Lundgren

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
