Using Large Language Models for Automated Grading of Student Writing about Science
Project Overview
This document explores the use of generative AI, particularly GPT-4, for automated grading of student writing in science-focused Massive Open Online Courses (MOOCs). It highlights the capacity of Large Language Models (LLMs) to achieve grading reliability comparable to traditional methods, including peer evaluation and instructor assessment. The findings indicate that when LLMs are given suitable prompts, such as instructor model answers and grading rubrics, they generate grades that align closely with those of human graders. The document also examines the challenges of grading open-ended responses and the limited dependability of peer assessment, concluding that while LLMs show significant promise in educational settings, these issues require careful attention. Overall, the research underscores the potential of generative AI to improve grading efficiency and consistency in education.
Key Applications
Automated grading of writing assignments using GPT-4
Context: Massive Open Online Courses (MOOCs) for adult learners with no prior science background, covering topics such as astronomy, astrobiology, and the history and philosophy of astronomy.
Implementation: An experiment was conducted where GPT-4 was provided with instructor model answers and rubrics to evaluate student writing assignments.
Outcomes: LLMs graded more reliably than peer graders and matched instructor grading across all three courses, making them effective for automating grading in large classes.
Challenges: LLMs struggled with grading more subjective assignments, especially in the history and philosophy course.
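The experimental setup described above, providing GPT-4 with an instructor model answer and a rubric alongside the student's response, can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the function names, prompt wording, rubric format, and 0-10 scoring scale are all assumptions made for the example.

```python
import re


def build_grading_messages(rubric: str, model_answer: str,
                           student_response: str) -> list[dict]:
    """Assemble a chat prompt that pairs the instructor's rubric and
    model answer with the student's writing (illustrative wording)."""
    system = (
        "You are a teaching assistant grading student writing in an "
        "introductory science MOOC. Score the response from 0 to 10 "
        "using the rubric, and reply with the number only."
    )
    user = (
        f"Rubric:\n{rubric}\n\n"
        f"Instructor model answer:\n{model_answer}\n\n"
        f"Student response:\n{student_response}\n\n"
        "Score:"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]


def parse_score(reply: str) -> float:
    """Extract the first numeric value from the model's reply."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"No score found in reply: {reply!r}")
    return float(match.group())


# Sending the prompt would use the OpenAI chat completions API
# (network call omitted here), along the lines of:
#   from openai import OpenAI
#   client = OpenAI()
#   reply = client.chat.completions.create(
#       model="gpt-4o-mini-2024-07-18",
#       messages=build_grading_messages(rubric, answer, essay),
#   ).choices[0].message.content
#   score = parse_score(reply)
```

Keeping the prompt assembly and score parsing separate from the API call makes the deterministic parts easy to test, and lets the same rubric-plus-model-answer prompt be reused across a whole class's submissions.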
Implementation Barriers
Technical Limitations
LLMs face challenges in evaluating creative assignments and those requiring higher-order thinking. Future work aims to develop writing assignments and grading systems that play to the strengths of LLMs.
Sampling Bias
The study used a purposeful sampling method, which could introduce bias and may not represent the broader student population. Random sampling methods should be considered in future studies to improve representativeness.
Variability in Grading Interpretations
Differences in interpretations of grading rubrics between LLMs, instructors, and peer graders may affect grading consistency. Utilizing more detailed and standardized rubrics could help align grading interpretations.
Project Team
Chris Impey
Researcher
Matthew Wenger
Researcher
Nikhil Garuda
Researcher
Shahriar Golchin
Researcher
Sarah Stamer
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Chris Impey, Matthew Wenger, Nikhil Garuda, Shahriar Golchin, Sarah Stamer
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI