
Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring

Project Overview

The document examines the role of Large Language Models (LLMs) in education, focusing on their application to automatic scoring in science education. It highlights the advantages of LLMs, such as prompt feedback and reduced grading costs, while also examining how far these models diverge from human grading standards. The findings indicate that LLMs can improve scoring accuracy when guided by analytic rubrics tailored to specific educational contexts, but these rubrics must reflect the expectations of human graders to preserve the integrity of assessments. Overall, the document underscores the potential of generative AI to make grading more efficient and accurate, while advocating careful attention to the alignment between AI-generated rubrics and traditional grading practices.

Key Applications

Automatic Scoring using LLMs

Context: Science education, specifically assessing middle school students' understanding of scientific phenomena.

Implementation: LLMs are prompted to generate analytic rubrics and score student responses based on these rubrics.

Outcomes: Improved feedback timeliness and potential cost savings in grading; however, there is a notable alignment gap between LLM and human grading.

Challenges: LLMs often take shortcuts in reasoning, leading to misalignment with human grading logic and potential inaccuracies in scoring.
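The implementation above (prompting an LLM to generate and apply an analytic rubric) can be sketched as follows. The rubric criteria, prompt wording, and JSON reply format are illustrative assumptions, not the authors' actual protocol; a real deployment would send `build_scoring_prompt`'s output to a model such as gpt-4o-mini and pass the reply to `parse_scores`.

```python
import json

# Hypothetical analytic rubric for a middle-school science item (illustrative).
RUBRIC = [
    "Identifies the scientific phenomenon correctly",
    "Supports the claim with evidence",
    "Uses appropriate scientific vocabulary",
]

def build_scoring_prompt(rubric, response):
    """Assemble a prompt asking the LLM to grade each criterion 0/1
    and reply as JSON, e.g. {"scores": [1, 0, 1]}."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(rubric))
    return (
        "Score the student response against each rubric criterion (0 or 1).\n"
        f"Rubric:\n{criteria}\n"
        f"Student response: {response}\n"
        'Reply with JSON only, e.g. {"scores": [1, 0, 1]}.'
    )

def parse_scores(reply, n_criteria):
    """Parse the model's JSON reply into a list of 0/1 per-criterion scores,
    falling back to all zeros if the reply is malformed."""
    try:
        scores = json.loads(reply)["scores"]
        if len(scores) == n_criteria and all(s in (0, 1) for s in scores):
            return scores
    except (ValueError, KeyError, TypeError):
        pass
    return [0] * n_criteria

prompt = build_scoring_prompt(
    RUBRIC, "Ice melts because heat energy speeds up its molecules."
)
scores = parse_scores('{"scores": [1, 1, 0]}', len(RUBRIC))
print(sum(scores))  # total analytic score for this (simulated) reply
```

Parsing with a strict fallback matters in practice: a scoring pipeline should fail safe rather than crash when the model departs from the requested format.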

Implementation Barriers

Technical Barrier

LLMs may not follow the same logical reasoning as human graders, leading to inconsistencies in scoring.

Proposed Solutions: Incorporating high-quality human-designed analytic rubrics to guide LLM scoring and improve alignment with human expectations.
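One common way to quantify how well LLM scores align with human expectations is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below uses made-up scores purely for illustration; the paper's own alignment analysis may use different metrics.

```python
from collections import Counter

def cohens_kappa(human, llm):
    """Cohen's kappa: inter-rater agreement corrected for chance agreement."""
    assert len(human) == len(llm) and human
    n = len(human)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(h == m for h, m in zip(human, llm)) / n
    # Expected agreement under independent raters with these score frequencies.
    h_counts, m_counts = Counter(human), Counter(llm)
    labels = set(human) | set(llm)
    expected = sum(h_counts[c] * m_counts[c] for c in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up example: human vs. LLM scores on ten responses (0-2 scale).
human_scores = [2, 1, 0, 2, 1, 1, 0, 2, 2, 1]
llm_scores   = [2, 1, 1, 2, 1, 0, 0, 2, 1, 1]
print(round(cohens_kappa(human_scores, llm_scores), 3))  # → 0.531
```

A kappa near 1 indicates the LLM reproduces human judgments; values well below 1 signal the kind of alignment gap the proposed human-designed rubrics aim to close.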

Data Barrier

The need for extensive training data can be costly and time-consuming.

Proposed Solutions: Exploring methods to transform scoring tasks into pre-training tasks for LLMs to avoid extensive data preparation.

Ethical Barrier

Concerns about the reliability of LLMs for scoring sensitive tasks like student assessments.

Proposed Solutions: Ensuring explainability in LLM scoring procedures to understand their decision-making processes.

Project Team

Xuansheng Wu

Researcher

Padmaja Pravin Saraf

Researcher

Gyeonggeon Lee

Researcher

Ehsan Latif

Researcher

Ninghao Liu

Researcher

Xiaoming Zhai

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Xuansheng Wu, Padmaja Pravin Saraf, Gyeonggeon Lee, Ehsan Latif, Ninghao Liu, Xiaoming Zhai

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI