Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
Project Overview
This document explores the role of generative AI in education, with an emphasis on Automated Essay Scoring (AES) systems, whose use surged during the COVID-19 pandemic. It underscores the growing demand for effective and reliable scoring of written responses, which prompted the development of a model-agnostic evaluation toolkit that assesses AES performance across multiple criteria, such as coherence and grammar. The findings indicate that many existing AES models exhibit overstability: they fail to distinguish high-quality essays from poorly constructed ones, raising concerns about their reliability, especially in high-stakes settings. Overall, the document highlights the potential of generative AI to enhance educational assessment while stressing the need to improve the precision and effectiveness of these systems.
Key Applications
Automated Essay Scoring (AES) Systems
Context: High school and middle school classrooms, standardized tests, college admissions, and job screening.
Implementation: AES systems analyze essays using natural language processing techniques to score written responses automatically.
Outcomes: Cost and time savings for educators, with the capability to handle large-scale assessments efficiently.
Challenges: Overstability of models, inability to detect adversarial modifications, and reliance on limited performance metrics.
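To make the overstability challenge concrete, the minimal sketch below (not taken from the original paper) probes a scorer with a simple adversarial perturbation, shuffling the sentences of an essay, and measures how much the predicted score moves. The function score_essay is a hypothetical stand-in for whichever AES model is under test; the toolkit itself is model-agnostic in the same spirit.

# Minimal overstability probe for a model-agnostic robustness check.
# ASSUMPTION: score_essay(text) -> float is a placeholder for the AES model
# under test; it is not an API defined by the original paper.
import random

def shuffle_sentences(essay: str, seed: int = 0) -> str:
    """Adversarially perturb an essay by shuffling its sentences."""
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."

def overstability_gap(score_essay, essay: str) -> float:
    """Absolute change in score after perturbation.
    A value near zero on a clearly degraded essay signals overstability."""
    return abs(score_essay(essay) - score_essay(shuffle_sentences(essay)))

In practice, the same probe can be repeated with other perturbations discussed in the AES robustness literature, such as repeating sentences, inserting off-topic text, or introducing grammatical errors, and the resulting gaps aggregated across a test set.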
Implementation Barriers
Technical Limitation
Current AES models struggle to assess essay quality accurately: they fail to detect adversarial inputs, and their evaluation relies on limited agreement metrics.
Proposed Solutions: Develop more comprehensive evaluation metrics beyond agreement scores and implement adversarial training techniques.
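In AES, "agreement scores" usually means Quadratic Weighted Kappa (QWK) between model and human ratings. The hedged sketch below, which is illustrative rather than the paper's own metric suite, computes QWK with scikit-learn and contrasts it with a simple robustness rate: the fraction of adversarially perturbed essays whose score drops by at least a chosen margin. The perturbation function and the min_drop threshold are assumptions made for the example.

# QWK measures agreement with human raters but says nothing about robustness,
# which is why evaluation metrics beyond agreement scores are proposed.
from sklearn.metrics import cohen_kappa_score

def qwk(human_scores, model_scores):
    """Quadratic Weighted Kappa, the standard AES agreement metric."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")

def robustness_rate(score_essay, essays, perturb, min_drop=1.0):
    """Fraction of essays whose score drops by at least min_drop after an
    adversarial perturbation (min_drop is an illustrative threshold)."""
    drops = [score_essay(e) - score_essay(perturb(e)) >= min_drop for e in essays]
    return sum(drops) / len(drops)

A model can post a high QWK on a clean test set while scoring poorly on such a robustness rate, which is the kind of pattern the evaluation toolkit is designed to surface.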
Trust and Acceptance
There is skepticism regarding the reliability of AI-based scoring systems among educators and the public.
Proposed Solutions: Increase transparency in model evaluations and improve the robustness of scoring systems to build trust.
Project Team
Anubha Kabra
Researcher
Mehar Bhatia
Researcher
Yaman Kumar
Researcher
Junyi Jessy Li
Researcher
Rajiv Ratn Shah
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Anubha Kabra, Mehar Bhatia, Yaman Kumar, Junyi Jessy Li, Rajiv Ratn Shah
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI