Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

Project Overview

This document explores the role of generative AI in education, focusing on Automated Essay Scoring (AES) systems, whose use surged during the COVID-19 pandemic. The growing demand for reliable scoring of written responses motivated the development of a model-agnostic evaluation toolkit that assesses AES performance across multiple criteria, such as coherence and grammar. The findings indicate that many existing AES models exhibit over-stability: their scores change little even when an essay is substantially degraded, which hinders their ability to distinguish well-constructed essays from poor ones and raises reliability concerns, especially in high-stakes settings. Overall, the document highlights the potential of generative AI to enhance educational assessment while underscoring the need to improve the precision and robustness of these systems.

Key Applications

Automated Essay Scoring (AES) Systems

Context: High school and middle school classrooms, standardized tests, college admissions, and job screening.

Implementation: AES systems analyze essays using natural language processing techniques to score written responses automatically.

Outcomes: Cost and time savings for educators, with the capability to handle large-scale assessments efficiently.

Challenges: Over-stability of models, inability to detect adversarial modifications, and reliance on a narrow set of agreement-based performance metrics.
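Because the toolkit is model-agnostic, an over-stability check can treat the scorer as a black box and compare its scores before and after an adversarial edit. The sketch below illustrates the idea only; the perturbation, the `score_fn` interface, and the toy word-count scorer are all assumptions for illustration, not the paper's implementation.

```python
import random


def shuffle_sentences(essay: str, seed: int = 0) -> str:
    """Adversarial perturbation: randomly reorder the essay's sentences.

    A scorer sensitive to coherence should lower its score; an
    over-stable scorer will return nearly the same score.
    """
    sentences = [s.strip() for s in essay.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def overstability_gap(score_fn, essay: str) -> float:
    """Score difference between the original and the perturbed essay.

    `score_fn` is any black-box scorer (model-agnostic). A gap near
    zero on a scrambled essay is a sign of over-stability.
    """
    return score_fn(essay) - score_fn(shuffle_sentences(essay))


# Toy scorer that only counts words -- deliberately over-stable,
# since shuffling sentences never changes the word count.
word_count_scorer = lambda text: len(text.split())

essay = "Reading builds vocabulary. Writing builds clarity. Both matter."
print(overstability_gap(word_count_scorer, essay))  # → 0 (score unchanged)
```

In practice the same gap would be computed over a whole test set and across several perturbation types (sentence shuffling, irrelevant-sentence injection, grammar corruption), flagging models whose scores barely move.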

Implementation Barriers

Technical Limitation

Current AES models struggle to assess essay quality accurately: they fail to detect adversarial inputs, and their evaluation relies on limited agreement metrics that do not measure robustness.

Proposed Solutions: Develop more comprehensive evaluation metrics beyond agreement scores and implement adversarial training techniques.
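Agreement between model and human scores is conventionally reported with quadratic weighted kappa (QWK), the standard metric in AES benchmarks; a model can score highly on QWK while remaining over-stable, which is why the document argues for metrics beyond agreement. A minimal sketch of the QWK computation follows (the function name and score range are illustrative assumptions):

```python
import numpy as np


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two integer score sequences.

    1.0 means perfect agreement; 0.0 means chance-level agreement;
    negative values mean worse-than-chance agreement.
    """
    n = max_score - min_score + 1
    # Observed co-occurrence matrix of (rater_a, rater_b) score pairs.
    observed = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score, b - min_score] += 1
    # Expected matrix from the two marginal score histograms.
    hist_a = observed.sum(axis=1)
    hist_b = observed.sum(axis=0)
    expected = np.outer(hist_a, hist_b) / observed.sum()
    # Quadratic penalty: disagreements are weighted by squared distance.
    weights = np.array(
        [[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)]
    )
    return 1 - (weights * observed).sum() / (weights * expected).sum()


human = [1, 2, 3, 4]
model = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human, model, 1, 4))  # → 1.0 (perfect agreement)
```

Note that QWK only compares score distributions; it would not penalize the over-stable word-count scorer above for ignoring sentence order, which is exactly the gap a robustness-oriented evaluation toolkit is meant to close.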

Trust and Acceptance

There is skepticism regarding the reliability of AI-based scoring systems among educators and the public.

Proposed Solutions: Increase transparency in model evaluations and improve the robustness of scoring systems to build trust.

Project Team

Anubha Kabra

Researcher

Mehar Bhatia

Researcher

Yaman Kumar

Researcher

Junyi Jessy Li

Researcher

Rajiv Ratn Shah

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Anubha Kabra, Mehar Bhatia, Yaman Kumar, Junyi Jessy Li, Rajiv Ratn Shah

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
