Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences
Project Overview
The document examines the integration of generative AI into education through EvalAssist, a tool that leverages large language models (LLMs) to refine evaluation criteria in educational assessments. It focuses on two main assessment strategies: Direct Assessment, which offers clarity and control, and Pairwise Comparison, which allows greater flexibility in subjective evaluations. The findings reveal that while both strategies have advantages, their effectiveness depends heavily on continuous refinement of evaluation criteria and on incorporating user feedback. The study also highlights significant challenges, such as addressing potential biases in AI outputs and fostering user trust in these systems. Overall, the document underscores the potential of generative AI to improve educational assessment practices while calling for careful consideration of its limitations and ethical implications.
Key Applications
EvalAssist
Context: Evaluation of large language model outputs by practitioners in machine learning, software engineering, and AI engineering.
Implementation: Implemented as a web-based tool that allows users to refine evaluation criteria using LLMs for direct and pairwise assessments.
Outcomes: The tool enhances user engagement through iterative criteria refinement, improves human-AI alignment in evaluations, and provides metrics for positional bias.
Challenges: Users may experience criteria drift, where they modify their evaluation standards over time, and there can be difficulties in defining clear evaluation criteria.
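The pairwise strategy and the positional-bias metric mentioned above can be illustrated with a short sketch. Everything here is hypothetical: `judge` is a stand-in for an actual LLM call, and the bias check simply asks whether the verdict flips when the two candidate outputs are swapped. This is not EvalAssist's implementation, only a minimal illustration of the idea.

```python
# Hypothetical sketch of pairwise LLM-as-judge evaluation with a
# positional-bias check: query the judge twice, swapping the order of
# the two candidates, and flag disagreement between the two runs.

def judge(criterion: str, output_a: str, output_b: str) -> str:
    """Toy judge that prefers the longer output (stand-in for an LLM).
    Returns "A" or "B", referring to the *position* of the candidate."""
    return "A" if len(output_a) >= len(output_b) else "B"

def pairwise_with_bias_check(criterion: str, out1: str, out2: str):
    first = judge(criterion, out1, out2)    # out1 presented in position A
    swapped = judge(criterion, out2, out1)  # out2 presented in position A
    # The runs agree only if both name the same underlying output.
    consistent = (first == "A") == (swapped == "B")
    winner = out1 if first == "A" else out2
    return winner, consistent  # consistent=False signals positional bias

winner, consistent = pairwise_with_bias_check(
    "Is the answer concise and correct?",
    "Paris is the capital of France.",
    "Paris.",
)
```

Averaging `consistent` over many pairs would yield a simple positional-bias rate; a real implementation would replace `judge` with prompted LLM calls.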
Implementation Barriers
Technical Barrier
Users often find it challenging to define effective criteria without seeing a range of possible outputs, leading to criteria drift.
Proposed Solutions: Provide diverse task contexts to help users develop more robust evaluation criteria.
User Experience Barrier
Participants may struggle to formulate criteria in natural language, leading to suboptimal results.
Proposed Solutions: Implement auto-correction features and offer more examples for users to customize their inputs.
Project Team
Zahra Ashktorab
Researcher
Michael Desmond
Researcher
Qian Pan
Researcher
James M. Johnson
Researcher
Martin Santillan Cooper
Researcher
Elizabeth M. Daly
Researcher
Rahul Nair
Researcher
Tejaswini Pedapati
Researcher
Swapnaja Achintalwar
Researcher
Werner Geyer
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, Werner Geyer
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI