
Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences

Project Overview

This project examines the integration of generative AI into educational assessment through EvalAssist, a tool that uses large language models (LLMs) to help users refine evaluation criteria. It focuses on two assessment strategies: Direct Assessment, which offers clarity and control, and Pairwise Comparison, which allows greater flexibility for subjective evaluations. The findings show that while both strategies have advantages, their effectiveness depends heavily on continuous refinement of the evaluation criteria and the incorporation of user feedback. The study also highlights significant challenges, such as addressing potential biases in AI outputs and fostering user trust in these systems. Overall, it underscores the potential of generative AI to improve assessment practices while calling for careful consideration of their limitations and ethical implications.
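To make the two strategies concrete, here is a minimal sketch of how a judge LLM might be prompted under each one. The prompt wording and the `llm(prompt)` helper are illustrative assumptions, not EvalAssist's actual interface.

```python
# Illustrative only: `llm` is a hypothetical callable that sends a prompt to
# a judge model and returns its text reply; the prompts are simplified
# stand-ins, not EvalAssist's actual templates.

def direct_assessment(llm, criterion: str, output: str) -> str:
    """Judge a single output against a natural-language criterion."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Output: {output}\n"
        "Does the output satisfy the criterion? Answer Yes or No."
    )
    return llm(prompt).strip()

def pairwise_comparison(llm, criterion: str, output_a: str, output_b: str) -> str:
    """Ask the judge which of two outputs better satisfies the criterion."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Output A: {output_a}\n"
        f"Output B: {output_b}\n"
        "Which output better satisfies the criterion? Answer A or B."
    )
    return llm(prompt).strip()
```

Direct assessment yields an absolute verdict per output, which is easy to audit and control; pairwise comparison only orders outputs, which suits subjective criteria where an absolute threshold is hard to state.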

Key Applications

EvalAssist

Context: Evaluation of large language model outputs by practitioners in machine learning, software engineering, and AI engineering.

Implementation: Implemented as a web-based tool that lets users iteratively refine evaluation criteria and run LLM-based direct and pairwise assessments.

Outcomes: The tool enhances user engagement through iterative criteria refinement, improves human-AI alignment in evaluations, and reports metrics for positional bias (a generic version of such a check is sketched after this list).

Challenges: Users may experience criteria drift, modifying their evaluation standards over time, and may find it difficult to define clear evaluation criteria up front.
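One standard way to quantify positional bias in pairwise judging is to run each comparison twice with the candidate order swapped and count verdicts that flip with the order. The sketch below does this under the same assumptions as the earlier snippet; it is a generic consistency check, not necessarily the exact metric EvalAssist reports.

```python
# Generic positional-bias check, reusing the hypothetical pairwise_comparison
# helper from the earlier sketch. A judge free of positional bias should pick
# the same underlying output regardless of whether it is shown as A or B.

def positional_bias_rate(llm, criterion: str, pairs: list[tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict flips when the A/B order is swapped."""
    flipped = 0
    for a, b in pairs:
        verdict_ab = pairwise_comparison(llm, criterion, a, b)  # a shown as A
        verdict_ba = pairwise_comparison(llm, criterion, b, a)  # a shown as B
        # Consistent behavior: "A" in the first run should become "B" in the
        # swapped run, since both verdicts then point at the same output `a`.
        if (verdict_ab == "A") != (verdict_ba == "B"):
            flipped += 1
    return flipped / len(pairs) if pairs else 0.0
```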

Implementation Barriers

Technical Barrier

Users often find it challenging to define effective criteria without seeing a range of possible outputs, leading to criteria drift.

Proposed Solutions: Provide diverse task contexts to help users develop more robust evaluation criteria.
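As one concrete way to surface a range of possible outputs, the sketch below greedily selects a maximally diverse subset of candidate outputs for the user to review before committing to a criterion. The distance function and selection strategy are assumptions chosen for illustration, not a documented EvalAssist feature.

```python
import difflib

# Illustrative sketch: greedily pick k candidate outputs that are maximally
# dissimilar from one another, so the user sees a broad range of behaviors
# before finalizing a criterion. difflib keeps the example dependency-free;
# an embedding-based distance would likely work better in practice.

def dissimilarity(x: str, y: str) -> float:
    """1.0 means completely different, 0.0 means identical."""
    return 1.0 - difflib.SequenceMatcher(None, x, y).ratio()

def diverse_sample(outputs: list[str], k: int) -> list[str]:
    """Greedy max-min selection of k mutually dissimilar outputs."""
    if not outputs:
        return []
    chosen = [outputs[0]]  # seed with an arbitrary first output
    while len(chosen) < min(k, len(outputs)):
        # Pick the candidate whose nearest chosen neighbor is farthest away.
        best = max(
            (o for o in outputs if o not in chosen),
            key=lambda o: min(dissimilarity(o, c) for c in chosen),
        )
        chosen.append(best)
    return chosen
```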

User Experience Barrier

Participants may struggle to formulate criteria in natural language, leading to suboptimal results.

Proposed Solutions: Implement auto-correction features and offer more examples for users to customize their inputs.

Project Team

Zahra Ashktorab, Researcher
Michael Desmond, Researcher
Qian Pan, Researcher
James M. Johnson, Researcher
Martin Santillan Cooper, Researcher
Elizabeth M. Daly, Researcher
Rahul Nair, Researcher
Tejaswini Pedapati, Researcher
Swapnaja Achintalwar, Researcher
Werner Geyer, Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, Werner Geyer

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
