
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Project Overview

This project examines the application of Large Language Models (LLMs) in education, focusing on their role as automatic evaluators of AI-generated text quality. It investigates how different prompting strategies affect LLMs' alignment with human judgments across evaluation tasks, proposing a taxonomy of quality criteria covering Content, Relevance, Integrity, and Engagement. The findings show that intricate prompting does not notably improve performance, especially for advanced models such as GPT-4, which align strongly with human evaluations even under minimal guidance. The document also highlights perplexity as a viable alternative for estimating text quality, suggesting it may offer a more straightforward route to assessment. Overall, generative AI shows promise in educational contexts for enhancing evaluation processes, providing insight into quality assessment, and streamlining feedback on AI-generated content.

Key Applications

LLMs as evaluators for task quality assessment

Context: Text quality evaluation in various natural language generation tasks, including summarization and question answering.

Implementation: LLMs are prompted with varying levels of detail about the evaluation criteria, from simple prompts to detailed rubrics (a minimal sketch follows this list).

Outcomes: High correlation between LLM and human judgments in text quality evaluation, particularly with simpler prompts for powerful models.

Challenges: Determining the most effective prompting strategy, and mitigating LLMs' inherent bias toward their own outputs.
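
To make the two prompting levels concrete, here is a minimal sketch assuming the OpenAI Python client and a gpt-4o-mini judge; the prompt wording and the `llm_judge` helper are illustrative assumptions, not the paper's exact instructions or rubric.

```python
# Illustrative sketch (not the paper's exact prompts): scoring a candidate
# summary with two levels of evaluation-criteria detail via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SIMPLE_PROMPT = (
    "Rate the quality of the following summary on a scale of 1-5. "
    "Reply with a single number.\n\nSummary:\n{summary}"
)

DETAILED_PROMPT = (
    "You are evaluating an AI-generated summary. Score it from 1-5 using this rubric:\n"
    "- Content: does it capture the key information of the source?\n"
    "- Relevance: is everything it includes pertinent?\n"
    "- Integrity: is it factually consistent with the source?\n"
    "- Engagement: is it clear and readable?\n"
    "Reply with a single overall number.\n\nSource:\n{source}\n\nSummary:\n{summary}"
)

def llm_judge(prompt_template: str, **fields) -> str:
    """Send one evaluation prompt and return the model's raw reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_template.format(**fields)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example usage: compare scores from the minimal and rubric-based prompts.
# score_simple = llm_judge(SIMPLE_PROMPT, summary=candidate_summary)
# score_detailed = llm_judge(DETAILED_PROMPT, source=source_text, summary=candidate_summary)
```

Comparing the two scores against human ratings is what distinguishes whether the extra rubric detail actually buys better alignment, which the findings above suggest it often does not for stronger models.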

Implementation Barriers

Technical

LLMs' bias toward preferring their own outputs can distort evaluation results.

Proposed Solutions: Use perplexity as an unbiased metric for evaluation, or refine prompt engineering to minimize bias.
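
As a rough illustration of the perplexity-based alternative, the sketch below scores text with an off-the-shelf GPT-2 model via Hugging Face transformers; the choice of scoring model and the `perplexity` helper are assumptions, not the paper's specific setup. Lower perplexity indicates text the scoring model finds more fluent.

```python
# Hedged sketch: estimating text quality via perplexity under a small open
# language model (GPT-2 here is an assumption, not the paper's choice).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the perplexity of `text` under the scoring model."""
    encodings = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean
        # next-token cross-entropy loss over the sequence.
        outputs = model(encodings.input_ids, labels=encodings.input_ids)
    return math.exp(outputs.loss.item())

# Example: rank candidate outputs without asking an LLM to judge them,
# sidestepping the self-preference bias noted above.
# best = min(candidates, key=perplexity)
```

Because the scoring model never sees which system produced each candidate, this approach avoids the self-preference problem, at the cost of measuring fluency rather than task-specific criteria.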

Methodological

Challenges in identifying the appropriate LLM and level of information needed for accurate evaluations.

Proposed Solutions: Develop clearer guidelines for when to use detailed prompts versus simpler instructions based on task complexity.

Project Team

Bhuvanashree Murugadoss

Researcher

Christian Poelitz

Researcher

Ian Drosos

Researcher

Vu Le

Researcher

Nick McKenna

Researcher

Carina Suzana Negreanu

Researcher

Chris Parnin

Researcher

Advait Sarkar

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
