Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Project Overview
This document examines the use of Large Language Models (LLMs) in education, focusing on their role as evaluators for the automatic quality assessment of AI-generated text. It investigates how different prompting strategies affect LLMs' alignment with human judgments across evaluation tasks, proposing a taxonomy of quality criteria covering Content, Relevance, Integrity, and Engagement. The findings show that intricate prompting does not notably improve performance, especially for advanced models such as GPT-4, which align closely with human evaluations even under minimal guidance. The document also highlights perplexity as a viable alternative for estimating text quality, suggesting it may offer a more straightforward route to assessment. Overall, generative AI in educational contexts shows promise for enhancing evaluation processes, providing insight into quality assessment, and streamlining feedback on AI-generated content.
Key Applications
LLMs as evaluators for task quality assessment
Context: Text quality evaluation in various natural language generation tasks, including summarization and question answering.
Implementation: LLMs are prompted with varying levels of detail about the evaluation criteria, from simple prompts to detailed rubrics (see the sketch after this list).
Outcomes: High correlation between LLM and human judgments in text quality evaluation, particularly with simpler prompts for powerful models.
Challenges: Determining the most effective prompting strategy and mitigating LLMs' inherent bias toward their own outputs.
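To make the two levels of instruction concrete, the sketch below contrasts a minimal prompt with a rubric-style prompt built around the Content, Relevance, Integrity, and Engagement criteria. It assumes the OpenAI Python client; the model name, rubric wording, and 1-5 scale are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of LLM-as-evaluator prompting at two levels of detail.
# Model name, rubric text, and the 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SIMPLE_PROMPT = (
    "Rate the quality of the following summary on a scale from 1 (poor) "
    "to 5 (excellent). Reply with the number only.\n\nSummary:\n{text}"
)

DETAILED_PROMPT = (
    "You are evaluating an AI-generated summary. Score it from 1 to 5 "
    "using this rubric:\n"
    "- Content: covers the key points of the source\n"
    "- Relevance: stays on topic, with no extraneous material\n"
    "- Integrity: factually consistent with the source\n"
    "- Engagement: clear and readable\n"
    "Reply with the overall number only.\n\nSummary:\n{text}"
)

def llm_score(text: str, template: str, model: str = "gpt-4o") -> int:
    """Ask the LLM for a quality rating and parse the numeric reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": template.format(text=text)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

In this framing, comparing llm_score(text, SIMPLE_PROMPT) against llm_score(text, DETAILED_PROMPT) is what varying the level of evaluation instruction amounts to in practice.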
Implementation Barriers
Technical
LLMs' tendency to prefer their own outputs over those of other models can distort evaluation results.
Proposed Solutions: Use perplexity as an unbiased metric for evaluation, or refine prompt engineering to minimize bias.
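As an illustration of the perplexity route, the sketch below scores a text by its perplexity under a small causal language model via Hugging Face Transformers. The choice of GPT-2 is an assumption for illustration, not necessarily the model used in the paper; lower perplexity loosely indicates more fluent, predictable text.

```python
# Sketch: perplexity of a text under a small causal LM, used as a
# reference-free quality signal (lower perplexity ~ more fluent text).
# GPT-2 is an illustrative choice, not necessarily the paper's model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Compute exp(mean negative log-likelihood) of the text under the LM."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean cross-entropy.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Because the scoring model need not be the generator being judged, this signal sidesteps the self-preference bias noted above.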
Methodological
Difficulty identifying the appropriate LLM and the level of instruction detail needed for accurate evaluations.
Proposed Solutions: Develop clearer guidelines for when to use detailed prompts versus simpler instructions based on task complexity.
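One way to ground such guidelines is to measure, on a small calibration set, how well each prompting strategy's scores correlate with human ratings. The sketch below uses Spearman rank correlation; the score lists are hypothetical placeholders, not data from the paper.

```python
# Sketch: compare prompting strategies by how well LLM scores correlate
# with human ratings; the score lists here are hypothetical placeholders.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4]           # hypothetical human ratings
simple_prompt_scores = [4, 3, 5, 3, 2, 4]   # LLM scores, minimal instructions
rubric_prompt_scores = [5, 2, 4, 3, 1, 4]   # LLM scores, detailed rubric

for name, scores in [("simple", simple_prompt_scores),
                     ("rubric", rubric_prompt_scores)]:
    rho, p_value = spearmanr(human_scores, scores)
    print(f"{name} prompt: Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

If the simple prompt already correlates about as well as the rubric, the extra prompt detail can be dropped for that task.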
Project Team
Bhuvanashree Murugadoss
Researcher
Christian Poelitz
Researcher
Ian Drosos
Researcher
Vu Le
Researcher
Nick McKenna
Researcher
Carina Suzana Negreanu
Researcher
Chris Parnin
Researcher
Advait Sarkar
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI