
ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning

Project Overview

This project explores the use of generative AI, specifically the ALLURE protocol, to audit and improve the text-evaluation capabilities of large language models (LLMs) in educational settings. It emphasizes the role of iterative in-context learning (ICL) in refining the accuracy of models such as GPT-4: adding ICL examples can boost evaluation performance, but the quality and relevance of those examples are essential for reliable outcomes. The protocol also addresses ethical and technical challenges by systematically folding examples of the evaluator's failure modes back into its prompts, improving overall effectiveness. The findings underscore the potential of generative AI in education, showing how thoughtful implementation can substantially advance the evaluation of educational content and practices.
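The overview above implies a simple shape for the audit loop. The sketch below is a minimal, hypothetical Python rendering of iterative ICL as summarized here: grade texts with an LLM, collect cases where the grade disagrees with a trusted human label, and fold those failure cases back into the prompt as in-context examples. All names (`query_llm`, `audit_and_refine`, the 1-5 scale) are illustrative assumptions, not the paper's actual code or API.

```python
from dataclasses import dataclass

@dataclass
class LabeledText:
    text: str
    human_score: int  # trusted reference rating, e.g. 1-5

def query_llm(prompt: str) -> int:
    """Stand-in for a real LLM call (e.g. a GPT-4 chat completion).
    Replace with an actual API call; it returns a fixed score here so
    the sketch runs end to end."""
    return 3

def build_prompt(item: LabeledText, icl_examples: list[LabeledText]) -> str:
    """Prepend failure cases as few-shot examples before the item to grade."""
    shots = "\n\n".join(
        f"Text: {ex.text}\nCorrect score: {ex.human_score}" for ex in icl_examples
    )
    return f"{shots}\n\nText: {item.text}\nScore (1-5):"

def audit_and_refine(data: list[LabeledText], rounds: int = 3) -> list[LabeledText]:
    """Each round, newly misgraded items (failure modes) become ICL examples."""
    icl_examples: list[LabeledText] = []
    for _ in range(rounds):
        failures = [
            item for item in data
            if item not in icl_examples
            and query_llm(build_prompt(item, icl_examples)) != item.human_score
        ]
        if not failures:
            break  # the evaluator now agrees with the human labels on this set
        icl_examples.extend(failures)
    return icl_examples
```

The loop terminates early once the grader matches the human labels, which reflects the summary's point that example quality and relevance, not sheer count, drive reliable outcomes.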

Key Applications

LLM-based Evaluation and Summarization Assessment

Context: Used for grading educational content, such as term papers, and for evaluating the accuracy of medical-document summarization, including summaries of clinical notes.

Implementation: Uses iterative in-context learning (ICL) with LLMs such as GPT-4 to refine the evaluation and scoring of text generated in educational and medical contexts, improving the accuracy and consistency of assessments (an alignment-measurement sketch follows the lists below).

Outcomes:

- Improved accuracy in evaluating educational and medical content
- Enhanced consistency in summarization evaluations
- Improved alignment with human ratings
- Reduced reliance on human annotators

Challenges:

- LLMs exhibit failure modes that can affect performance
- Initial increases in ICL examples may decrease performance before improvement
- Bias and inaccuracies in LLM evaluations
- Ethical considerations in deploying LLMs for medical assessments
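One way to make the "improved alignment with human ratings" outcome concrete is to score the LLM grader against human labels on a held-out set before and after the ICL rounds. The snippet below is a hedged sketch under that assumption; exact-match agreement is one simple metric, and the paper may report others. All scores shown are illustrative, not results from the paper.

```python
def exact_agreement(llm_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of items where the LLM grade matches the human grade."""
    matches = sum(l == h for l, h in zip(llm_scores, human_scores))
    return matches / len(human_scores)

# Illustrative numbers only: a zero-shot grader vs. the same grader after
# failure-mode examples were folded into its prompt.
baseline = exact_agreement([3, 2, 5, 1], [3, 4, 5, 2])
refined = exact_agreement([3, 4, 5, 2], [3, 4, 5, 2])
print(f"agreement: {baseline:.2f} -> {refined:.2f}")  # agreement: 0.50 -> 1.00
```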

Implementation Barriers

Technical

LLMs exhibit identifiable failure modes that can lead to incorrect evaluations, and potential biases in evaluations can have serious consequences in educational and medical contexts.

Proposed Solutions: Apply protocols such as ALLURE to systematically audit and improve LLM evaluations, paired with human-in-the-loop review to keep evaluations accurate and ethical; a minimal routing sketch follows below.
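As a sketch of the human-in-the-loop approach just mentioned: one simple policy is to accept an LLM score only when it agrees with an independent check (a second grading pass or a human spot-check) and route disagreements to a reviewer. The function name, threshold, and data below are assumptions for illustration, not the paper's protocol.

```python
def needs_human_review(llm_score: int, check_score: int, tolerance: int = 0) -> bool:
    """Route an item to a human reviewer when two scores diverge."""
    return abs(llm_score - check_score) > tolerance

# Each pair is (LLM score, spot-check score) for one item; values are made up.
pairs = [(4, 4), (2, 5), (3, 3)]
review_queue = [i for i, (llm, check) in enumerate(pairs)
                if needs_human_review(llm, check)]
print("items flagged for review:", review_queue)  # -> [1]
```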

Project Team

Hosein Hasanbeig

Researcher

Hiteshi Sharma

Researcher

Leo Betthauser

Researcher

Felipe Vieira Frujeri

Researcher

Ida Momennejad

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
