
The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches

Project Overview

This document examines the use of generative AI in education through chatbot applications such as EdTalk, which helps users understand educational reports. It details an evaluation process that combines automated metrics, human assessments, and LLM-based judgments to gauge the effectiveness of such applications. The findings reveal inconsistencies across the three evaluation methods, underscoring the need for a robust evaluation framework that can adapt and improve over time. The study advocates ongoing refinement of LLM applications to improve their reliability and performance, with the ultimate aim of better supporting educational stakeholders in navigating complex information. It highlights both the potential of generative AI to transform educational experiences and the critical need for systematic evaluation of AI tools.

Key Applications

EdTalk Chatbot

Context: Helping a wide range of readers navigate educational reports

Implementation: Developed using Retrieval Augmented Generation (RAG) and evaluated through automated, human, and LLM-based methods.

Outcomes: Improved insights into the chatbot's performance and areas for improvement based on various evaluation methodologies.

Challenges: Inconsistencies between automated and human evaluations, hallucination issues in responses, and the complexity of defining evaluation metrics.
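The RAG setup described above can be sketched as a retrieve-then-generate loop. The passages, the keyword-overlap retriever, and the stubbed generation step below are illustrative assumptions, not the actual EdTalk implementation.

```python
# Minimal RAG sketch: retrieve relevant report passages, then build a
# grounded prompt. A real system would embed passages and call an LLM;
# everything here is an illustrative stand-in.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, passages, k=2):
    """Rank report passages by keyword overlap with the query."""
    scored = sorted(passages,
                    key=lambda p: len(tokenize(query) & tokenize(p)),
                    reverse=True)
    return scored[:k]

def answer(query, passages):
    """Build a grounded prompt; a real system would send this to an LLM."""
    context = "\n".join(retrieve(query, passages))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

passages = [
    "Average reading scores declined by two points between assessments.",
    "Participation rates held steady across all grade levels.",
    "The survey covered public schools in fifty states.",
]
print(answer("How did reading scores change?", passages))
```

Grounding the prompt in retrieved passages is also what makes the hallucination issues noted above measurable: a response can be checked against the retrieved context.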

Implementation Barriers

Technical

Lack of standardized evaluation criteria for LLM-based applications, leading to inconsistent results.

Proposed Solutions: Developing a comprehensive factored evaluation mechanism that combines automated, human, and LLM-based evaluations.
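One way such a factored mechanism could combine the three evaluation sources is a weighted average of per-criterion scores. The criteria names and weights below are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of a factored evaluation: each response gets per-criterion
# scores in [0, 1] from three sources, which are combined with assumed
# weights. Criteria and weights are illustrative only.

WEIGHTS = {"automated": 0.2, "human": 0.5, "llm_judge": 0.3}

def factored_score(scores):
    """scores: {source: {criterion: value}} -> weighted overall score."""
    total = 0.0
    for source, weight in WEIGHTS.items():
        per_criterion = scores[source]
        total += weight * (sum(per_criterion.values()) / len(per_criterion))
    return round(total, 3)

example = {
    "automated": {"rouge": 0.4, "bertscore": 0.7},
    "human":     {"faithfulness": 0.9, "clarity": 0.8},
    "llm_judge": {"faithfulness": 0.7, "clarity": 0.9},
}
print(factored_score(example))  # weighted average across the three sources
```

Keeping the per-source scores separate before weighting also makes it possible to inspect where the sources disagree, which is the core difficulty the paper reports.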

Operational

High costs and time associated with conducting reliable human evaluations.

Proposed Solutions: Increasing the number of evaluators and utilizing expert and non-expert contributors to balance costs.
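When expert and non-expert evaluators rate the same responses, a chance-corrected agreement statistic such as Cohen's kappa can show whether the cheaper raters are reliable enough to substitute for experts. A minimal sketch with made-up ratings:

```python
# Cohen's kappa: agreement between two raters over the same items,
# corrected for the agreement expected by chance. Ratings are illustrative.

def cohens_kappa(a, b):
    labels = set(a) | set(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

expert     = ["good", "good", "bad", "good", "bad", "good"]
non_expert = ["good", "bad",  "bad", "good", "bad", "good"]
print(round(cohens_kappa(expert, non_expert), 3))
```

Values near 1 indicate the non-expert ratings track the expert ones; values near 0 mean agreement is no better than chance, so the cost savings would come at the price of reliability.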

Methodological

Difficulties in correlating results from different evaluation methods (automated, human, LLM).

Proposed Solutions: Standardizing evaluation practices and incorporating diverse metrics based on cognitive frameworks like Bloom's Taxonomy.
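The degree of (dis)agreement between evaluation methods can be quantified directly, for example with Spearman rank correlation between automated metric scores and human ratings of the same responses. The scores below are made up for illustration.

```python
# Spearman rank correlation between two score lists (e.g. an automated
# metric vs. human Likert ratings of the same responses). Scores are
# illustrative, not data from the paper.

def rank(values):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var

automated = [0.62, 0.55, 0.91, 0.40, 0.73]   # e.g. automated metric scores
human     = [3, 2, 5, 1, 4]                  # e.g. 1-5 human ratings
print(round(spearman(automated, human), 3))
```

A low correlation between methods is exactly the kind of inconsistency the study reports, and tracking it over time is one way to judge whether standardized practices are helping.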

Project Team

Bhashithe Abeysinghe

Researcher

Ruhan Circi

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Bhashithe Abeysinghe, Ruhan Circi

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
