
LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic

Project Overview

This project applies a novel framework, LLM-as-a-Fuzzy-Judge, to education, specifically to evaluating medical students' clinical communication skills with generative AI. It addresses the limitations of conventional assessment methods, which often rely on binary evaluations that fail to reflect the complexity of human judgment. By integrating large language models (LLMs) with fuzzy logic, the framework captures the subtlety and subjectivity inherent in clinical evaluation. This approach improves assessment accuracy, achieving over 80% precision on fuzzy criteria such as professionalism and medical relevance, and aligns AI-driven evaluations more closely with those of human evaluators. The findings underscore the potential of generative AI to transform educational assessment by providing a more nuanced, flexible, and reliable means of evaluating student performance, thereby potentially improving outcomes in medical training.

Key Applications

LLM-as-a-Fuzzy-Judge

Context: Medical education, targeting medical students practicing clinical communication skills through simulations.

Implementation: Fine-tuning large language models with human annotations based on fuzzy criteria for evaluating student-AI patient conversations.

Outcomes: Achieved over 80% accuracy in assessments, improved alignment with human evaluators, and provided nuanced feedback.

Challenges: Quality and diversity of annotated data, resource-intensive expert evaluation, and variability in human judgment.
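The core idea above, replacing binary pass/fail checks with graded fuzzy criteria, can be sketched in a few lines. This is an illustrative sketch only: the criterion names, membership function, and aggregation rule below are assumptions for demonstration, not the paper's actual implementation.

```python
# Hypothetical sketch of fuzzy-criteria scoring. Instead of a binary
# verdict, each criterion (e.g. professionalism, medical relevance)
# receives a membership degree in [0, 1], and degrees are aggregated
# into an overall score. All names and formulas here are illustrative.

def triangular(x, a, b, c):
    """Triangular membership function: 0 at a and c, peaking at 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_score(criteria):
    """Aggregate per-criterion membership degrees into one score.

    criteria: dict mapping criterion name -> degree in [0, 1].
    A simple unweighted mean is used here; the paper's actual
    aggregation may differ.
    """
    if not criteria:
        return 0.0
    return sum(criteria.values()) / len(criteria)

# Example: degrees an LLM judge might assign to one student turn
# in a simulated patient conversation (values are made up).
turn = {"professionalism": 0.9, "medical_relevance": 0.7, "empathy": 0.6}
overall = fuzzy_score(turn)
```

A fine-tuned LLM judge would produce the per-criterion degrees; the fuzzy layer then turns those graded judgments into an overall evaluation instead of a hard yes/no.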

Implementation Barriers

Data Quality

The quality and diversity of annotated data directly impact model performance.

Proposed Solutions: Expand the diversity and scale of annotated datasets.

Resource Intensity

Expert annotation is resource-intensive and can lead to variability in evaluation.

Proposed Solutions: Integrate additional sources of human feedback and explore advanced alignment techniques.

Bias and Limitations

Inherent biases and limitations of the underlying LLM can constrain model performance.

Proposed Solutions: Incorporate additional fuzzy criteria and domain-specific dimensions.

Project Team

Weibing Zheng

Researcher

Laurah Turner

Researcher

Jess Kropczynski

Researcher

Murat Ozer

Researcher

Tri Nguyen

Researcher

Shane Halse

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Tri Nguyen, Shane Halse

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
