
The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues

Project Overview

The BEA 2023 Shared Task evaluated the capabilities of generative AI in simulating teacher responses within educational dialogues. Eight teams participated, employing various state-of-the-art models to generate teacher turns in teacher-student interactions. Among these, the NAISTeacher system ranked first in both automated and human evaluations, highlighting the potential of large language models (LLMs) in educational settings. However, the findings also revealed that current evaluation metrics fail to capture the pedagogical effectiveness of these AI systems, underscoring the need for assessment criteria that better reflect the instructional abilities of AI teachers. Overall, the shared task illustrates the evolving role of generative AI in education: it showcases innovative applications while exposing the complexities and limitations of integrating such systems into learning environments.

Key Applications

NAISTeacher system using GPT-3.5

Context: Educational dialogues, specifically for language learning with ESL students.

Implementation: Participated in a shared task where teams submitted AI-generated teacher responses based on dialogue contexts from the Teacher-Student Chatroom Corpus.

Outcomes: Ranked first in both automated and human evaluations, demonstrating effective pedagogical ability.

Challenges: Need for better evaluation metrics suited to educational contexts; some responses favored by automated metrics lacked teacher-like qualities.

Implementation Barriers

Evaluation Metrics

Existing automated metrics such as BERTScore and DialogRPT cannot assess the pedagogical quality of AI responses, making it difficult to evaluate AI teachers accurately.

Proposed Solutions: Develop more accurate and domain-specific automated metrics that can reward pedagogical skills.

Data Limitations

The limitations of the Teacher-Student Chatroom Corpus, such as the 100-token cap on dialogues, may hinder the training and evaluation process.

Proposed Solutions: Reconsider the data sampling methods and dialogue structure to improve data quality.
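A hard token cap forces earlier dialogue context to be dropped, which can deprive a model of information it needs to respond in a teacher-like way. The sketch below is hypothetical (whitespace tokenization, recency-first truncation); the actual TSCC sampling may differ:

```python
def cap_dialogue(turns: list[str], max_tokens: int = 100) -> list[str]:
    """Keep the most recent turns that fit under a whitespace-token cap.

    Illustrative only: shows how a hard cap silently discards earlier
    context, such as the exercise the teacher originally set.
    """
    kept: list[str] = []
    budget = max_tokens
    for turn in reversed(turns):  # walk from the most recent turn backwards
        n = len(turn.split())
        if n > budget:
            break  # this turn and everything before it is dropped
        kept.append(turn)
        budget -= n
    return list(reversed(kept))

dialogue = [
    "teacher: Today we practise the past simple.",
    "student: I go to the cinema yesterday.",
    "teacher: Almost! Which verb needs the past form?",
    "student: Go? So... I goed to the cinema?",
]
# With a tight cap, the opening turns (and the original error) vanish:
print(cap_dialogue(dialogue, max_tokens=20))
```

Under the 20-token cap only the last two turns survive, so a model generating the next teacher response never sees the lesson framing or the student's first attempt.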

Project Team

Anaïs Tack

Researcher

Ekaterina Kochmar

Researcher

Zheng Yuan

Researcher

Serge Bibauw

Researcher

Chris Piech

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Anaïs Tack, Ekaterina Kochmar, Zheng Yuan, Serge Bibauw, Chris Piech

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
