
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Project Overview

The document examines the role of generative AI, specifically large language models (LLMs), in advancing educational technologies through post-automatic speech recognition (ASR) tasks. It introduces the GenSEC challenge, which targets improvements in transcription correction, speaker tagging, and emotion recognition by leveraging the contextual understanding of LLMs to refine ASR outputs. By addressing these areas, the challenge seeks to set new benchmarks for the accuracy and effectiveness of speech processing technologies in educational settings, where improved ASR capabilities can support better learning experiences, more effective communication, and richer engagement in diverse learning environments.

Key Applications

Post-ASR Output Enhancement

Context: Improving the quality of automatic speech recognition outputs through the use of Large Language Models for tasks including transcription accuracy correction, speaker tagging enhancement, and emotion recognition from transcribed speech.

Implementation: Participants utilize N-best hypotheses for re-ranking or generative correction, submit transcripts with corrected speaker tags, and classify emotions based on ASR-transcribed speech, leveraging conversational context to enhance the ASR outputs.
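The N-best re-ranking strategy described above can be sketched as follows. Note that `score_with_llm` here is a hypothetical stand-in (a toy in-vocabulary heuristic) for the real scoring step, in which a participant would query an LLM for the log-likelihood of each hypothesis given the conversational context, or prompt it to generate a corrected transcript directly:

```python
def score_with_llm(hypothesis: str) -> float:
    """Stand-in scorer: a real system would use an LLM's log-probability
    of the hypothesis conditioned on conversational context."""
    # Toy heuristic: prefer hypotheses whose words fall in a small vocabulary.
    vocab = {"the", "speech", "recognition", "model", "improves", "accuracy"}
    words = hypothesis.lower().split()
    if not words:
        return float("-inf")
    return sum(w in vocab for w in words) / len(words)

def rerank_nbest(nbest: list) -> str:
    """Pick the hypothesis the (placeholder) scorer rates highest."""
    return max(nbest, key=score_with_llm)

nbest = [
    "the speech wreck ignition model improves accuracy",
    "the speech recognition model improves accuracy",
]
print(rerank_nbest(nbest))  # -> "the speech recognition model improves accuracy"
```

Generative correction goes a step further: instead of choosing among the N hypotheses, the LLM is prompted with all of them and asked to produce a new, corrected transcript.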

Outcomes: Initial results indicate that LLMs can significantly improve transcription accuracy, enhance speaker tagging accuracy, and produce promising results in text-based emotion recognition.

Challenges: Potential biases in LLM outputs, the variability in performance based on prompting methods, lack of robust error handling, and the need for standardization of evaluation methods for multi-speaker systems.

Implementation Barriers

Technical Barrier

Potential biases in LLMs affecting the accuracy and fairness of speech corrections.

Proposed Solutions: Need for ongoing research and evaluation methodologies to mitigate bias.

Evaluation Barrier

Lack of standardized evaluation metrics for assessing multi-speaker error correction systems.

Proposed Solutions: Establishing community standards for evaluation metrics in future challenges.
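For single-speaker transcription, the community already shares one standard metric, word error rate (WER): the word-level edit distance between reference and hypothesis, normalized by reference length. A minimal sketch of that computation is below; multi-speaker variants (such as concatenated minimum-permutation WER) extend this idea by searching over speaker assignments, which is where the standardization gap lies:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over word tokens,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("hello world how are you", "hello word how are you"))  # -> 0.2
```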

Implementation Barrier

Challenges in integrating acoustic and linguistic information effectively.

Proposed Solutions: Future evaluations should incorporate acoustic features alongside text-based methods.

Project Team

Chao-Han Huck Yang, Researcher

Taejin Park, Researcher

Yuan Gong, Researcher

Yuanchao Li, Researcher

Zhehuai Chen, Researcher

Yen-Ting Lin, Researcher

Chen Chen, Researcher

Yuchen Hu, Researcher

Kunal Dhawan, Researcher

Piotr Żelasko, Researcher

Chao Zhang, Researcher

Yun-Nung Chen, Researcher

Yu Tsao, Researcher

Jagadeesh Balam, Researcher

Boris Ginsburg, Researcher

Sabato Marco Siniscalchi, Researcher

Eng Siong Chng, Researcher

Peter Bell, Researcher

Catherine Lai, Researcher

Shinji Watanabe, Researcher

Andreas Stolcke, Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
