LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages
Project Overview
The paper examines the use of Large Language Models (LLMs), particularly GPT-4-Turbo, for data annotation in low-resource languages, with an emphasis on educational applications. It highlights the cost-effectiveness of LLM-based annotation for Named Entity Recognition (NER): compared with traditional human annotation, LLMs deliver substantial savings while maintaining relatively high accuracy. The study also notes failure modes of LLM outputs, such as token skipping and formatting errors, and proposes an active learning approach to improve the efficiency and reliability of annotation. Overall, the findings suggest that LLMs can substantially strengthen natural language processing for low-resource languages, offering a promising alternative for automated data labeling in educational settings and potentially broadening access to language resources.
Key Applications
Using LLMs for Named Entity Recognition (NER) in low-resource languages.
Context: Educational settings focused on low-resource African languages, targeting NLP researchers and practitioners.
Implementation: Integration of LLMs such as GPT-4-Turbo into an active learning framework for data annotation (a minimal sketch of the loop follows this list).
Outcomes: Near-state-of-the-art NER performance with significantly reduced data requirements, at an estimated cost at least 42.45 times lower than human annotation.
Challenges: Token skipping, formatting errors, and variability in LLM performance across languages.
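The loop below is a minimal Python sketch of this setup, assuming the openai client library; gpt-4-turbo and the temperature-0 tagging prompt follow the paper's general approach, while the acquisition function, the parser, and the model interface (confidence(), train()) are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of the LLM-in-the-loop annotation cycle described above; it is
# not the authors' code. Assumes the openai Python client (>=1.0) with
# OPENAI_API_KEY set, and an NER model object exposing confidence() and train().
from openai import OpenAI

client = OpenAI()

def query_llm_for_ner(sentence: str) -> str:
    """Ask the LLM to label each whitespace token with a CoNLL-style NER tag."""
    prompt = (
        "Label each token with one of: O, B-PER, I-PER, B-ORG, I-ORG, "
        "B-LOC, I-LOC, B-DATE, I-DATE.\n"
        "Return one tab-separated 'token<TAB>label' pair per line.\n\n"
        f"Sentence: {sentence}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def select_most_uncertain(model, pool, k):
    """Illustrative acquisition function: take the k least-confident sentences."""
    return sorted(pool, key=lambda s: model.confidence(s))[:k]

def parse_and_validate(sentences, annotations):
    """Keep only outputs whose lines align one-to-one with the input tokens."""
    labeled = []
    for sentence, output in zip(sentences, annotations):
        tokens = sentence.split()
        pairs = [line.split("\t") for line in output.strip().splitlines() if line.strip()]
        if len(pairs) == len(tokens) and all(len(p) == 2 for p in pairs):
            labeled.append((tokens, [label for _, label in pairs]))
    return labeled

def active_learning_round(model, unlabeled_pool, batch_size=50):
    """One cycle: select uncertain sentences, annotate with the LLM, retrain."""
    batch = select_most_uncertain(model, unlabeled_pool, k=batch_size)
    annotations = [query_llm_for_ner(s) for s in batch]
    model.train(parse_and_validate(batch, annotations))
    return model
```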
Implementation Barriers
Technical barrier
Token skipping and incorrect formatting in LLM outputs can lead to inaccurate annotations.
Proposed Solutions: Advanced prompt engineering and further training of LLMs to improve reliability; a simple output-validation sketch follows.
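A minimal validation sketch for these failure modes, assuming the LLM was asked to return one tab-separated token/label line per input token; check_alignment is an illustrative helper, not from the paper.

```python
# Illustrative validator for token skipping and formatting errors; not from
# the paper. Flags missing, malformed, or mismatched lines and falls back to
# the 'O' (outside) tag when a line cannot be parsed.
def check_alignment(tokens: list[str], llm_output: str):
    """Return (labels, errors) parsed from tab-separated token/label lines."""
    labels, errors = [], []
    lines = [line for line in llm_output.strip().splitlines() if line.strip()]
    if len(lines) != len(tokens):
        errors.append(f"token skipping: expected {len(tokens)} lines, got {len(lines)}")
    for i, line in enumerate(lines[: len(tokens)]):
        parts = line.split("\t")
        if len(parts) != 2:
            errors.append(f"line {i}: malformed output {line!r}")
            labels.append("O")
            continue
        token, label = parts
        if token != tokens[i]:
            errors.append(f"line {i}: token mismatch {token!r} != {tokens[i]!r}")
        labels.append(label)
    return labels, errors
```

Sentences with a non-empty error list can then be re-prompted or routed to a human reviewer rather than silently accepted.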
Resource barrier
Limited linguistic resources and expertise for data labeling in low-resource languages.
Proposed Solutions: Using LLMs to automate labeling and reduce its costs (an illustrative cost comparison follows).
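For intuition, a back-of-the-envelope comparison under purely assumed rates; none of these figures come from the paper, whose own estimate is the 42.45-fold saving cited above.

```python
# Back-of-the-envelope annotation cost comparison. All rates below are assumed
# for illustration only and are not taken from the paper.
TOKENS_PER_SENTENCE = 60        # assumed average prompt + completion size
LLM_COST_PER_1K_TOKENS = 0.02   # assumed blended API price, USD
HUMAN_COST_PER_SENTENCE = 0.05  # assumed human annotation rate, USD

def annotation_costs(n_sentences: int) -> tuple[float, float]:
    """Return (llm_cost, human_cost) in USD for annotating n_sentences."""
    llm = n_sentences * TOKENS_PER_SENTENCE / 1000 * LLM_COST_PER_1K_TOKENS
    human = n_sentences * HUMAN_COST_PER_SENTENCE
    return llm, human

llm_cost, human_cost = annotation_costs(10_000)
print(f"LLM: ${llm_cost:.2f}  human: ${human_cost:.2f}  ratio: {human_cost / llm_cost:.1f}x")
```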
Project Team
Nataliia Kholodna
Researcher
Sahib Julka
Researcher
Mohammad Khodadadi
Researcher
Muhammed Nurullah Gumus
Researcher
Michael Granitzer
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Nataliia Kholodna, Sahib Julka, Mohammad Khodadadi, Muhammed Nurullah Gumus, Michael Granitzer
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI