Multi-dimensional data refining strategy for effective fine-tuning LLMs
Project Overview
This document examines generative AI in education through the development and fine-tuning of large language models (LLMs) for Vietnamese. It addresses the challenges posed by the language's linguistic nuances and the need for high-quality training data. A key result is the VN-BLOOM-7B1 model, which generates coherent Vietnamese news articles. The work also illustrates how AI can assist in data collection and curation, and it outlines strategies for using generative AI tools to enhance educational resources, with an emphasis on language learning and accessibility. The findings suggest that generative AI both supports the development of more effective language models and can improve educational outcomes by providing tailored content and learning experiences.
Key Applications
VN-BLOOM-7B1 model for generating Vietnamese news articles
Context: Natural Language Processing for Vietnamese language, targeting researchers and practitioners in NLP
Implementation: Fine-tuning a pre-trained LLM using a dataset created from existing English datasets and Vietnamese-language sources.
Outcomes: The model generates coherent and contextually relevant news articles in Vietnamese, demonstrating high quality and human-like writing.
Challenges: Data scarcity, linguistic diversity, and the need for human curation in dataset preparation.
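The dataset preparation described above (combining translated English data with Vietnamese-language sources before human curation) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `min_chars` threshold, and the `origin` tag are all assumptions introduced here.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize text and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def merge_sources(translated, native, min_chars=20):
    """Merge two lists of text records, dropping short and duplicate entries.

    Hypothetical helper: each kept record is tagged with its origin so a
    human curator can decide what to review first.
    """
    seen = set()
    merged = []
    for origin, records in (("translated", translated), ("native", native)):
        for raw in records:
            text = normalize(raw)
            if len(text) < min_chars:
                continue  # too short to be a useful training sample
            digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # exact duplicate after normalization
            seen.add(digest)
            merged.append({"text": text, "origin": origin})
    return merged
```

Unicode normalization matters for Vietnamese because the same accented character can be encoded in composed or decomposed form, which would otherwise hide duplicates. Automated filtering of this kind only narrows the pool; the human curation step the document mentions would still follow.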
Implementation Barriers
Data Scarcity
Limited availability of high-quality Vietnamese language data for training models, along with challenges in accessing and crawling data from websites due to rate limits and structure variability.
Proposed Solutions: Translating existing English datasets, and using generative AI tools such as ChatGPT to develop adaptable data crawling scripts.
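To make the rate-limit problem concrete, here is a minimal sketch of a crawler loop that pauses between requests and backs off when a request fails. It uses only the Python standard library; the function names, delay values, and retry count are illustrative assumptions, not details taken from the paper.

```python
import time
import urllib.error
import urllib.request


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Download one page as text using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def polite_crawl(urls, fetch=fetch_url, delay=1.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL with a fixed pause and exponential backoff on failure.

    Returns a dict mapping URL to page text. URLs that fail on every
    attempt are skipped rather than aborting the whole crawl.
    """
    pages = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                pages[url] = fetch(url)
                break
            except (urllib.error.URLError, TimeoutError):
                # Back off before retrying: delay, 2*delay, 4*delay, ...
                sleep(delay * (2 ** attempt))
        sleep(delay)  # pause between URLs to stay under rate limits
    return pages
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access. Handling the structure variability mentioned above would require site-specific parsing on top of this loop, which is the part the document proposes to generate with AI assistance.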
Project Team
Thanh Nguyen Ngoc
Researcher
Quang Nhat Tran
Researcher
Arthur Tang
Researcher
Bao Nguyen
Researcher
Thuy Nguyen
Researcher
Thanh Pham
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Thanh Nguyen Ngoc, Quang Nhat Tran, Arthur Tang, Bao Nguyen, Thuy Nguyen, Thanh Pham
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI