Multi-dimensional data refining strategy for effective fine-tuning LLMs

Project Overview

The document explores the application of generative AI in education through the development and fine-tuning of large language models (LLMs) for the Vietnamese language. It addresses the challenges posed by the language's linguistic nuances and the need for high-quality training data. A key highlight is the VN-BLOOM-7B1 model, which generates coherent news articles and illustrates the potential of AI to assist in data collection and curation. The document also outlines strategies for using generative AI tools to enhance educational resources, with emphasis on language learning and accessibility. The findings suggest that generative AI both supports the development of more effective language models and can improve educational outcomes by providing tailored content and learning experiences.

Key Applications

VN-BLOOM-7B1 model for generating Vietnamese news articles

Context: Natural language processing (NLP) for the Vietnamese language, aimed at NLP researchers and practitioners.

Implementation: Fine-tuning a pre-trained LLM using a dataset created from existing English datasets and Vietnamese-language sources.

Outcomes: The model generates coherent and contextually relevant news articles in Vietnamese, demonstrating high quality and human-like writing.

Challenges: Data scarcity, linguistic diversity, and the need for human curation in dataset preparation.

Implementation Barriers

Data Scarcity

High-quality Vietnamese language data for training models is in limited supply, and accessing and crawling data from websites is hindered by rate limits and variable page structure.

Proposed Solutions: Translating existing English datasets, and using generative AI tools such as ChatGPT to develop adaptable, AI-assisted data crawling scripts.
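
To make the two crawling obstacles named above concrete (rate limits and variable page structure), here is a minimal Python sketch. It is not the authors' script: the function names, the `site_rules` mapping, the example host, and the one-second default delay are all assumptions chosen for illustration. The sketch spaces out requests by a fixed interval and keeps the per-site extraction rule (an HTML tag plus an optional class) in a small table, so a new site layout needs a new rule rather than new code.

```python
import time
import urllib.request
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical rule format: host -> (tag, css_class or None).
SiteRules = Dict[str, Tuple[str, Optional[str]]]


class TextExtractor(HTMLParser):
    """Collect the text of every element matching a tag and optional class."""

    def __init__(self, tag: str, css_class: Optional[str] = None) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks: List[str] = []
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag
            return
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Normalise whitespace before storing the chunk.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.chunks.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def fetch_with_urllib(url: str) -> str:
    """Default fetcher; any callable taking a URL and returning HTML works."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(
    urls: Iterable[str],
    site_rules: SiteRules,
    fetch: Callable[[str], str] = fetch_with_urllib,
    min_interval: float = 1.0,  # assumed delay; tune to each site's limits
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, List[str]]:
    """Fetch each URL no faster than one request per min_interval seconds."""
    results: Dict[str, List[str]] = {}
    last_request: Optional[float] = None
    for url in urls:
        rule = site_rules.get(urlparse(url).netloc)
        if rule is None:
            continue  # no extraction rule for this host, so skip it
        if last_request is not None:
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        extractor = TextExtractor(*rule)
        extractor.feed(fetch(url))
        results[url] = extractor.chunks
    return results
```

A call might look like `crawl(["https://news.example.vn/a"], {"news.example.vn": ("p", "body")})`, where the host and rule are placeholders. Passing `fetch`, `sleep`, and `clock` as parameters is a design choice that lets the crawl logic be exercised offline with stand-in functions.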

Project Team

Thanh Nguyen Ngoc

Researcher

Quang Nhat Tran

Researcher

Arthur Tang

Researcher

Bao Nguyen

Researcher

Thuy Nguyen

Researcher

Thanh Pham

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Thanh Nguyen Ngoc, Quang Nhat Tran, Arthur Tang, Bao Nguyen, Thuy Nguyen, Thanh Pham

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
