From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation
Project Overview
The document explores the role of generative AI, particularly large language models (LLMs) such as Gemma and Mistral, in enhancing educational outcomes through improved representation of low-resource languages like Ukrainian. It addresses the challenges that these languages face in the realm of AI, emphasizing the necessity for inclusivity in digital technology. The creation of specialized datasets, such as the Ukrainian Knowledge and Instruction Dataset (UKID), is underscored as vital for training language-specific models that cater to educational needs. The authors assert that these advancements in generative AI are crucial not only for educational purposes but also for cultural preservation and the reduction of language bias in AI applications. By focusing on the development and fine-tuning of these models, the document illustrates how generative AI can contribute to a more equitable and effective educational landscape, ultimately fostering a better understanding and utilization of diverse languages in the digital age.
Key Applications
Language-Specific LLM for Educational Purposes
Context: Primary and secondary education in Ukraine, specifically targeting Ukrainian language learners and educators.
Implementation: Utilizing and fine-tuning LLMs (such as Gemma and Mistral) with Ukrainian datasets, including the UKID, to generate educational materials and assist in language learning.
Outcomes: ['Enhanced understanding of Ukrainian heritage and language among students.', 'Improved linguistic proficiency and contextual understanding in Ukrainian language tasks.']
Challenges: ['Limited datasets for training, leading to inconsistencies and errors in understanding Ukrainian context.', 'Risk of linguistic identity crisis without tailored AI tools; potential loss of cultural heritage.']
Implementation Barriers
Technical and Cultural Barrier
Lack of suitable datasets for fine-tuning LLMs in underrepresented languages like Ukrainian, along with cultural erosion and bias in AI models favoring dominant languages, leading to underrepresentation of Ukrainian language and culture.
Proposed Solutions: Creation of specific datasets like UKID to cater to the linguistic needs of Ukrainian speakers, and investing in language-specific AI model development to prevent cultural bias and promote inclusivity.
Project Team
Artur Kiulian
Researcher
Anton Polishko
Researcher
Mykola Khandoga
Researcher
Oryna Chubych
Researcher
Jack Connor
Researcher
Raghav Ravishankar
Researcher
Adarsh Shirawalmath
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Artur Kiulian, Anton Polishko, Mykola Khandoga, Oryna Chubych, Jack Connor, Raghav Ravishankar, Adarsh Shirawalmath
Source Publication: View Original PaperLink opens in a new window
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: Openai