GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data
Project Overview
GemMaroc is a large language model for Moroccan Arabic (Darija), developed to reach strong language proficiency from limited data resources. The work shows that fine-tuning an LLM on a small, high-quality dataset is effective: a carefully curated set of a few thousand instructions can significantly improve performance and reasoning capabilities. GemMaroc aims to promote inclusive education and improve public services in Morocco by offering an accessible and sustainable language technology. Overall, the findings indicate that generative AI can address language learning needs in diverse linguistic contexts while keeping the cost of educational applications low.
Key Applications
GemMaroc-27B
Context: Educational applications for Moroccan Arabic speakers, aiming to improve language proficiency and support digital inclusion.
Implementation: Fine-tuned the Gemma model on a minimal dataset of reasoning-dense prompts, prioritizing efficiency and ecological sustainability.
Outcomes: Achieved a DarijaMMLU score of 61.6%, surpassing existing models while using significantly less data and energy.
Challenges: Little data is available for sentiment analysis and summarization tasks, and the training set relies on machine-translated Darija samples that received only limited verification.
Implementation Barriers
Data Availability
Limited availability of high-quality data for low-resource dialects like Darija hampers effective model training.
Proposed Solutions: Utilizing minimal-data alignment strategies and careful dataset curation to maximize the impact of smaller datasets.
Technical Infrastructure
The computational demands of training large language models can be prohibitive, especially in low-resource settings.
Proposed Solutions: Implementing efficient training techniques like LoRA adapters to reduce resource consumption and carbon footprint.
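To give a sense of why LoRA adapters reduce resource consumption, the following minimal sketch compares the number of trainable weights for a single linear layer under full fine-tuning and under a low-rank adapter. It illustrates the general LoRA idea (the base weight matrix is frozen and only two small matrices are trained); the layer size and rank are hypothetical values chosen for illustration and are not taken from the GemMaroc paper.

```python
# Illustrative trainable-parameter count for a LoRA adapter on one linear layer.
# All sizes below are hypothetical; they are NOT the GemMaroc configuration.


def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix W is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for the low-rank update B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    The base matrix W stays frozen and contributes no trainable weights.
    """
    return rank * d_in + d_out * rank


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the layer's weights that LoRA actually trains."""
    return lora_params(d_in, d_out, rank) / full_finetune_params(d_in, d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # hypothetical hidden size
    rank = 16            # hypothetical adapter rank

    print("full fine-tuning:", full_finetune_params(d_in, d_out))
    print("LoRA adapter:    ", lora_params(d_in, d_out, rank))
    print("fraction trained:", trainable_fraction(d_in, d_out, rank))
```

With these assumed sizes the adapter trains under 1% of the layer's weights, so fewer gradients and optimizer states have to be held in memory during training.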
Project Team
Abderrahman Skiredj
Researcher
Ferdaous Azhari
Researcher
Houdaifa Atou
Researcher
Nouamane Tazi
Researcher
Ismail Berrada
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Abderrahman Skiredj, Ferdaous Azhari, Houdaifa Atou, Nouamane Tazi, Ismail Berrada
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI