
GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data

Project Overview

GemMaroc is a large language model for Moroccan Arabic (Darija), built to reach strong Darija proficiency from limited data. The work shows that fine-tuning an LLM on a small, high-quality dataset is effective: a carefully curated set of a few thousand instructions can significantly improve performance and reasoning capabilities. GemMaroc aims to promote inclusive education and improve public services in Morocco by offering an accessible and sustainable language technology. Overall, the findings indicate that generative AI can serve low-resource language communities in a cost-efficient way.

Key Applications

GemMaroc-27B

Context: Educational applications for Moroccan Arabic speakers, aiming to improve language proficiency and support digital inclusion.

Implementation: Fine-tuned the Gemma model on a minimal dataset of reasoning-dense prompts, with a focus on efficiency and ecological sustainability.

Outcomes: Achieved a DarijaMMLU score of 61.6%, surpassing existing models while using significantly less data and energy.

Challenges: Limited data availability for sentiment analysis and summarization tasks, and reliance on machine-translated Darija samples with limited verification.

Implementation Barriers

Data Availability

Limited availability of high-quality data for low-resource dialects like Darija hampers effective model training.

Proposed Solutions: Utilizing minimal-data alignment strategies and careful dataset curation to maximize the impact of smaller datasets.
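
As a rough illustration of what careful curation of a small instruction set can involve, the sketch below normalizes whitespace, drops duplicate instructions, and filters out responses that are too short or too long. The function name `curate`, the field names `instruction` and `response`, and the length thresholds are all hypothetical choices made for this example; the source does not describe the authors' actual curation pipeline.

```python
def curate(samples, min_chars=20, max_chars=2000):
    """Return a cleaned subset of instruction samples.

    Assumption: each sample is a dict with 'instruction' and 'response'
    strings. The thresholds are illustrative, not taken from the paper.
    """
    seen = set()
    kept = []
    for sample in samples:
        # Collapse runs of whitespace so near-identical samples compare equal.
        instruction = " ".join(sample["instruction"].split())
        response = " ".join(sample["response"].split())

        # Skip instructions already kept (case-insensitive comparison).
        key = instruction.casefold()
        if key in seen:
            continue

        # Skip responses outside the accepted length range.
        if not (min_chars <= len(response) <= max_chars):
            continue

        seen.add(key)
        kept.append({"instruction": instruction, "response": response})
    return kept
```

A real pipeline would likely add language identification and manual review on top of filters like these, particularly for machine-translated samples.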

Technical Infrastructure

The computational cost of training large language models can be prohibitive, especially in low-resource settings.

Proposed Solutions: Implementing efficient training techniques like LoRA adapters to reduce resource consumption and carbon footprint.
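
To make the resource saving from LoRA adapters concrete, the sketch below shows the general LoRA idea in plain Python: the pretrained weight matrix W stays frozen and only a low-rank update B·A is trained, so the output is W·x + (alpha / rank)·B·(A·x). The helper names, the layer dimensions, and the rank are illustrative assumptions, not the configuration used for GemMaroc.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA adapter on one layer."""
    full = d_in * d_out            # every entry of W is trainable
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x) with W frozen."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes only: a 4096 x 4096 projection with rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

With B initialized to zeros, as is conventional for LoRA, the adapted layer starts out identical to the pretrained one, and training only moves the small A and B matrices.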

Project Team

Abderrahman Skiredj

Researcher

Ferdaous Azhari

Researcher

Houdaifa Atou

Researcher

Nouamane Tazi

Researcher

Ismail Berrada

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Abderrahman Skiredj, Ferdaous Azhari, Houdaifa Atou, Nouamane Tazi, Ismail Berrada

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
