Optical Character Recognition and Transcription of Berber Signs from Images in a Low-Resource Language Amazigh

Project Overview

The paper explores the use of generative AI in education through DaToBS, an approach to Optical Character Recognition (OCR) and transcription designed for the Tifinagh script of the Berber language. It addresses the difficulty that low-resource languages such as Amazigh face in accessing modern educational technologies. DaToBS achieves over 92% accuracy in recognizing Berber characters, which makes it promising for educational applications such as language acquisition software. The research argues for better representation and accessibility of low-resource languages in technology, with the aim of improving educational outcomes and supporting the preservation and promotion of linguistic diversity.

Key Applications

DaToBS: Detection and Transcription of Berber Signs

Context: Language acquisition and technology training for the Berber-speaking community, including children and travelers.

Implementation: Built an in-house corpus of 1,862 character images from natural scene photographs, annotated for training a CNN model (VGG-16).
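
To give a sense of what a VGG-16 classifier involves, the sketch below traces the standard VGG-16 layout (3x3 convolutions with padding 1, 2x2 max-pooling, two 4096-unit dense layers) and counts its parameters for a given number of output classes. It is an illustration under stated assumptions, not the authors' code: the 224x224 RGB input size and the class count `NUM_CLASSES` are placeholders, and the summary does not say how the network was configured or trained.

```python
# Standard VGG-16 layout: integers are conv output channels, "M" is a 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_summary(num_classes, input_size=224, in_channels=3):
    """Return (final feature-map side, flattened feature count, total parameters)."""
    size, channels, params = input_size, in_channels, 0
    for layer in VGG16_CFG:
        if layer == "M":
            size //= 2  # 2x2 max-pool with stride 2 halves height and width
        else:
            # A 3x3 conv with padding 1 keeps the spatial size; count weights + biases.
            params += 3 * 3 * channels * layer + layer
            channels = layer
    flat = size * size * channels
    # Classifier head: two 4096-unit dense layers, then one output per class.
    for n_in, n_out in [(flat, 4096), (4096, 4096), (4096, num_classes)]:
        params += n_in * n_out + n_out
    return size, flat, params


# Assumption: placeholder class count; set it to the number of character
# classes annotated in the corpus.
NUM_CLASSES = 33

side, flat, total = vgg16_summary(NUM_CLASSES)
print(f"final feature map: {side}x{side}, flattened features: {flat}")
print(f"total parameters: {total:,}")
```

Most of the parameters sit in the first dense layer, which is one reason a corpus of under two thousand images is usually paired with pretrained weights or heavy regularization; whether DaToBS did so is not stated here.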

Outcomes: Achieved over 92% accuracy in recognizing and transcribing Tifinagh characters from images, enabling potential development of educational apps and resources for Berber language learners.
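
The accuracy figure above is a per-character measure: the share of character images whose predicted class matches the annotation. A minimal sketch of that calculation, together with decoding class indices into Tifinagh text, might look like the following. The class list, the toy labels, and the function names are hypothetical and chosen for illustration; they are not the paper's data or results.

```python
# Hypothetical class list: index i of the classifier output maps to CLASSES[i].
# The four code points come from the Unicode Tifinagh block (U+2D30..U+2D7F).
CLASSES = ["\u2D30", "\u2D31", "\u2D4E", "\u2D63"]


def transcribe(class_indices):
    """Turn a sequence of predicted class indices into a Tifinagh string."""
    return "".join(CLASSES[i] for i in class_indices)


def accuracy(true_indices, predicted_indices):
    """Fraction of predictions that match the annotated class."""
    if len(true_indices) != len(predicted_indices):
        raise ValueError("label and prediction lists must have the same length")
    if not true_indices:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_indices, predicted_indices))
    return correct / len(true_indices)


# Toy annotations and predictions (illustrative only).
y_true = [0, 2, 0, 3]
y_pred = [0, 2, 1, 3]

print(transcribe(y_pred))
print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
```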

Challenges: Handling how character presentation in natural scenes differs from handwriting, along with the scarcity of existing datasets for Berber.

Implementation Barriers

Resource Limitation

Low-resource languages like Amazigh lack adequate datasets and technological resources compared to high-resource languages.

Proposed Solutions: Creating self-generated corpora and collaborating with local communities to gather more data.

Technical Challenges

Variability in character representation in real-world images complicates OCR accuracy.

Proposed Solutions: Utilizing advanced deep learning models like CNNs specifically adapted for the unique characteristics of the Tifinagh script.

Project Team

Levi Corallo

Researcher

Aparna S. Varde

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Levi Corallo, Aparna S. Varde

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
