Leveraging Retrieval-Augmented Generation for Persian University Knowledge Retrieval

Project Overview

This study applies Retrieval-Augmented Generation (RAG) pipelines built on Large Language Models (LLMs) to information retrieval and question answering in education, with a focus on Persian-language data. The authors developed a dataset, UniversityQuestionBench (UQB), and used it to evaluate these systems; the evaluation showed notable gains in the precision and relevance of responses to university-related inquiries. The study highlights the difficulty LLMs have with localized data and shows how RAG mitigates it. The findings indicate that generative AI can improve user experience and access to information in educational settings, supporting more tailored learning support systems.

Key Applications

Retrieval-Augmented Generation (RAG) pipeline using Persian Large Language Models (PLMs)

Context: University-related question answering for students at the University of Isfahan

Implementation: A two-stage RAG approach was developed and combined with a Persian Large Language Model and advanced prompt engineering techniques. Data was extracted from the university website, and student surveys were conducted.

Outcomes: Improved precision and relevance of responses, enhanced user experience, and reduced time for obtaining answers.

Challenges: Integrating retrieval mechanisms with generation models is complex, and the approach requires extensive fine-tuning on specific datasets.
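
The summary does not say what the two stages of the RAG approach are. One common reading is a cheap first-pass retrieval followed by a finer reranking step, and the sketch below illustrates that reading only. It is not the authors' implementation: the function names, the token-overlap and cosine scoring, and the English placeholder passages are all assumptions made for illustration, and a real Persian pipeline would need proper text normalization and a stronger retriever.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenization (assumption); Persian text would need
    # proper normalization and tokenization instead.
    return text.lower().split()


def coarse_retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """Stage 1 (hypothetical): keep up to k passages that share tokens with the query."""
    query_tokens = set(tokenize(query))
    scored = [(len(query_tokens & set(tokenize(p))), i) for i, p in enumerate(passages)]
    # Highest overlap first; ties keep the original passage order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [passages[i] for score, i in scored[:k] if score > 0]


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank(query: str, candidates: list[str], top_n: int = 1) -> list[str]:
    """Stage 2 (hypothetical): reorder the candidates by cosine similarity to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(candidates, key=lambda p: cosine(query_vec, Counter(tokenize(p))), reverse=True)
    return ranked[:top_n]


def two_stage_retrieve(query: str, passages: list[str], k: int = 3, top_n: int = 1) -> list[str]:
    # Run the coarse filter first, then rerank only the surviving candidates.
    return rerank(query, coarse_retrieve(query, passages, k), top_n)
```

Splitting retrieval this way keeps the expensive scoring step limited to a small candidate set, which is the usual motivation for a two-stage design. The passages returned by `two_stage_retrieve` would then be passed to the language model as context.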

Implementation Barriers

Technical Barrier

Difficulty in integrating retrieval mechanisms with generation models and ensuring scalability.

Proposed Solutions: Continued innovation in RAG frameworks and methodologies to optimize retrieval and generation processes.

Data Barrier

LLMs often struggle to access and utilize localized data effectively, leading to inaccuracies.

Proposed Solutions: Development of domain-specific datasets like UQB and the use of advanced prompt engineering techniques.
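
The summary does not describe the prompt engineering techniques used. As a minimal sketch of the general idea, the function below places retrieved passages into a prompt so that the model answers from local data rather than from its own memory. The function name `build_prompt` and the template wording are hypothetical and are not taken from the paper.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a grounded prompt from a question and retrieved passages (illustrative template)."""
    # Number each passage so the model's answer can be traced back to a source.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Instructing the model to rely only on the supplied context, and to say when the context is insufficient, is one simple way to reduce inaccuracies on localized data.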

Project Team

Arshia Hemmat

Researcher

Kianoosh Vadaei

Researcher

Mohammad Hassan Heydari

Researcher

Afsaneh Fatemi

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Arshia Hemmat, Kianoosh Vadaei, Mohammad Hassan Heydari, Afsaneh Fatemi

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
