Leveraging Retrieval-Augmented Generation for Persian University Knowledge Retrieval
Project Overview
This study applies Retrieval-Augmented Generation (RAG) pipelines built on Large Language Models (LLMs) to information retrieval and question answering in education, with a focus on Persian-language data. The authors built a dataset, UniversityQuestionBench (UQB), and used it to evaluate these systems; the evaluation showed improved precision and relevance in responses to university-related inquiries. The study highlights the difficulty LLMs have with localized data and shows that RAG mitigates it. The findings indicate that generative AI can improve user experience and access to information in educational settings, supporting more tailored learning tools.
Key Applications
Retrieval-Augmented Generation (RAG) pipeline using Persian Large Language Models (PLMs)
Context: University-related question answering for students at the University of Isfahan
Implementation: Development of a two-stage RAG approach combined with a Persian Large Language Model and advanced prompt engineering techniques. Data was extracted from the university website and student surveys were conducted.
Outcomes: Improved precision and relevance of responses, enhanced user experience, and reduced time for obtaining answers.
Challenges: Integrating retrieval mechanisms with generation models is complex, and the approach requires extensive fine-tuning on specific datasets.
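The summary does not say what the two stages of the RAG approach are. As a rough illustration only, the sketch below assumes a common arrangement: a cheap first stage that shortlists passages by token overlap, a second stage that re-ranks the shortlist by cosine similarity, and a prompt template that places the retrieved context ahead of the question. The function names, the scoring choices, the sample passages, and the prompt wording are all hypothetical and are not taken from the paper; a real system would pass the prompt to a Persian language model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # \w matches Unicode letters and digits, so Persian script is tokenized too.
    return re.findall(r"\w+", text.lower())


def overlap_score(query, passage):
    """Stage 1 (assumed): fraction of distinct query tokens found in the passage."""
    q = set(tokenize(query))
    p = set(tokenize(passage))
    return len(q & p) / len(q) if q else 0.0


def cosine_score(query, passage):
    """Stage 2 (assumed): cosine similarity of term-frequency vectors."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    dot = sum(q[t] * p[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, coarse_k=3, final_k=1):
    # Shortlist with the cheap score, then re-rank the shortlist.
    coarse = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    coarse = coarse[:coarse_k]
    return sorted(coarse, key=lambda p: cosine_score(query, p), reverse=True)[:final_k]


def build_prompt(query, contexts):
    # Hypothetical prompt template; the paper's actual prompts are not shown here.
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}\nAnswer:"
    )


# Invented sample passages, standing in for text extracted from a university website.
PASSAGES = [
    "The library is open from 8 to 20 on weekdays.",
    "Course registration opens in the first week of September.",
    "The dormitory office handles housing requests.",
]

query = "When does course registration open?"
top = retrieve(query, PASSAGES)
prompt = build_prompt(query, top)
print(prompt)
```

Splitting retrieval into a cheap shortlist and a more careful re-rank keeps the expensive scoring confined to a few candidates, which is one reason two-stage designs are popular; whether the authors' pipeline follows this exact split is not stated in this summary.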
Implementation Barriers
Technical Barrier
Difficulty in integrating retrieval mechanisms with generation models and ensuring scalability.
Proposed Solutions: Continued innovation in RAG frameworks and methodologies to optimize retrieval and generation processes.
Data Barrier
LLMs often struggle to access and utilize localized data effectively, leading to inaccuracies.
Proposed Solutions: Development of domain-specific datasets like UQB and the use of advanced prompt engineering techniques.
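To show how a domain-specific benchmark such as UQB could be used to score a question-answering system, the sketch below pairs each question with a reference answer and averages a token-level F1 score over the set. The record layout, the `BenchItem` and `evaluate` names, the sample entry, and the choice of token F1 are assumptions made for illustration; the summary does not describe UQB's actual schema or the metrics the authors used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BenchItem:
    # Hypothetical record layout: one question with one reference answer.
    question: str
    reference: str


def token_f1(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(items, answer_fn):
    """Average token F1 of answer_fn over all benchmark items."""
    scores = [token_f1(answer_fn(item.question), item.reference) for item in items]
    return sum(scores) / len(scores) if scores else 0.0


# Invented example entry; not drawn from UQB.
items = [BenchItem("When does registration open?", "first week of september")]
mean_f1 = evaluate(items, lambda question: "in september")
print(round(mean_f1, 3))
```

Here `answer_fn` stands in for the full pipeline (retrieval plus generation), so the same loop can compare a plain LLM against a RAG-backed one on identical questions.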
Project Team
Arshia Hemmat
Researcher
Kianoosh Vadaei
Researcher
Mohammad Hassan Heydari
Researcher
Afsaneh Fatemi
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Arshia Hemmat, Kianoosh Vadaei, Mohammad Hassan Heydari, Afsaneh Fatemi
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI