An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education
Project Overview
This document examines an application of generative AI in education: two open-source embedding models designed to improve semantic retrieval, in particular question answering over course syllabi. The models were fine-tuned on a synthetic dataset so that they can handle the complexities of academic language, and they achieve better precision and recall in educational contexts than existing baseline models. The findings indicate that AI systems need to be adapted to the specific needs of academic environments, and that AI-driven tools can improve access to educational content. Overall, the research advocates integrating such models into educational tools to support more effective learning and information retrieval.
Key Applications
Open-source embedding models for semantic retrieval in educational contexts
Context: Higher education institutions, specifically for course syllabi and academic question answering
Implementation: Developed two models using open-source architectures and fine-tuned them with a synthetic dataset tailored to academic discourse
Outcomes: Both models demonstrated superior performance over baseline models, improving semantic precision in educational retrieval tasks.
Challenges: General-purpose models struggle with academic semantics unless they are specifically adapted. Existing proprietary models may lack transparency and introduce vendor lock-in.
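To make the retrieval task and its metrics concrete, the following is a minimal sketch of embedding-based retrieval scored with precision and recall at k. It is not the authors' implementation. The three-dimensional vectors, the passage ids, and the function names are all hypothetical stand-ins for the output of an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, passage_vecs, k):
    """Return the ids of the k passages most similar to the query."""
    ranked = sorted(
        passage_vecs,
        key=lambda pid: cosine(query_vec, passage_vecs[pid]),
        reverse=True,
    )
    return ranked[:k]


def precision_recall_at_k(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    hits = sum(1 for pid in retrieved if pid in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical embeddings of syllabus passages (toy values, not model output).
passages = {
    "office_hours": [0.9, 0.1, 0.0],
    "grading_policy": [0.1, 0.9, 0.1],
    "instructor_contact": [0.8, 0.2, 0.1],
    "textbook": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the question "When can I meet the professor?"
query = [1.0, 0.1, 0.0]
relevant = {"office_hours", "instructor_contact"}

top2 = retrieve(query, passages, k=2)
precision, recall = precision_recall_at_k(top2, relevant)
print(top2, precision, recall)
```

With these toy vectors, both relevant passages rank first, so precision and recall at k=2 are both 1.0. Raising k to 3 pulls in an irrelevant passage and lowers precision to 2/3 while recall stays at 1.0.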
Implementation Barriers
Technical Barrier
Existing general-purpose embedding models do not effectively capture academic semantics due to their training on heterogeneous data. This limitation necessitates the development of domain-specific embeddings.
Proposed Solutions: Develop domain-specific embeddings by fine-tuning on synthetic datasets that reflect the nuances of academic language.
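The title refers to a dual-loss model, but this summary does not say which two losses are combined. As an illustration only, the sketch below assumes one plausible pairing: an in-batch contrastive ranking term plus a cosine-similarity regression term, summed with a hypothetical `weight`. The function names, the `scale` value, and the toy vectors are assumptions, and the code only computes loss values; it does not train a model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(anchors, positives, scale=20.0):
    """In-batch contrastive loss (assumed form).

    Each anchor should score its own positive above the positives that
    belong to the other pairs in the batch. `scale` is an assumed
    temperature-like multiplier on the cosine scores.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        # Cross-entropy with the matching positive as the correct class.
        total += log_sum_exp - scores[i]
    return total / len(anchors)


def cosine_regression_loss(pairs, labels):
    """Mean squared error between cosine similarity and a target score."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


def dual_loss(anchors, positives, pairs, labels, weight=1.0):
    """Weighted sum of the two terms; `weight` is a hypothetical knob."""
    return ranking_loss(anchors, positives) + weight * cosine_regression_loss(pairs, labels)


# Toy two-dimensional "embeddings" for two question/passage pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]

# Toy labeled pairs: one identical pair (target 1.0), one orthogonal pair (target 0.0).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.0]

print(dual_loss(anchors, positives, pairs, labels, weight=0.5))
```

The intuition behind such a pairing is that the ranking term pushes the correct passage above competing ones, while the regression term keeps the similarity scores themselves close to labeled targets. Whether the paper uses these exact terms or weights is not stated here.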
Institutional Barrier
Proprietary models present challenges related to cost, transparency, and data governance, limiting their adoption in public education. This creates a need for affordable and transparent alternatives.
Proposed Solutions: Advocate for open-source alternatives that provide transparency and lower costs for educational institutions.
Project Team
Ramteja Sajja, Researcher
Yusuf Sermet, Researcher
Ibrahim Demir, Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Ramteja Sajja, Yusuf Sermet, Ibrahim Demir
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI