
An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education

Project Overview

This document examines generative AI in education through two open-source embedding models built for semantic retrieval in educational settings, in particular question answering over course syllabi. The models were fine-tuned on a synthetic dataset to handle academic language, and they report higher precision and recall than existing models on educational retrieval tasks. The findings indicate that AI systems need to be adapted to the specific needs of academic environments, and that AI-driven tools can improve access to educational content. The research advocates integrating such technologies into education to support more effective learning and information retrieval.

Key Applications

Open-source embedding models for semantic retrieval in educational contexts

Context: Higher education institutions, specifically for course syllabi and academic question answering

Implementation: Developed two models using open-source architectures and fine-tuned them with a synthetic dataset tailored to academic discourse

Outcomes: Both models demonstrated superior performance over baseline models, improving semantic precision in educational retrieval tasks.

Challenges: General-purpose models struggle with academic semantics without targeted adaptations. Existing models may lack transparency and introduce vendor lock-in.
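To make the retrieval task above concrete, the following is a minimal sketch of embedding-based retrieval scored with precision and recall at k. It is not the authors' code: the three-dimensional vectors are hand-written stand-ins for real model embeddings, and the passage names and the helper functions `retrieve` and `precision_recall_at_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, passage_vecs, k):
    """Return the ids of the k passages most similar to the query."""
    ranked = sorted(
        passage_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [passage_id for passage_id, _ in ranked[:k]]

def precision_recall_at_k(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    hits = sum(1 for passage_id in retrieved if passage_id in relevant)
    return hits / len(retrieved), hits / len(relevant)

# Assumption: toy vectors standing in for embeddings of syllabus passages.
passages = {
    "grading_policy": [0.9, 0.1, 0.0],
    "late_work": [0.7, 0.3, 0.1],
    "office_hours": [0.0, 0.2, 0.9],
}
# Assumption: toy embedding of a question such as "How is the course graded?"
query = [0.8, 0.2, 0.0]

top2 = retrieve(query, passages, k=2)
precision, recall = precision_recall_at_k(top2, relevant={"grading_policy"})
print(top2, precision, recall)
```

In a real system the vectors would come from an embedding model, and a domain-adapted model would be expected to rank the relevant syllabus passage higher than a general-purpose one.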

Implementation Barriers

Technical Barrier

Existing general-purpose embedding models do not effectively capture academic semantics due to their training on heterogeneous data. This limitation necessitates the development of domain-specific embeddings.

Proposed Solutions: Develop domain-specific embeddings through fine-tuning and leveraging synthetic datasets that reflect academic language nuances.
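The title refers to a dual-loss model, but this page does not say which two losses are combined. As an illustration only, the sketch below assumes one common pairing: an in-batch contrastive loss (each question should match its own passage better than the other passages in the batch) plus a cosine-similarity regression loss against a target score. The function names, the `scale` and `weight` parameters, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(queries, passages, scale=20.0):
    """Cross-entropy over scaled similarities; passage i is the positive for query i."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)

def cosine_regression_loss(queries, passages, targets):
    """Mean squared error between pairwise cosine similarity and a target score."""
    errors = [(cosine(q, p) - t) ** 2 for q, p, t in zip(queries, passages, targets)]
    return sum(errors) / len(errors)

def dual_loss(queries, passages, targets, weight=1.0):
    """Assumed combination: contrastive term plus a weighted regression term."""
    contrastive = in_batch_contrastive_loss(queries, passages)
    regression = cosine_regression_loss(queries, passages, targets)
    return contrastive + weight * regression

# Assumption: toy 2-d embeddings for two synthetic question-passage pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.1, 0.9]]
targets = [1.0, 1.0]  # each pair is labeled as a full match

aligned = dual_loss(queries, passages, targets)
mismatched = dual_loss(queries, passages[::-1], targets)
print(aligned < mismatched)
```

During fine-tuning, an optimizer would adjust the embedding model to lower this value on synthetic question-passage pairs; the sketch only computes the loss for fixed vectors.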

Institutional Barrier

Proprietary models present challenges related to cost, transparency, and data governance, limiting their adoption in public education. This creates a need for affordable and transparent alternatives.

Proposed Solutions: Advocate for open-source alternatives that provide transparency and lower costs for educational institutions.

Project Team

Ramteja Sajja

Researcher

Yusuf Sermet

Researcher

Ibrahim Demir

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Ramteja Sajja, Yusuf Sermet, Ibrahim Demir

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
