Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
Project Overview
The document explores the application of generative AI in education, focusing on the construction of Knowledge Base Question Answering (KBQA), Machine Reading Comprehension (MRC), and Information Retrieval (IR) datasets for low-resource languages, exemplified by the development of the PUGG dataset for Polish. By employing Large Language Models (LLMs), the approach substantially reduces the human workload of dataset construction while enabling the generation of both natural and template-based questions. The findings underscore the challenges of building effective datasets for languages with limited resources and highlight the role of knowledge graphs in improving the accuracy and relevance of answers produced by question-answering systems. The document ultimately demonstrates how generative AI can streamline educational resource development, making advanced AI applications more accessible and effective across diverse linguistic contexts.
Key Applications
PUGG dataset for KBQA, MRC, and IR tasks
Context: Educational setting focused on low-resource languages; the target audience includes NLP researchers and developers.
Implementation: A semi-automated pipeline was created to generate datasets from existing QA datasets and Wikipedia, with verification by human annotators; a minimal sketch of such a pipeline follows this list.
Outcomes: Creation of the first Polish KBQA dataset, substantial reduction in human annotation workload, and establishment of benchmarks for future research.
Challenges: Limited availability of pre-trained models for low-resource languages, ensuring the naturalness of generated questions, and the need for human verification to maintain quality.
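As an illustration, here is a minimal sketch of such a semi-automated pipeline. It mirrors the description above (an automatic LLM paraphrasing step, an automatic Wikipedia linking step, and a human verification step), but the helper callables paraphrase_with_llm, link_to_wikipedia, and verify_example are placeholders, not the authors' actual components.

# Minimal sketch of a semi-automated dataset construction pipeline.
# The helper callables passed in are placeholders (assumptions), not PUGG code.
from dataclasses import dataclass, field

@dataclass
class Example:
    raw_question: str                  # question drawn from an existing QA dataset
    natural_question: str = ""         # LLM-paraphrased, more conversational form
    passage: str = ""                  # supporting Wikipedia passage (MRC/IR)
    entities: list = field(default_factory=list)  # linked topic entities (KBQA)
    verified: bool = False             # set during human verification

def build_dataset(seed_questions, paraphrase_with_llm, link_to_wikipedia, verify_example):
    """Run the automatic steps, then keep only human-verified examples."""
    examples = []
    for question in seed_questions:
        ex = Example(raw_question=question)
        ex.natural_question = paraphrase_with_llm(question)               # automatic step
        ex.entities, ex.passage = link_to_wikipedia(ex.natural_question)  # automatic step
        ex.verified = verify_example(ex)                                  # human-in-the-loop step
        examples.append(ex)
    return [ex for ex in examples if ex.verified]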
Implementation Barriers
Technical Barrier
Lack of robust tools or models for entity linking in low-resource languages such as Polish, which makes it difficult to accurately identify topic entities.
Proposed Solutions: Development of a heuristic method tailored to the entity-linking requirements, using the Wikipedia search engine to identify candidate entities (sketched below).
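One possible realization of such a heuristic, assuming candidate entities are retrieved through the public MediaWiki search API of Polish Wikipedia; the function and parameter names are illustrative and not taken from the PUGG codebase:

# Illustrative entity candidate retrieval via the public MediaWiki "opensearch" API.
import requests

WIKI_API = "https://pl.wikipedia.org/w/api.php"

def candidate_entities(mention: str, limit: int = 5) -> list:
    """Return Wikipedia page candidates for a textual mention."""
    params = {
        "action": "opensearch",   # simple keyword/prefix search
        "search": mention,
        "limit": limit,
        "format": "json",
    }
    resp = requests.get(WIKI_API, params=params, timeout=10)
    resp.raise_for_status()
    _, titles, _, urls = resp.json()
    # Each candidate pairs a page title with its URL; a later heuristic step
    # (e.g. string similarity or popularity ranking) would pick the topic entity.
    return [{"title": t, "url": u} for t, u in zip(titles, urls)]

if __name__ == "__main__":
    for cand in candidate_entities("Mikołaj Kopernik"):
        print(cand["title"], "->", cand["url"])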
Resource Barrier
Limited availability of datasets and pre-trained models for low-resource languages, which hampers the development of NLP applications.
Proposed Solutions: Utilizing modern tools such as LLMs to assist in data annotation and dataset construction (see the sketch below).
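A hedged example of LLM-assisted annotation, here paraphrasing a template-based question into a more natural Polish question via the OpenAI chat completions API; the prompt wording, model name, and example question are assumptions, not details reported in the paper:

# Illustrative LLM-assisted annotation step: rewriting a template question naturally.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def naturalize(template_question: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to rewrite a rigid template question as natural Polish."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Przeformułuj pytanie tak, aby brzmiało naturalnie po polsku. "
                        "Zachowaj jego znaczenie."},
            {"role": "user", "content": template_question},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Example call (hypothetical template question):
# naturalize("Kto jest autorem [Pan Tadeusz]?")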
Project Team
Albert Sawczyn
Researcher
Katsiaryna Viarenich
Researcher
Konrad Wojtasik
Researcher
Aleksandra Domogała
Researcher
Marcin Oleksy
Researcher
Maciej Piasecki
Researcher
Tomasz Kajdanowicz
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI