Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction
Project Overview
The document explores the application of generative AI in education, focusing on the construction of Knowledge Base Question Answering (KBQA), Machine Reading Comprehension (MRC), and Information Retrieval (IR) datasets for low-resource languages, exemplified by the development of the PUGG dataset for Polish. By employing Large Language Models (LLMs), the approach substantially reduces the human workload of dataset construction while enabling the generation of both natural and template-based questions. The findings underscore the challenges of building effective datasets for languages with limited resources and highlight the role of knowledge graphs in improving the accuracy and relevance of answers produced by question-answering systems. The document ultimately demonstrates how generative AI can streamline educational resource development, making advanced AI applications more accessible and effective across diverse linguistic contexts.
Key Applications
PUGG dataset for KBQA, MRC, and IR tasks
Context: Educational setting focused on low-resource languages; the target audience includes NLP researchers and developers.
Implementation: A semi-automated pipeline was created to generate datasets from existing QA datasets and Wikipedia, with verification by human annotators; a minimal sketch of such a pipeline follows this list.
Outcomes: Creation of the first Polish KBQA dataset, substantial reduction in human annotation workload, and establishment of benchmarks for future research.
Challenges: Limited availability of pre-trained models for low-resource languages, ensuring the naturalness of generated questions, and the need for human verification to maintain quality.
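As an illustration, here is a minimal sketch of such a semi-automated pipeline. It mirrors the description above (an automatic LLM paraphrasing step, an automatic Wikipedia linking step, and a human verification step), but the helper callables paraphrase_with_llm, link_to_wikipedia, and verify_example are placeholders, not the authors' actual components.

# Minimal sketch of a semi-automated dataset construction pipeline.
# The helper callables passed in are placeholders (assumptions), not PUGG code.
from dataclasses import dataclass, field

@dataclass
class Example:
    raw_question: str                  # question drawn from an existing QA dataset
    natural_question: str = ""         # LLM-paraphrased, more conversational form
    passage: str = ""                  # supporting Wikipedia passage (MRC/IR)
    entities: list = field(default_factory=list)  # linked topic entities (KBQA)
    verified: bool = False             # set during human verification

def build_dataset(seed_questions, paraphrase_with_llm, link_to_wikipedia, verify_example):
    """Run the automatic steps, then keep only human-verified examples."""
    examples = []
    for question in seed_questions:
        ex = Example(raw_question=question)
        ex.natural_question = paraphrase_with_llm(question)               # automatic step
        ex.entities, ex.passage = link_to_wikipedia(ex.natural_question)  # automatic step
        ex.verified = verify_example(ex)                                  # human-in-the-loop step
        examples.append(ex)
    return [ex for ex in examples if ex.verified]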
Implementation Barriers
Technical Barrier
Lack of robust tools or models for entity linking in low-resource languages such as Polish, which makes it difficult to accurately identify topic entities.
Proposed Solutions: Development of a heuristic method tailored to the entity-linking requirements, using the Wikipedia search engine to identify candidate entities (sketched below).
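One possible realization of such a heuristic, assuming candidate entities are retrieved through the public MediaWiki search API of Polish Wikipedia; the function and parameter names are illustrative and not taken from the PUGG codebase:

# Illustrative entity candidate retrieval via the public MediaWiki "opensearch" API.
import requests

WIKI_API = "https://pl.wikipedia.org/w/api.php"

def candidate_entities(mention: str, limit: int = 5) -> list:
    """Return Wikipedia page candidates for a textual mention."""
    params = {
        "action": "opensearch",   # simple keyword/prefix search
        "search": mention,
        "limit": limit,
        "format": "json",
    }
    resp = requests.get(WIKI_API, params=params, timeout=10)
    resp.raise_for_status()
    _, titles, _, urls = resp.json()
    # Each candidate pairs a page title with its URL; a later heuristic step
    # (e.g. string similarity or popularity ranking) would pick the topic entity.
    return [{"title": t, "url": u} for t, u in zip(titles, urls)]

if __name__ == "__main__":
    for cand in candidate_entities("Mikołaj Kopernik"):
        print(cand["title"], "->", cand["url"])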
Resource Barrier
Limited availability of datasets and pre-trained models for low-resource languages, which hampers the development of NLP applications.
Proposed Solutions: Utilizing modern tools such as LLMs to assist in data annotation and dataset construction (see the sketch below).
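A hedged example of LLM-assisted annotation, here paraphrasing a template-based question into a more natural Polish question via the OpenAI chat completions API; the prompt wording, model name, and example question are assumptions, not details reported in the paper:

# Illustrative LLM-assisted annotation step: rewriting a template question naturally.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def naturalize(template_question: str, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to rewrite a rigid template question as natural Polish."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Przeformułuj pytanie tak, aby brzmiało naturalnie po polsku. "
                        "Zachowaj jego znaczenie."},
            {"role": "user", "content": template_question},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Example call (hypothetical template question):
# naturalize("Kto jest autorem [Pan Tadeusz]?")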
Project Team
Albert Sawczyn
Researcher
Katsiaryna Viarenich
Researcher
Konrad Wojtasik
Researcher
Aleksandra Domogała
Researcher
Marcin Oleksy
Researcher
Maciej Piasecki
Researcher
Tomasz Kajdanowicz
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Albert Sawczyn, Katsiaryna Viarenich, Konrad Wojtasik, Aleksandra Domogała, Marcin Oleksy, Maciej Piasecki, Tomasz Kajdanowicz
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI