Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Project Overview
This document explores the role of Multimodal Large Language Models (MLLMs) in education, focusing on their application as visual assistants for individuals with visual impairments. It details user adoption patterns and the challenges users face, such as limited accuracy, cultural sensitivity, and contextual understanding, which hinder the effectiveness of these AI tools. Key applications include Optical Braille Recognition, which transcribes Braille from images, and video question answering, in which models answer questions about videos recorded by visually impaired users. The findings show a strong willingness among users to adopt AI for visual assistance, yet highlight significant barriers that must be addressed to improve the reliability and usability of these technologies in educational contexts. The document calls for further advances in MLLMs so that their performance in accessibility applications meets the diverse needs of users.
Key Applications
Optical Braille Recognition
Context: Assistive technology for visually impaired individuals, focusing on educational tools that transcribe Braille text rendered in images into plain English, thereby improving accessibility to written content for learners who use Braille.
Implementation: Models are evaluated on transcribing augmented images of Braille text into plain English, using a dedicated Braille-to-text transcription dataset; an illustrative prompting sketch follows below.
Outcomes: Improved accessibility to educational materials for visually impaired individuals, with stronger recognition capabilities demonstrated by models such as Qwen2-VL-Instruct, although high accuracy remains essential to avoid hallucinated transcriptions.
Challenges: Many models lack Braille recognition capabilities out of the box, and training data quality and generalization remain open problems on the way to reliably accurate Braille recognition.
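To make the setup above concrete, here is a minimal sketch of how a vision-language model could be prompted to transcribe a Braille image. It assumes an OpenAI-compatible chat completions endpoint and a local image file; the model name, prompt wording, and file name are illustrative assumptions, not the configuration used in the study.

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def transcribe_braille(image_path: str, model: str = "gpt-4o-mini") -> str:
        """Ask a vision-language model to transcribe Braille in an image into plain English."""
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("Transcribe the Braille text in this image into plain English. "
                              "If the Braille is unreadable, say so instead of guessing.")},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    # Hypothetical usage: print(transcribe_braille("braille_sample.png"))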
Video Question Answering
Context: Assistive technology for visually impaired individuals that uses videos recorded by visually impaired users to generate descriptive, spatial, and adversarial questions, supporting better access to the content of those videos.
Implementation: Descriptive, spatial, and adversarial questions are created from a curated dataset of videos filmed by visually impaired users, and models are systematically evaluated on how well they answer them; a frame-sampling sketch is shown below.
Outcomes: Generated questions provide context and detail that improve interaction with video content for visually impaired users; performance varies across models, with adversarial questions proving especially challenging.
Challenges: Models struggle to acknowledge uncertainty and often hallucinate or produce misinformation, particularly in response to adversarial questions, raising safety concerns.
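As a rough illustration of the evaluation setup described above, the sketch below uniformly samples frames from a video with OpenCV and passes them, together with a question, to a vision-language model via an OpenAI-compatible endpoint. The frame count, model name, and message format are assumptions for illustration only.

    import base64
    import cv2  # OpenCV, used here only to extract frames
    from openai import OpenAI

    client = OpenAI()

    def sample_frames(video_path: str, num_frames: int = 8) -> list[str]:
        """Uniformly sample frames from a video and return them as base64-encoded JPEGs."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for i in range(num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
            ok, frame = cap.read()
            if not ok:
                break
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
        cap.release()
        return frames

    def answer_video_question(video_path: str, question: str, model: str = "gpt-4o-mini") -> str:
        """Answer a question about a video by passing sampled frames to a vision-language model."""
        content = [{"type": "text", "text": question}]
        for frame in sample_frames(video_path):
            content.append({"type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{frame}"}})
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": content}]
        )
        return response.choices[0].message.content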
Video Object Recognition
Context: Identifying objects from videos recorded by visually impaired individuals, with a focus on recognizing both generic objects and assistive items crucial for blind and low-vision (BLV) users.
Implementation: Models are evaluated on identifying objects in these videos, with accuracy assessed separately for generic objects and assistive items; a simple scoring sketch follows below.
Outcomes: Models perform moderately on generic object categories but struggle significantly with assistive objects, indicating a need for improved specificity and accuracy.
Challenges: Limited specificity in recognizing assistive objects, which are essential for BLV users, highlights the need for better training and model development to address this gap.
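One simple way to quantify the gap noted above is to score predictions separately for generic and assistive object categories. The sketch below assumes a hypothetical record format with 'predicted', 'gold', and 'group' fields and uses exact string matching as the scoring rule; the study's actual evaluation protocol may differ.

    from collections import defaultdict

    def accuracy_by_group(records: list[dict]) -> dict[str, float]:
        """Compute recognition accuracy per object group ('generic' vs. 'assistive')."""
        correct = defaultdict(int)
        total = defaultdict(int)
        for r in records:
            total[r["group"]] += 1
            if r["predicted"].strip().lower() == r["gold"].strip().lower():
                correct[r["group"]] += 1
        return {group: correct[group] / total[group] for group in total}

    # Toy example with made-up predictions:
    records = [
        {"predicted": "mug", "gold": "mug", "group": "generic"},
        {"predicted": "stick", "gold": "white cane", "group": "assistive"},
        {"predicted": "braille display", "gold": "braille display", "group": "assistive"},
    ]
    print(accuracy_by_group(records))  # {'generic': 1.0, 'assistive': 0.5}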
Implementation Barriers
Technical Barrier
Inaccuracies in AI model outputs, including hallucinations and misleading information; models may fabricate answers, especially for questions that require inference beyond the provided video content.
Proposed Solutions: Conduct comprehensive evaluations and improve training data quality. Use explicit prompting strategies that guide models to respond 'Not enough information' when the visual evidence is insufficient (see the prompt sketch below).
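The explicit prompting strategy mentioned above could look like the sketch below, which instructs the model to abstain with 'Not enough information' when the supplied frames do not support an answer. The wording and message format are illustrative assumptions, not the exact prompt used in the study.

    # Illustrative abstention instruction; the wording is an assumption.
    ABSTENTION_PROMPT = (
        "You will see frames sampled from a video and a question about it.\n"
        "Answer only from what is visible in the frames.\n"
        "If the frames do not contain enough evidence to answer, "
        "reply exactly: Not enough information.\n\n"
        "Question: {question}"
    )

    def build_messages(question: str, frame_data_urls: list[str]) -> list[dict]:
        """Pair the abstention instruction with base64 frame data URLs for a chat request."""
        content = [{"type": "text", "text": ABSTENTION_PROMPT.format(question=question)}]
        content += [{"type": "image_url", "image_url": {"url": url}} for url in frame_data_urls]
        return [{"role": "user", "content": content}]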
Cultural Barrier
Difficulty in recognizing and understanding cultural nuances in visual content.
Proposed Solutions: Development of culturally aware datasets and models that incorporate user feedback.
Multilingual Barrier
Lack of effective multilingual support for visual question answering.
Proposed Solutions: Creating multilingual datasets and models that are robust across different languages.
Usability Barrier
Challenges in trust and reliability, as users are hesitant to rely on AI for critical tasks.
Proposed Solutions: Engaging with visually impaired users for continuous feedback and iterative improvements.
Data Limitations
Limited datasets for training AI models on tasks specific to visually impaired users, particularly for video question answering.
Proposed Solutions: Curate more diverse datasets that reflect the experiences and needs of visually impaired individuals.
Project Team
Antonia Karamolegkou
Researcher
Malvina Nikandrou
Researcher
Georgios Pantazopoulos
Researcher
Danae Sanchez Villegas
Researcher
Phillip Rust
Researcher
Ruchira Dhar
Researcher
Daniel Hershcovich
Researcher
Anders Søgaard
Researcher
Contact Information
For information about the paper, please contact the authors.
Authors: Antonia Karamolegkou, Malvina Nikandrou, Georgios Pantazopoulos, Danae Sanchez Villegas, Phillip Rust, Ruchira Dhar, Daniel Hershcovich, Anders Søgaard
Source Publication: View Original Paper
Project Contact: Dr. Jianhua Yang
LLM Model Version: gpt-4o-mini-2024-07-18
Analysis Provider: OpenAI