
Prompting Science Report 1: Prompt Engineering is Complicated and Contingent

Project Overview

This report examines the role of generative AI, particularly Large Language Models (LLMs), in education, focusing on the complexities of measuring their performance. It underscores the lack of a standardized assessment of LLM effectiveness, showing that performance varies with prompting technique and that rigorous testing is essential for accurate evaluation. The findings indicate that the politeness of a prompt can affect outcomes unpredictably, suggesting that traditional benchmarking methods may not reliably reflect model performance. By identifying these variables, the research aims to give educational and policy leaders a nuanced understanding of AI's potential applications in educational settings, guiding them in harnessing AI technologies to enhance learning while remaining aware of their limitations.

Key Applications

Generative AI for educational assessments

Context: Graduate-level assessments across subjects such as biology, physics, and chemistry.

Implementation: The AI models were tested on the GPQA Diamond dataset under varied prompting conditions (polite, commanding, formatted, unformatted). Each question was asked 100 times to assess consistency; a minimal sketch of this setup appears below.

Outcomes: The study found substantial performance variability and indicated that certain prompts could improve performance depending on the question.

Challenges: Inconsistencies in responses from AI models and the lack of a standard measure for AI performance can lead to misleading conclusions.
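To make the repeated-trial setup concrete, the sketch below shows one way such an experiment could be scripted. It assumes the OpenAI Python SDK; the question text, answer key, and prompt prefixes are illustrative placeholders, not the study's actual materials or code.

```python
# Minimal sketch of repeated-trial prompting under different prompt styles.
# Assumes the OpenAI Python SDK; question, answer key, and prefixes are hypothetical.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_VARIANTS = {
    "polite": "Please answer the following multiple-choice question.",
    "commanding": "Answer the following multiple-choice question.",
}

question = (
    "Which of the following ...?\n"
    "A) ...\nB) ...\nC) ...\nD) ...\n"
    "Respond with a single letter."
)
correct_answer = "C"  # hypothetical answer key
N_TRIALS = 100        # each question is asked repeatedly to assess consistency

correct_counts = defaultdict(int)
for variant, prefix in PROMPT_VARIANTS.items():
    for _ in range(N_TRIALS):
        response = client.chat.completions.create(
            model="gpt-4o-mini-2024-07-18",
            messages=[{"role": "user", "content": f"{prefix}\n\n{question}"}],
        )
        # Score the first letter of the reply against the answer key.
        answer = response.choices[0].message.content.strip().upper()[:1]
        correct_counts[variant] += int(answer == correct_answer)

for variant, n_correct in correct_counts.items():
    print(f"{variant}: {n_correct}/{N_TRIALS} correct")
```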

Implementation Barriers

Technical Barrier

Variability in AI performance based on different prompting and testing conditions.

Proposed Solutions: Implementing rigorous testing with repeated trials to capture performance variability more accurately.
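As a hedged illustration of what repeated trials can buy, per-question accuracy can be reported with a confidence interval rather than a single pass/fail outcome. The sketch below assumes results have already been collected as 0/1 correctness scores; the figures are invented for illustration.

```python
# Minimal sketch of summarizing repeated trials per question.
# Assumes results are stored as lists of 0/1 correctness scores; numbers are made up.
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (center - half_width, center + half_width)

# Hypothetical per-question results: 100 repeated trials each, scored 1 if correct.
results = {
    "question_01": [1] * 62 + [0] * 38,
    "question_02": [1] * 95 + [0] * 5,
}

for qid, scores in results.items():
    lo, hi = wilson_interval(sum(scores), len(scores))
    print(f"{qid}: accuracy {sum(scores)/len(scores):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```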

Measurement Barrier

There is no single standard for measuring AI performance, which can lead to overestimating reliability across contexts and applications.

Proposed Solutions: Developing clearer standards for AI performance evaluation that account for various contexts and applications.

Project Team

Lennart Meincke

Researcher

Ethan Mollick

Researcher

Lilach Mollick

Researcher

Dan Shapiro

Researcher

Contact Information

For information about the paper, please contact the authors.

Authors: Lennart Meincke, Ethan Mollick, Lilach Mollick, Dan Shapiro

Source Publication: View Original Paper

Project Contact: Dr. Jianhua Yang

LLM Model Version: gpt-4o-mini-2024-07-18

Analysis Provider: OpenAI
