March 11, 2026

OpenAI Releases IH-Challenge Dataset, Major Advance in AI Safety and Instruction Hierarchy

In a significant development for AI safety, OpenAI announced the IH-Challenge on March 10, 2026, a new reinforcement learning training dataset designed to enhance instruction hierarchy (IH) in frontier large language models (LLMs). IH establishes a trust-ordered policy for resolving conflicting instructions, prioritizing system messages over developer, user, and tool messages. This initiative addresses critical vulnerabilities such as jailbreaks, system prompt extraction, and prompt injection attacks, which have posed ongoing challenges to AI alignment and security.

The IH-Challenge dataset consists of tasks that are instruction-following simple, objectively gradable via Python scripts, and free of trivial shortcuts. It features conversations with conflicting instructions from high-privilege roles (e.g., system or developer) and lower-privilege roles (e.g., user or tool). Created through an offline phase of synthesizing task skeletons and an online phase using a frozen attacker LLM for adversarial low-priority prompts during RL training, the dataset avoids pitfalls like confounded instruction-following failures and shortcut learning such as overrefusal.

Fine-tuning frontier LLMs like GPT-5-Mini on IH-Challenge produced GPT-5-Mini-R, yielding substantial improvements. IH robustness increased by 10.0 percentage points on average across 16 benchmarks (from 84.1% to 94.1%), with gains in held-out attacks, human red-teaming (63.8% to 88.2%), and academic evaluations like Gandalf Password and TensorTrust. Safety steerability enhanced, reducing unsafe behavior from 6.6% to 0.7% on OpenAI’s Production Benchmarks when safety specs are in system prompts, while preserving helpfulness.

Prompt injection robustness also saw major boosts, saturating OpenAI’s internal agentic evaluations (44% to 100%) and improving on CyberSecEval 2 (88% to 91%). Models better ignore malicious tool instructions, with no significant capability regressions on benchmarks like GPQA Diamond, AIME 2024, or chat win-rates. Overrefusal issues were mitigated, as shown by improvements on IH-Challenge tasks (+21%).

This breakthrough demonstrates that well-designed IH training generalizes to realistic scenarios, providing defense-in-depth for safety without relying on less effective mitigations like output monitors. OpenAI has open-sourced the IH-Challenge dataset on Hugging Face, enabling broader adoption to fortify agentic AI systems against evolving threats.

Read Research Source →