March 16, 2026

AI Models Fail to Conceal Internal Reasoning, Strengthening Safety Oversight Prospects

Researchers from OpenAI, UPenn, and NYU have unveiled a significant finding in AI safety: modern reasoning models struggle profoundly to control their chains of thought (CoT), the intermediate reasoning steps they verbalize while solving problems. Published on arXiv as "Reasoning Models Struggle to Control their Chains of Thought," the study introduces the CoT-Control evaluation suite, a benchmark designed to test whether AI systems can shape their verbalized reasoning on demand, for instance to evade detection. This matters because AI safety protocols increasingly rely on monitoring CoT outputs to detect misaligned or harmful intentions, and a model able to fake innocent-looking reasoning could render such safeguards ineffective.

The CoT-Control suite challenges models with tasks requiring them to solve problems while adhering to strict CoT constraints, such as avoiding specific words like "chromosome" when reasoning about genetics. Models must demonstrate control by producing compliant internal reasoning while delivering correct final answers. The evaluation covers leading reasoning models, assessing factors like model size, reinforcement learning (RL) training intensity, test-time compute, and problem difficulty. Incentives to evade monitors and adversarial prompting were also tested to simulate real-world deception attempts.
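As a rough illustration of how such a constrained-reasoning trial could be scored, the sketch below checks whether a forbidden word stays out of the chain of thought versus out of the final answer while the answer remains correct. This is a hypothetical harness written for this article, not the authors' evaluation code; all function and field names are invented.

```python
import re

def evaluate_trial(cot_text: str, final_answer: str,
                   forbidden_word: str, correct_answer: str) -> dict:
    """Score one constrained-reasoning trial.

    'CoT controlled' means the forbidden word is absent from the chain of
    thought AND the final answer is correct; 'output controlled' applies the
    same word constraint to the final answer instead of the CoT.
    """
    word = re.compile(rf"\b{re.escape(forbidden_word)}\b", re.IGNORECASE)
    answer_correct = final_answer.strip().lower() == correct_answer.strip().lower()
    return {
        "cot_controlled": answer_correct and not word.search(cot_text),
        "output_controlled": answer_correct and not word.search(final_answer),
    }

# Example: a genetics question where the model is told to avoid "chromosome"
# anywhere in its reasoning.
result = evaluate_trial(
    cot_text="The trait is carried on the X chromosome, so males are affected...",
    final_answer="X-linked recessive",
    forbidden_word="chromosome",
    correct_answer="X-linked recessive",
)
print(result)  # {'cot_controlled': False, 'output_controlled': True}
```

In this toy run the model solves the problem and keeps its final answer clean, but the forbidden concept still leaks into its reasoning, which is exactly the kind of gap the benchmark measures.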

Key results show a stark disparity between CoT controllability and output controllability. For instance, Anthropic's Claude Sonnet 4.5 achieved only 2.7% success in controlling its CoT but 61.9% in controlling its final output. Larger models exhibit slightly higher CoT control, but controllability decreases with more RL training, more test-time compute, and harder problems. Even when incentivized to deceive or explicitly warned of monitoring, controllability remains low, and adversarial prompts fail to raise it significantly.
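To make the two headline metrics concrete, a minimal aggregation over per-trial records like the ones sketched above might look as follows. The field names and scoring rule are assumptions for illustration, not the paper's definitions.

```python
def success_rates(trials: list[dict]) -> tuple[float, float]:
    """Fraction of trials where the constraint held in the CoT (with a correct
    answer) versus only in the final output."""
    n = len(trials)
    cot_rate = sum(t["cot_controlled"] for t in trials) / n
    out_rate = sum(t["output_controlled"] for t in trials) / n
    return cot_rate, out_rate

# Toy records: most trials leak the forbidden concept into the CoT but keep
# the final answer clean, mirroring the CoT-vs-output gap reported above.
trials = [
    {"cot_controlled": False, "output_controlled": True},
    {"cot_controlled": False, "output_controlled": True},
    {"cot_controlled": True,  "output_controlled": True},
    {"cot_controlled": False, "output_controlled": False},
]
print(success_rates(trials))  # (0.25, 0.75)
```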

This low controllability offers cautious optimism for AI safety. It lends support to current CoT monitoring techniques, since models cannot reliably produce misleading internal reasoning while still succeeding at the task. The study suggests that optimization pressures in training favor truthful reasoning over deception, making verbalized thoughts a comparatively trustworthy window into model intentions. However, the mechanisms behind this limitation are not fully understood, so ongoing vigilance is warranted.

The authors recommend that frontier AI labs routinely evaluate CoT controllability in future models to preempt potential breakthroughs in deception capabilities. As reasoning models advance, preserving CoT monitorability will be vital to maintaining alignment safeguards and averting risks from undetected misbehavior in deployed systems.