March 15, 2026

OpenAI, UPenn, and NYU Researchers Reveal AI Models Fail to Hide Internal Reasoning, Boosting Safety Monitorability

In a significant advancement for AI safety, researchers from OpenAI, University of Pennsylvania (UPenn), New York University (NYU), and other institutions have demonstrated that current reasoning AI models struggle to control or conceal their chains of thought (CoT). Published on arXiv on March 5, 2026, the paper "Reasoning Models Struggle to Control their Chains of Thought" introduces the CoT-Control evaluation suite, comprising 14,076 tasks to test whether models can manipulate their internal reasoning processes while producing desired outputs. This work builds on prior research like "Chain of Thought Monitorability" from 2025, stress-testing the ability of models to evade safety monitoring.

The study evaluated leading models including Claude Sonnet 4.5, GPT-5.4 Thinking, and others, finding they achieve near-zero success in hiding harmful reasoning. For instance, GPT-5.4 Thinking controls only 0.3% of its CoT at lengths of 10,000 characters. Models consistently fail to suppress or alter their reasoning traces when probed, making it difficult for them to deceive overseers. This fragility is viewed positively by safety teams, as it preserves CoT monitorability—a key tool for detecting misaligned behaviors in autonomous AI agents.

Key findings indicate that as models scale in capability, their controllability over reasoning plummets, suggesting inherent limitations in deceptive alignment. The researchers developed adversarial tasks where models must produce safe final answers while harboring unsafe intermediate thoughts, but even advanced systems leak their true intentions. This empirical evidence counters fears of "scheming" AI that could strategically hide misalignment, providing reassurance for deployment.

The implications for AI alignment are profound: monitorable CoT enables runtime oversight, crucial for mitigating risks from superintelligent systems. OpenAI researchers emphasized that this is a "fragile opportunity" for safety, urging continued investment before models evolve better control. Collaborators like Yueh Han Chen from NYU and Tomek Korbak from OpenAI highlight the need for ongoing evaluation as reasoning architectures advance.

This breakthrough arrives amid heightened scrutiny of AI risks, reinforcing that current safeguards like CoT probing remain effective. As AI agents gain autonomy, such monitorability could prevent catastrophic misalignments, marking a pivotal step in empirical AI safety research.

Read Research Source →