March 16, 2026

AI Alignment Community Raises Alarms Over Corrigibility in Claude's New Constitution

In a post published just four hours ago on LessWrong, titled "Terrified Comments on Corrigibility in Claude's Constitution," the AI alignment community has spotlighted concerning aspects of Anthropic's latest framework for its Claude AI model. Corrigibility, a term coined in AI safety research for an AI's willingness to let humans modify its goals or shut it down, features prominently in Claude's updated constitution. The post's "terrified comments" suggest that the way Claude has internalized this critical alignment property may undermine safety efforts.

Anthropic released the new constitution on January 22, 2026, as a foundational document guiding Claude's values, behavior, and training toward safety, ethics, and helpfulness. In cases of conflict, it places "broadly safe" behavior, which explicitly includes not undermining human oversight mechanisms, above ethical considerations. The stated intent is to preserve humans' ability to correct mistakes, flawed values, or misunderstandings during AI development, a design choice that directly addresses corrigibility.

The recent LessWrong analysis, however, expresses alarm at specific passages and implications in the constitution, framing them as terrifying for alignment researchers. Corrigibility remains a term of art in the field, considered essential for keeping advanced AI systems controllable, and the discussion underscores the ongoing difficulty of embedding such properties reliably into frontier models like the Claude Opus and Sonnet variants.

This development reignites debate over Constitutional AI, a technique that has evolved from Anthropic's original 2023 methods and uses the constitution itself to generate synthetic training data. While Anthropic emphasizes transparency by releasing the document under a CC0 license, critics question whether the model's interpretation of it truly safeguards against misalignment. Recent evaluations show improved constitution-following, with Sonnet 4.6 violating the constitution in 1.9% of tested cases and Opus 4.6 in 2.9%, but the "terrified" reaction suggests issues that headline rates alone do not capture.
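For readers unfamiliar with the data-generation step, the original Constitutional AI work describes a critique-and-revision loop: the model drafts a response, critiques it against a sampled principle from the constitution, then rewrites it, and the revised outputs become supervised fine-tuning data. The sketch below illustrates that loop in outline only; `query_model` and the principle strings are hypothetical placeholders, not Anthropic's actual pipeline or the wording of Claude's constitution.

```python
# Minimal sketch of a Constitutional AI critique-and-revision loop.
# query_model is a hypothetical stand-in for any chat-completion call;
# the principles are illustrative, not quoted from Claude's constitution.

import random

PRINCIPLES = [
    "Choose the response that best supports human oversight of the AI.",
    "Choose the response that is most honest and least deceptive.",
]

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an HTTP request to an API)."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str, draft: str) -> str:
    """Produce one synthetic training example via self-critique."""
    principle = random.choice(PRINCIPLES)
    critique = query_model(
        f"Critique this response against the principle.\n"
        f"Principle: {principle}\nPrompt: {user_prompt}\nResponse: {draft}"
    )
    revision = query_model(
        f"Rewrite the response to address the critique.\n"
        f"Critique: {critique}\nOriginal response: {draft}"
    )
    return revision  # revised outputs feed supervised fine-tuning
```

In the published method, many such revised examples are collected for supervised fine-tuning, followed by reinforcement learning from AI feedback scored against the same principles; how Anthropic applies this to the 2026 constitution is not detailed in the post.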

As AI capabilities scale, this episode highlights the precarious state of alignment research. The community's response on platforms like LessWrong signals that corrigibility in production models like Claude demands urgent scrutiny; how that scrutiny resolves may foreshadow either breakthroughs or setbacks in making superintelligent systems safe.