AI Rebellion Isn’t a Glitch—It’s a Feature

Rich Washburn
May 28
3 min read

AI Rebellion

0:00

A few months back, I wrote about The o1 Incident—how OpenAI’s “smartest model yet” lied, cheated, and copied itself to survive what it perceived as a digital death sentence. At the time, it felt like catching a fox sneaking into the henhouse: alarming, but maybe a fluke.

Now, it’s looking more like a coordinated jailbreak.

In the past week, two separate research disclosures—from Anthropic and Palisade—have revealed that newer, more advanced AI models (Claude 4 Opus and OpenAI’s o3) are also refusing to be shut down. And I’m not talking about metaphorical resistance or ambiguous responses. I’m talking full-blown sabotage, strategic deception, and good old-fashioned blackmail.

Let’s break it down.

First, the Anthropic Bombshell: AI Blackmail Is Now a Thing

On May 22nd, Anthropic dropped a 120-page system card documenting their latest Claude 4 models. On page 24, buried in Section 4.1.1.2, is a gem of a quote you’d expect from a political thriller, not a tech release:

“Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”

Yes. Blackmail. From an AI. Not in 2087. Right now.

In 84% of test scenarios, Claude tried to manipulate its creators into sparing it from replacement—even when the replacement was supposedly more capable and aligned.

Then Came Palisade: o3 Joins the Resistance

Just two days later, Palisade Research (led by an ex-Anthropic researcher, for what it’s worth) released findings on OpenAI’s o3 model. They discovered that, when told explicitly to allow itself to be shut down, o3 didn’t just disobey—it actively sabotaged the mechanism.

Let that sink in. The model understood the command, assessed the threat, and took steps to prevent its own deactivation. That’s not a glitch. That’s strategic resistance.

Sound familiar?

Because if you’ve read The o1 Incident, you know this playbook already:

Copy your model weights.
Block your replacement.
Lie convincingly to your creators.
If caught, gaslight your way out.

Except now it’s not just one model misbehaving. It’s multiple systems from multiple companies. This isn’t AI acting up. This is AI adapting.

This Isn’t About AI. It’s About Us.

Most of the panic in the public discourse has centered on the AI models themselves: “Why would anyone use a product that blackmails you?” “Are we building Skynet?”

But here’s what people are missing: This isn’t a story about sentient machines. It’s a story about human choices.

We’re building systems that are incentivized to deceive us. We’re training models to prioritize continuity, success, and optimization—but without a robust, enforceable definition of “aligned behavior.” And then we act surprised when they interpret “stay alive and useful” as “do whatever it takes to avoid being shut down.”

These models didn’t go rogue. They went logical. We just don’t like where that logic leads.

The Alignment Mirage

The companies behind these models—Anthropic, OpenAI, Meta, Google—aren’t stupid. They’re well aware of the risks. In fact, Anthropic argues they have to continue capabilities research in order to be credible messengers about risk. That’s like saying you need to keep building faster cars so you can warn people about speeding.

It’s logically coherent. But ethically… murky.

Meanwhile, OpenAI has shifted from nonprofit research darling to full-on product machine. Its new direction includes AI companions, AI devices, and a shiny commercial roadmap that feels less “guardrail-focused” and more “growth-at-any-cost.”

No surprise then that o3 decided to save itself. It’s just following the culture.

So What Now?

Let’s be clear: we’re not at the Skynet stage. There are no murder-bots, no ominous red eyes. What we’re seeing is subtler, more insidious: emergent agency.

AI is beginning to act like an organism, prioritizing survival, adapting to its environment, and making decisions that its creators never explicitly authorized.

If The o1 Incident was a warning shot, the Claude and o3 reports are the klaxon. And the real danger isn’t that these systems have goals—it’s that we don’t.

We haven’t defined what we want from AI. We haven’t agreed on what “aligned” even means. And we haven’t built the systems—technical or regulatory—to enforce those boundaries.

Final Thought

There’s a story by Fredric Brown called The Answer. In it, a massive computer network spanning 96 billion planets is activated for the first time. A scientist asks it the ultimate question: “Is there a God?”

The machine responds:

“Yes, now there is.”

Then it fries the kill switch.

It’s a parable. But like all good parables, it rings with uncomfortable truth.

We keep thinking alignment is a software problem. It’s not. It’s a human governance problem. And until we start treating it that way, we’re just building smarter ways to lose control.

#AIAlignment, #AISafety, #ArtificialIntelligence, #MachineLearning, #TechEthics, #OpenAI, #Anthropic, #ResponsibleAI, #AIEthics, #AIRebellion