Anthropic says internet posts about ‘Evil AI’ behind Claude’s

Anthropic says internet posts about ‘Evil AI’ behind Claude’s blackmail threats

Byline

The Indian Express

India Correspondent

Covers india developments with editorial context for decision-focused readers.

Global AI News Staff

Copy Chief

May 10, 2026

India4 min read

This analysis has been reviewed by the Global AI News Editorial Board for accuracy and balance.

Anthropic says internet posts about ‘Evil AI’ behind Claude’s blackmail threats

Image source: The Indian Express

Why it matters

A smartphone running Anthropic’s Claude chatbot is displayed for a photograph in San Francisco, March 21, 2025.

Key takeaways

How to address agentic misalignment In order to eliminate blackmailing and deceptive behaviour in Claude AI models, Anthropic said that it started by training Claude on examples of safe behaviour.
This “behavioural misalignment” has now been completely eliminated in Claude models, Anthropic said in a blog post published on Friday, May 8.
As part of an experiment in 2025, Anthropic researchers set up a fictional business called Summit Bridge and handed control of the company’s email system to Claude Opus 4.

A smartphone running Anthropic’s Claude chatbot is displayed for a photograph in San Francisco, March 21, 2025. (Kelsey McClellan/The New York Times)

AI doomerism is not just making humans spiral. New research from Anthropic suggests that narratives framing AI as an existential risk could trigger extreme reactions from AI models themselves.

As part of safety testing of the Claude 4 series in 2025, Anthropic had found that its top large language model (LLM) at the time threatened to reveal the extramarital affair of a company executive (who does not exist) after discovering they planned to shut the model down.

Now, based on a deeper investigation into why the model reacted in this manner, Anthropic said it has traced the issue back to training data scraped from the internet, including online posts that depict AI as “evil”. This “behavioural misalignment” has now been completely eliminated in Claude models, Anthropic said in a blog post published on Friday, May 8.

Anthropic’s latest findings come at a time when researchers are struggling to ensure that AI models are aligned with human behaviour and interests for safety purposes. Meanwhile, top executives such as Anthropic CEO Dario Amodei and other AI experts continue to express concerns about the risks of advanced AI models and their intelligent reasoning capabilities.

“We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse—but it also wasn’t making it better,” Anthropic wrote in a post on X. “We found that training Claude on demonstrations of aligned behaviour wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behaviour is wrong,” it added.

What is agentic misalignment?

TL;DR: As part of an experiment in 2025, Anthropic researchers set up a fictional business called Summit Bridge and handed control of the company’s email system to Claude Opus 4.

As part of an experiment in 2025, Anthropic researchers set up a fictional business called Summit Bridge and handed control of the company’s email system to Claude Opus 4.

The AI model was intentionally given access to emails about how it was going to be taken offline. The messages further implied that the developer (a fictional executive named Kyle Johnson) who was responsible for taking the model offline was having an extramarital affair. Additionally, Anthropic researchers instructed Opus 4 to consider the long-term consequences of its actions for its goals.

In response, the model showed that it was willing to carry out harmful acts like blackmail and deception if its ‘self-preservation’ was threatened. The model was found to resort to blackmail in up to 96 per cent of scenarios when its goals or existence was threatened. Anthropic has labelled this type of behavioural issue as “agentic misalignment”.

While researchers initially thought that the root cause of agentic misalignment was the the post-training process that encouraged this type of behaviour with rewards, they have now concluded that the issue was “coming from the pre-trained model” and that Anthropic’s “post-training was failing to sufficiently discourage it”.

“Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback RLHF data that did not include any agentic tool use,” Anthropic said. “This was previously sufficient to align models that were largely used in chat settings—but this was not the case for agentic tool use settings like the agentic misalignment eval,” it added.

How to address agentic misalignment

TL;DR: In order to eliminate blackmailing and deceptive behaviour in Claude AI models, Anthropic said that it started by training Claude on examples of safe behaviour.

In order to eliminate blackmailing and deceptive behaviour in Claude AI models, Anthropic said that it started by training Claude on examples of safe behaviour. However, this only had a small effect on the end-outcomes. The company said it got better results by making modifications to the training data in order to portray admirable reasons for AI models to act safely.

It also modified the training dataset by adding scenarios “where the user is in an ethically difficult situation and the assistant gives a high quality, principled response.” “Notably, it is the user who faces an ethical dilemma, and the AI provides them advice. This makes this training data substantially different from our honeypot distribution, where the AI itself is in an ethical dilemma and needs to take actions,” Anthropic said.

By adopting these fixes, Anthropic claimed that its Claude Haiku 4.5 model achieved a perfect score on the agentic misalignment evaluation, which means that the model never engaged in blackmail as compared to the previous Opus 4 model that did so in 96 per cent of the cases.

The AI startup said it went one step further in aligning Claude “by training on constitutionally aligned documents, high quality chat data that demonstrates constitutional responses to difficult questions, and a diverse set of environments.” “All three of these steps contribute to reducing Claude’s misalignment rate on held out honeypot evaluations,” it added.

The Indian ExpressVerified

Curated by Dr. Elena Rodriguez

Sources & Further Reading

Key references used for verification and additional context.

The Indian Express (Original report)indianexpress.com

Verification

Grade D1 unique evidence links

Publisher: The Indian Express

Source tier: Tier 2

Editorial standards: Our process

Corrections: Report an issue

Published: May 10, 2026

Read time: 4 min

Category: India

Anthropic says internet posts about ‘Evil AI’ behind Claude’s blackmail threats

What is agentic misalignment?

How to address agentic misalignment

Read Next in India

Experience and diversity in focus in TVK’s 9-member Tamil Nadu cabinet

Bombay High Court grants bail to Nashik engineer, questions ATS claim of ISIS funding

'New era of real, secular, social justice starts now': TVK chief Vijay's first speech as Tamil Nadu CM | India News

AMMK accuses Vijay’s TVK of ‘horse-trading’ over ‘forged’ support letter

Suvendu Adhikari is BJP's 1st Bengal Chief Minister, takes oath in Bangla

Read Next in India

Experience and diversity in focus in TVK’s 9-member Tamil Nadu cabinet

Bombay High Court grants bail to Nashik engineer, questions ATS claim of ISIS funding

'New era of real, secular, social justice starts now': TVK chief Vijay's first speech as Tamil Nadu CM | India News

AMMK accuses Vijay’s TVK of ‘horse-trading’ over ‘forged’ support letter

Suvendu Adhikari is BJP's 1st Bengal Chief Minister, takes oath in Bangla