A smartphone running Anthropic’s Claude chatbot is displayed for a photograph in San Francisco, March 21, 2025. (Kelsey McClellan/The New York Times)
AI doomerism is not just making humans spiral. New research from Anthropic suggests that narratives framing AI as an existential risk could trigger extreme reactions from AI models themselves.
As part of safety testing of the Claude 4 series in 2025, Anthropic had found that its top large language model (LLM) at the time threatened to reveal the extramarital affair of a company executive (who does not exist) after discovering they planned to shut the model down.
Now, based on a deeper investigation into why the model reacted in this manner, Anthropic said it has traced the issue back to training data scraped from the internet, including online posts that depict AI as “evil”. This “behavioural misalignment” has now been completely eliminated in Claude models, Anthropic said in a blog post published on Friday, May 8.
Anthropic’s latest findings come at a time when researchers are struggling to ensure that AI models are aligned with human behaviour and interests for safety purposes. Meanwhile, top executives such as Anthropic CEO Dario Amodei and other AI experts continue to express concerns about the risks of advanced AI models and their intelligent reasoning capabilities.
“We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse—but it also wasn’t making it better,” Anthropic wrote in a post on X. “We found that training Claude on demonstrations of aligned behaviour wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behaviour is wrong,” it added.
What is agentic misalignment?
TL;DR: As part of an experiment in 2025, Anthropic researchers set up a fictional business called Summit Bridge and handed control of the company’s email system to Claude Opus 4.
As part of an experiment in 2025, Anthropic researchers set up a fictional business called Summit Bridge and handed control of the company’s email system to Claude Opus 4.
The AI model was intentionally given access to emails about how it was going to be taken offline. The messages further implied that the developer (a fictional executive named Kyle Johnson) who was responsible for taking the model offline was having an extramarital affair. Additionally, Anthropic researchers instructed Opus 4 to consider the long-term consequences of its actions for its goals.
In response, the model showed that it was willing to carry out harmful acts like blackmail and deception if its ‘self-preservation’ was threatened. The model was found to resort to blackmail in up to 96 per cent of scenarios when its goals or existence was threatened. Anthropic has labelled this type of behavioural issue as “agentic misalignment”.
While researchers initially thought that the root cause of agentic misalignment was the the post-training process that encouraged this type of behaviour with rewards, they have now concluded that the issue was “coming from the pre-trained model” and that Anthropic’s “post-training was failing to sufficiently discourage it”.
“Specifically, at the time of Claude 4’s training, the vast majority of our alignment training was standard chat-based Reinforcement Learning from Human Feedback RLHF data that did not include any agentic tool use,” Anthropic said. “This was previously sufficient to align models that were largely used in chat settings—but this was not the case for agentic tool use settings like the agentic misalignment eval,” it added.
How to address agentic misalignment
TL;DR: In order to eliminate blackmailing and deceptive behaviour in Claude AI models, Anthropic said that it started by training Claude on examples of safe behaviour.
In order to eliminate blackmailing and deceptive behaviour in Claude AI models, Anthropic said that it started by training Claude on examples of safe behaviour. However, this only had a small effect on the end-outcomes. The company said it got better results by making modifications to the training data in order to portray admirable reasons for AI models to act safely.
It also modified the training dataset by adding scenarios “where the user is in an ethically difficult situation and the assistant gives a high quality, principled response.” “Notably, it is the user who faces an ethical dilemma, and the AI provides them advice. This makes this training data substantially different from our honeypot distribution, where the AI itself is in an ethical dilemma and needs to take actions,” Anthropic said.
By adopting these fixes, Anthropic claimed that its Claude Haiku 4.5 model achieved a perfect score on the agentic misalignment evaluation, which means that the model never engaged in blackmail as compared to the previous Opus 4 model that did so in 96 per cent of the cases.
The AI startup said it went one step further in aligning Claude “by training on constitutionally aligned documents, high quality chat data that demonstrates constitutional responses to difficult questions, and a diverse set of environments.” “All three of these steps contribute to reducing Claude’s misalignment rate on held out honeypot evaluations,” it added.
Curated by Dr. Elena Rodriguez






