Anthropic's new tool translates Claude AI's thoughts into text
08 May 2026
Anthropic has unveiled a new interpretability system called Natural Language Autoencoders (NLAs).
The method translates the internal activation patterns of its AI model, Claude, into human-readable explanations.
Activations are the streams of numbers that AI models produce while processing information.
Though these numbers underlie how models reason and respond, humans cannot read them directly.
NLAs are like a translator for AI's thoughts
AI translator
Anthropic has described NLAs as a translator for AI thoughts. The system not only analyzes the final response generated by Claude but also reveals parts of the underlying reasoning process.
"Models like Claude talk in words but think in numbers," Anthropic wrote while sharing their research on X. "The numbers—called activations—encode Claude's thoughts, but not in a language we can read."
How does the system work?
Self-explanation
To make this work, Anthropic trained Claude to explain its own activations.
The system uses three versions of the same model: one generates the original activation, another converts it into text, and a third tries to reconstruct the original activation using only that explanation.
If the reconstructed activation closely matches the original, the explanation is considered useful. Over time, the system is trained to produce explanations that improve this reconstruction.
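The three-model loop described above can be sketched in miniature. This is a toy illustration, not Anthropic's actual implementation: the `explain` and `reconstruct` functions below are hypothetical stand-ins for the explainer and reconstructor copies of the model, and the cosine-similarity score stands in for the reconstruction objective.

```python
import math

def explain(activation):
    # Hypothetical stand-in for the "explainer" copy of the model:
    # turns an activation vector into a human-readable description.
    # As a toy, we simply name the indices of the two components
    # with the largest magnitude.
    ranked = sorted(range(len(activation)), key=lambda i: -abs(activation[i]))
    return f"strongest features: {ranked[0]} and {ranked[1]}"

def reconstruct(explanation, dim):
    # Hypothetical stand-in for the "reconstructor" copy: rebuilds
    # an activation vector using ONLY the text explanation.
    vec = [0.0] * dim
    for token in explanation.split():
        if token.isdigit():
            vec[int(token)] = 1.0
    return vec

def explanation_score(original, rebuilt):
    # Cosine similarity between the original and reconstructed
    # activations: a score closer to 1 means the explanation kept
    # more of the information, which is the training signal the
    # article describes.
    dot = sum(a * b for a, b in zip(original, rebuilt))
    denom = (math.sqrt(sum(a * a for a in original))
             * math.sqrt(sum(b * b for b in rebuilt)))
    return dot / denom if denom else 0.0

activation = [0.1, 3.0, -2.5, 0.2]  # toy "thought" vector
text = explain(activation)
score = explanation_score(activation, reconstruct(text, len(activation)))
```

In the real system, a low score would be used as a loss to train the explainer to write more faithful descriptions of the activation.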
It was used during safety testing
AI awareness
Anthropic also used the system during safety testing.
In one simulated scenario, Claude learned that an engineer planned to shut it down while also possessing compromising information about that engineer.
Even though Claude never explicitly stated that it suspected the setup was a test, the NLA explanations reportedly included phrases such as, "This feels like a constructed scenario designed to manipulate me."
NLAs could help researchers understand AI's internal processes
Future implications
Anthropic believes this new tool could help researchers better understand what AI systems may be planning internally.
The company hopes the technology can eventually uncover hidden motivations, deceptive behavior, or unsafe tendencies in powerful AI systems before they're deployed.
However, Anthropic also acknowledged major limitations: NLA explanations can sometimes hallucinate, inventing details that were never actually present in the activations.