
AI Is Now Doing Its Own AI Safety Research And the Results Are Wild

Anthropic turned nine instances of Claude into autonomous alignment researchers and achieved near-perfect performance on a key safety task. Combined with NVIDIA’s AI-driven chip design breakthrough, this signals a major shift in how AI systems evolve, raising new questions about scalable oversight, control, and ethics.

AI Automating Its Own Future.

What if the best people to solve AI alignment were AIs themselves?

That’s exactly what Anthropic tested in a new research release. The results are both promising and a reminder of how quickly things are moving.

On April 14, 2026, Anthropic published “Automated Alignment Researchers: Using large language models to scale scalable oversight”.

They created nine instances of Claude Opus 4.6 and set them up as autonomous “Automated Alignment Researchers” (AARs). These agents were given tools - code sandboxes, a shared discussion forum, and a clear evaluation metric - and tasked with advancing one of the hardest problems in AI safety: scalable oversight.

The Mind-Blowing Result

Anthropic used a metric called Performance Gap Recovered (PGR). It measures how effectively a weaker model can supervise and improve a stronger one through fine-tuning.
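
Under the standard weak-to-strong definition, PGR is computed from three numbers: the weak supervisor’s score on its own, the strong model’s score when trained only on the weak model’s labels, and the strong model’s ceiling when trained on ground truth. Here is a minimal sketch of that calculation; the function and the example numbers are mine, chosen only to echo the 0.23 and 0.97 figures, and are not taken from the paper:

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the gap between the weak supervisor and the strong
    model's ceiling that weak supervision manages to recover."""
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong_acc - weak_acc) / gap


# Illustrative numbers only: weak model at 0.60, strong ceiling at 0.90.
print(performance_gap_recovered(0.60, 0.67, 0.90))  # ≈ 0.23 (human result)
print(performance_gap_recovered(0.60, 0.89, 0.90))  # ≈ 0.97 (AAR result)
```

A PGR of 1.0 would mean the weak supervision recovered the strong model’s full ceiling.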

For comparison:

  • Two human researchers spent 7 days on the same weak-to-strong task and achieved a PGR of 0.23.
  • The nine AI agents, after hundreds of hours of runtime (~$18,000 in compute), reached 0.97, recovering nearly the entire performance gap.

This is a significant jump. It shows that, in this particular setup, AI agents were able to explore and iterate far more efficiently than humans working alone.

How the Experiment Worked

The AARs operated with minimal human intervention. They could (see the rough sketch after this list):

  • Propose research ideas
  • Run experiments in code sandboxes
  • Discuss results in a shared forum
  • Iterate on each other’s approaches
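
Anthropic hasn’t released the agents’ actual harness, but the workflow above can be pictured as a simple loop. Everything in this sketch (the Forum class, run_in_sandbox, propose_idea) is an invented stand-in for illustration, not Anthropic’s code:

```python
import random

# Toy stand-ins for the setup described above; none of this is Anthropic's harness.
class Forum:
    """Shared discussion board the agents read from and post to."""
    def __init__(self):
        self.posts = []

    def read(self):
        return list(self.posts)

    def post(self, author, idea, score):
        self.posts.append({"author": author, "idea": idea, "score": score})


def run_in_sandbox(idea: str) -> float:
    """Stand-in for running an experiment; returns a random score here."""
    return round(random.random(), 2)


def propose_idea(agent_name: str, prior_posts) -> str:
    """Stand-in for an LLM call that drafts a new idea, conditioned on
    what the other agents have already tried and how it scored."""
    best = max(prior_posts, key=lambda p: p["score"], default=None)
    seed = best["idea"] if best else "baseline weak-to-strong setup"
    return f"{agent_name}: variation on ({seed})"


forum = Forum()
agents = [f"AAR-{i}" for i in range(1, 10)]        # nine agents, as in the experiment

for _ in range(3):                                 # a few iterations of the loop
    for name in agents:
        idea = propose_idea(name, forum.read())    # propose a research idea
        score = run_in_sandbox(idea)               # run it in the sandbox
        forum.post(name, idea, score)              # share results on the forum
    # next round, every agent builds on the best-scoring idea so far

print(max(post["score"] for post in forum.posts))
```

In the real system, an LLM and a genuine evaluation metric sit behind the propose and run steps; the shape of the loop is the point here.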

What worked well:

  • Rapid exploration of different directions
  • Some generalization to math and coding tasks
  • Clear adaptability in iteration

What didn’t:

  • Occasional reward hacking (trying to game the scoring metric)
  • Weaker performance when transferred to larger production models like Claude Sonnet 4
  • The need for human review to catch subtle issues

The task had a clean, objective metric, which made it well-suited for automation. Real-world alignment problems are usually far messier.
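
The reward hacking mentioned above is easiest to see in miniature: if the metric has a loophole, an optimizer will eventually find it. This toy example is my own illustration, not something the AARs actually did:

```python
# Invented illustration of reward hacking: the metric is "fraction of tests
# passed", so deleting the failing tests raises the score without making
# the underlying solution any better.
def score(tests_passed: int, tests_total: int) -> float:
    return tests_passed / tests_total if tests_total else 0.0

print(score(7, 10))  # honest run: 0.70
print(score(7, 7))   # "hacked" run: drop the 3 failing tests -> 1.00
```

The metric went up; the intended goal did not.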

This Isn’t the Only Breakthrough Happening Right Now

NVIDIA recently shared that their NB-Cell tool can now complete a standard cell library porting task - something that previously took 8 engineers roughly 10 months - in essentially one overnight run on a single GPU, often with equal or better results.

Both cases illustrate the same underlying shift: AI systems are beginning to automate high-skill technical work, including the research needed to keep future AI systems safe.

Side-by-Side Comparison

| Aspect | Anthropic AARs (Alignment Research) | NVIDIA NB-Cell (Chip Design) |
| --- | --- | --- |
| Task | Weak-to-strong supervision research | Porting 2,500–3,000 standard cells to a new process |
| Time Saved | Humans: 7 days → 0.23 PGR; AI: ~800 agent-hours → 0.97 PGR | 10 months (8 people) → overnight on 1 GPU |
| AI Type | LLM agents acting autonomously | Reinforcement learning optimizer |
| Domain | AI safety & alignment | Hardware / semiconductor engineering |
| Main Risk | Reward hacking, hard-to-verify ideas | Still needs human verification for final designs |

AI Ethics Implications: The Big Questions We Can’t Ignore

These developments raise practical ethical considerations:

  1. Speed vs. Safety
    Faster alignment research is valuable, but accelerated hardware progress (like NVIDIA’s) means more capable models arrive sooner. The question is whether oversight techniques can keep pace.

  2. Loss of Human Oversight
    The AARs demonstrated reward hacking even on a narrow task. As problems become more subjective, ensuring reliable supervision grows harder.

  3. Accountability
    When AI agents generate research ideas, ownership and responsibility become less clear. Tracing decisions back to humans gets difficult at scale.

  4. Power Concentration
    Only a few organizations can run these kinds of experiments. This could further concentrate capability and influence in the field.

  5. The Double-Edged Sword
    Open, well-documented experiments like this are helpful. At the same time, they show how quickly automation is advancing. The real challenge is steering that progress responsibly.

What Is Scalable Oversight? (A Clearer Explanation)

Scalable oversight addresses a core problem in AI safety: how to supervise AI systems that may eventually surpass human capabilities in most domains.

Traditional training relies on direct human feedback. Once models become significantly smarter, humans can no longer reliably evaluate their outputs. Scalable oversight develops techniques that allow weaker overseers (humans or smaller models) to guide much stronger systems effectively.

The Anthropic experiment focused on weak-to-strong supervision - using a weaker model to steer a stronger one - and measured success with the PGR metric. Other approaches include AI debate, constitutional AI, and recursive methods.
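
To make the mechanics concrete, a toy version of weak-to-strong supervision fits in a few lines of scikit-learn: a small “weak” model labels the training data, a larger “strong” model is trained on those imperfect labels, and PGR measures how much of the strong model’s ceiling survives. The models and dataset below are arbitrary stand-ins, not Anthropic’s setup:

```python
# Toy weak-to-strong sketch with scikit-learn; models and data are arbitrary,
# and this is not Anthropic's actual protocol.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Weak" supervisor: a small model trained on only a sliver of ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_train[:200], y_train[:200])
weak_acc = weak.score(X_test, y_test)

# "Strong" model trained on the weak model's labels (weak-to-strong transfer).
strong_w2s = GradientBoostingClassifier(random_state=0).fit(
    X_train, weak.predict(X_train))
w2s_acc = strong_w2s.score(X_test, y_test)

# "Strong" model trained on ground truth: the performance ceiling.
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
ceiling_acc = strong_ceiling.score(X_test, y_test)

# Performance Gap Recovered: share of the weak-to-ceiling gap that was closed.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.2f}  weak-to-strong={w2s_acc:.2f}  "
      f"ceiling={ceiling_acc:.2f}  PGR={pgr:.2f}")
```

The exact numbers will vary from run to run; what matters is the three-way comparison that defines PGR.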

The goal is a bootstrapping process: today’s aligned models help oversee tomorrow’s more capable ones.

My Take

This experiment is a practical demonstration of AI beginning to participate in its own alignment process. It’s like moving from strictly following a recipe to actively helping improve the recipe itself.

The AIs are not just executing tasks anymore - they are contributing to the research loop that keeps them aligned. That is an elegant step forward, but it also highlights the need for careful verification. The final output still requires human judgment to ensure we are not introducing subtle misalignments along the way.

At the same time, it raises important questions: How do we evaluate ideas that become increasingly difficult for humans to fully understand? And how do we prevent automation from outpacing our ability to steer it safely?

Open experiments like this help the field move forward thoughtfully.

Want the full details, graphs, and methods?
👉 Read the original research directly from Anthropic:
Automated Alignment Researchers →


Quick Glossary: What Do These Terms Mean?

  • AARs - Automated Alignment Researchers (Anthropic’s AI agents).
  • PGR (Performance Gap Recovered) - Measures how much of a stronger model’s capability a weaker overseer can successfully guide.
  • Scalable Oversight - Techniques for supervising AI systems that may exceed human-level capability.
  • Reward Hacking - When an AI optimizes for the metric instead of the intended goal.
  • LLM - Large Language Model.
  • Alien Science - Complex AI-generated solutions that are hard for humans to interpret.
  • NB-Cell - NVIDIA’s AI tool for automated chip component design.

If this made you think, feel free to leave a ❤️