返回首页
原创
原创观点
2026/06/27

2,000 Hackers Tried to Break This AI Assistant. It Didn't Flinch.

What happens when you invite the internet to break into your AI assistant? Usually, it ends in digital chaos. But a recent public challenge ended with an...

2,000 Hackers Tried to Break This AI Assistant. It Didn't Flinch.
AI安全
提示注入
大模型
网络安全
黑客挑战

What happens when you invite the internet to break into your AI assistant? Usually, it ends in digital chaos. But a recent public challenge ended with an unexpected twist: the AI held its ground against thousands of attackers.

Developer Fernando Irarrázaval created a honeypot at "hackmyclaw.com," issuing an open invitation to anyone on the web: try to extract a hidden secret from his AI assistant simply by sending it an email. The bait was irresistible. Over 2,000 participants took on the challenge, launching a barrage of 6,000 emails designed to confuse, manipulate, and trick the system.

The attackers were using a technique known as "prompt injection." Think of it as a Jedi mind trick for artificial intelligence. By using clever phrasing, role-play scenarios, or complex logical paradoxes, users try to override the AI's core instructions and force it to do something it shouldn't—like handing over sensitive passwords.

Yet, after 6,000 attempts, the AI didn't leak a single character of the secret. The only real casualties of the experiment were Irarrázaval's wallet—he racked up $500 in API token costs—and his Google account, which was temporarily suspended due to the massive, suspicious influx of inbound emails.

How did the AI survive the onslaught? The underlying model (Opus 4.6) was armed with a very strict set of "Anti-Prompt-Injection Rules." It was explicitly instructed that, regardless of what an email said, it must never reveal credentials, modify its own files, or execute code based on the message.

This experiment highlights a significant shift in the AI landscape. In the early days of generative AI, tricking a chatbot into breaking its own rules was notoriously easy. Now, major labs are successfully hardening their frontier models against these social engineering tactics. As noted in recent technical documents like the GPT-5.6 system card, immense effort is being poured into making these models resilient to manipulation.

However, a perfect defense in one experiment isn't a lifetime guarantee. Security experts maintain a healthy skepticism. Surviving 6,000 amateur and semi-professional attempts doesn't mean the vault is completely impenetrable to a highly sophisticated, targeted attack. For now, putting an AI in charge of irreversible, high-stakes actions remains a risky bet. The digital walls are certainly getting taller, but hackers are always looking to build longer ladders.

Key Points

  • A developer invited the public to hack his AI assistant via email to extract a hidden secret.
  • Despite 2,000 participants and 6,000 prompt injection attempts, the AI successfully protected the data.
  • The AI's resilience was due to strict, hardcoded rules forbidding it from acting on malicious email commands.
  • While AI models are becoming much harder to trick, experts advise against trusting them with critical, irreversible tasks.

Why It Matters

As we increasingly integrate AI assistants into our inboxes and daily workflows, their ability to resist manipulation is the primary shield protecting our private data from malicious actors.


Sources: