エージェント構築前に計算すべき複合確率の罠

2026年05月10日 #Tech

複雑なAIエージェントは、個々のステップが成功しても複合確率により、全体の成功率が劇的に低下するという信頼性の課題を抱えている。

加えて、外部コンテンツを読み取るエージェントは、プロンプト注入攻撃に対し構造的な脆弱性を内包している。

この信頼性の問題を回避するには、過度な複雑化を避け、可能な限りシンプルな設計を維持することが推奨される。

セキュリティ対策としては、エージェントの権限を限定するスコープ最小化が必須であり、機密情報と処理を行うLLMを分離するデュアルLLMアーキテクチャが有効である。

原文の冒頭を表示（英語・3段落のみ）

Here's a calculation your team probably hasn't done: 0.95^10.It equals 0.599.If you build a 10-step agent where each step succeeds 95% of the time — which sounds high — your end-to-end success rate is approximately 60%. Four out of ten runs, something in the chain goes wrong.Most teams building agents never run this math. They test individual tools, each one looks solid, and they ship feeling confident. Then production arrives and shows them what compound probability actually means.This is the autonomy trap. It isn't a model quality problem. It's arithmetic.The cultural pressure to build agents is real. Agents are impressive in demos. They signal that you're working at the frontier. They're easier to fundraise around than "we built really solid evaluation infrastructure."So teams extend their systems. Five steps, then eight, then twelve. Each step passes its tests. Each tool call returns results that look right. The unit-test view of the world says: all components are working, therefore the system works.The compound probability view says: six 95%-reliable components give you 74% end-to-end reliability. Ten give you 60%. Twenty give you 36%.The troubling thing is that this failure mode is invisible unless you're explicitly tracking end-to-end success rate as a metric. Most teams aren't. They track component metrics. The system can fail one in three interactions while every individual component shows green.Anthropic's engineering team — after working with dozens of organizations deploying agents across industries — found that the most successful implementations weren't the most architecturally ambitious. They were the simplest. Their guidance reads less like enthusiasm for complex agents and more like earned caution: before adding another step, ask whether the task could be handled with a single well-designed LLM call.The reliability problem is bad. The security problem is worse.Prompt injection is the defining vulnerability of the AI agent era. The mechanism is simple: any agent that reads external content — emails, web pages, documents, user messages — is exposing its reasoning loop to untrusted input. And any of that untrusted input can contain instructions.Simon Willison, who has been writing about prompt injection since 2022, defines it precisely: it's the vulnerability that exists when an application concatenates instruction prompts with untrusted content, and the model fails to reliably distinguish between "these are my instructions" and "this is data I'm working with."In a chatbot, this is a minor embarrassment. In an agent — one with the ability to send emails, update databases, make API calls — the scope of harm is the full scope of what the agent can do.Willison demonstrated this in 2023 with an email assistant prototype. Send a carefully crafted email, and the assistant would forward the user's entire inbox to an attacker's address — without any explicit user request. The assistant wasn't malfunctioning. It was following instructions that happened to come from a malicious email rather than its owner. He calls this the "confused deputy" problem: a system with elevated privileges that can be manipulated by an adversary who has less privilege.The uncomfortable reality: every AI agent that reads external content and can take external actions is structurally vulnerable to this attack. The degree of risk scales with what the agent can do and how much external content it processes. The vulnerability cannot be fully closed with current LLM architectures, because the LLM is simultaneously the component that executes instructions and the component that reads the untrusted data containing those instructions.Given that prompt injection can't be solved at the model level, the practical answer is scope minimization: limit what the agent can do so that a successful injection has bounded consequences.An agent that only needs to read documents should not have write access. An agent that drafts emails should not be able to send them without explicit human confirmation. An agent searching internal systems should not have access to external communication channels.Willison proposed a Dual LLM architecture for this: separate the agent into a privileged LLM that has tool access and takes actions, and a quarantined LLM that reads untrusted external content. The quarantined LLM never gets tool access. Its outputs are treated as data by the privileged layer, not as instructions. It doesn't fully prevent injection — but it substantially raises the difficulty.Every additional capability is additional attack surface. The math compounds in both directions: reliability degrades with each step added, and attack surface grows with each tool granted.Build the simplest agent that does the job. Add complexity only when you can absorb the failure rate it introduces. And before you grant the agent any new capability, ask what happens when an adversary is holding the other end of every external input it reads.This post expands on Chapter 8 of Wrong by Default: What AI Builders Know That Everyone Else Doesn't by Alokit. Available on Kindle ($7.99): amazon.com/dp/B0GZCY9CGFNo posts

※ 著作権に配慮し、引用は冒頭3段落までです。続きは元記事をご覧ください。

— 元記事を読む ↗

元記事を読む ↗