Terminal-Bench 3.0 Development Kicks Off

#Tech

Terminal-Bench 3.0, the next version of the Terminal-Bench benchmark for AI agents, is now in active development.

The release targets 100 diverse, challenging tasks, with the best models expected to solve at most 30% of them at release.

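Terminal-Bench 2.0 covered software engineering, system administration, security, and scientific computing; version 3.0 expands to an even wider variety of domains, and contributors from every field are welcome to help build it.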

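Contributors are asked to create realistic, valuable computer tasks that come with clear instructions and reliable verification. Longer-horizon tasks and richer environments are encouraged.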

Opening of the original post (English, first 3 paragraphs only)

We're excited to announce that Terminal-Bench 3.0 is now in active development — the next version of Terminal-Bench.

Our goal for Terminal-Bench 3.0 is 100 diverse tasks targeting at most 30% solve rate from the best models at release. We want tasks that are genuinely difficult — longer-horizon, richer environments, and requiring specialized expertise.

Terminal-Bench 2.0 covered software engineering, sys-admin, security, and scientific computing. For Terminal-Bench 3.0, we're expanding to an even wider variety of domains. Any realistic, valuable, and challenging computer task that can be accomplished via the command line and programmatically verified is fair game.

Note: For copyright reasons, only the first 3 paragraphs are quoted. Please read the original post for the full content.
