Harvey Releases the Legal Agent Benchmark
A New Standard for Evaluating Legal AI Performance

Harvey has announced the Legal Agent Benchmark (LAB), an open-source benchmark for evaluating AI agents that support real-world work done by lawyers.
The benchmark is built to mimic how work actually flows through a law firm: an instruction, a client matter, and the production of a reviewable work product.
LAB aims to evaluate agent capabilities on long-horizon, complex legal work that goes beyond short-horizon tasks.
The first version includes more than 1,200 tasks across 24 legal practice areas, giving law firms a basis for measuring the ROI of AI investments and for making sound deployment decisions.
A new yardstick for measuring real-world effectiveness has arrived in a legal industry where AI-assisted work is advancing rapidly. Harvey has published Harvey’s Legal Agent Benchmark (LAB), an open-source benchmark that simulates the complex work legal professionals perform day to day. The company describes it as an attempt to objectively evaluate how far AI agents can contribute to actual legal work.
Task Design Grounded in Real Practice

LAB is designed to mimic how work is assigned, performed, and reviewed at a large law firm. Each task consists of an instruction, a client matter containing the relevant materials, and a requirement that the agent produce a work product. The benchmark therefore evaluates the ability to carry out long-horizon, complex work rather than simple knowledge lookup or short question answering, and is expected to serve as an important indicator when measuring the return on investment (ROI) of AI.
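The task anatomy described above (a short instruction, a client-matter file set, and a required deliverable) could be represented roughly as follows. This is a minimal sketch for illustration only; the class names, fields, and file paths are hypothetical and are not taken from the actual LAB repository.

```python
from dataclasses import dataclass, field

@dataclass
class ClientMatter:
    """Closed-universe environment: the only materials the agent may use."""
    matter_id: str
    files: dict[str, str] = field(default_factory=dict)  # path -> contents

@dataclass
class LabTask:
    """One benchmark task: a partner-style instruction plus a client matter."""
    instruction: str      # short, affirmative request for work
    matter: ClientMatter  # mix of key and peripheral documents
    deliverable: str      # reviewable work product the agent must produce

# Hypothetical example modeled on the benchmark's described structure.
task = LabTask(
    instruction=("Review the data room, identify change-of-control "
                 "provisions, assess the risk to the deal, and prepare "
                 "a draft memorandum for the deal team."),
    matter=ClientMatter(
        matter_id="example-acquisition",
        files={
            "contracts/software_license.txt": "...",
            "financials/10-K.txt": "...",
        },
    ),
    deliverable="memo_change_of_control.md",
)
```

The point of the structure is that the environment is closed-universe: everything the agent needs (and plenty it does not) is inside `matter.files`, and success is judged on the single named deliverable.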
How It Differs from Existing Evaluations

Evaluations of legal AI already exist, including LegalBench and LEXam, but they have focused mainly on short-horizon reasoning such as reading contracts or comparing cases. LAB aims to extend the pattern seen in coding, where benchmarks made a step-function improvement in capability visible, to the knowledge work of legal practice, providing a legible index of AI progress.
Open-Source Release and Community Collaboration

LAB is published as open source so that model providers, agent builders, researchers, and law firms can measure progress against a shared yardstick. Notably, Harvey is intentionally launching without a leaderboard: the dataset is expected to evolve, and the company wants to work with the community to ensure that results are interpreted clearly and intuitively. Baseline results, produced with research partners, are planned for release in the coming weeks.
Summary

The introduction of LAB is likely to be an important milestone in measuring how AI is woven into real-world professional services. Through the benchmark, law firms are expected to be able to make more responsible decisions about where and how to deploy AI.
Original article text (English)
We’re introducing Harvey’s Legal Agent Benchmark (LAB), an open-source benchmark for legal agents. LAB was built to evaluate and improve agent capabilities for supporting real world work done by lawyers. Each task consists of an instruction, a client matter containing relevant materials, and a requirement that the agent produce a work product for review. This structure is designed to mirror how work is assigned, performed, and reviewed at large law firms.

The goal of LAB is to provide a clear picture of how agents can be deployed to support legal work in the real world. By articulating where agents can do all, some, or none of a task, LAB helps law firms measure the ROI of AI investments and where such investments can augment their teams’ work.

The first version of LAB includes more than 1,200 agent tasks across 24 legal practice areas, and is evaluated by over 75,000 expert-written rubric criteria. We are open-sourcing LAB to give model providers, agent builders, researchers, and law firms a shared way to measure progress on long-horizon legal agents.

We’re intentionally launching LAB without a leaderboard because we expect the dataset to evolve over time and we want to work with the community to ensure results are clear and intuitive in how they convey agent performance. In the coming weeks we will work with research partners to get baseline results on LAB and publish a leaderboard to provide perspective on the current state of legal agents. In addition, we will publish standards for normalizing submissions to allow people to identify improvements and track benchmark progress as the benchmark itself evolves with new tasks, practice areas, and knowledge work disciplines.

You can find the open-source version of LAB here.

Why we built an agent benchmark

At Harvey, we have spent the past year building agents for the legal domain.
This has forced us to think deeply about how to measure their performance on real-world legal tasks.

To date, there has not been a benchmark that illustrates agent progress for long-horizon legal work. Existing evaluations, including LegalBench, CUAD, LEXam, and our own earlier work on BigLaw Bench, have graded short-horizon reasoning: read a contract, answer a question, compare cases, or analyze an argument.

In coding, agent benchmarks have served as an important leading indicator of agent capability. For example, agent scores on SWE-Bench Pro, SWE-Bench Verified, and Terminal-Bench 2.0 reflected a step-function improvement around the same time our engineering team started to feel the shift in practice; a moment crystallized by Karpathy’s observation that coding agents “basically didn’t work before December and basically work since”.

[Figure: Agent Inflection Point]

That pattern is now extending beyond coding. Benchmarks such as GDPval, OSWorld-Verified, BrowseComp, MCP Atlas, FinanceAgent, Humanity’s Last Exam, and APEX-Agents have helped make progress legible across real-world knowledge work, computer use, web research, tool use, financial analysis, frontier reasoning, and professional-services tasks.

LAB is intended to provide this same legible index to law firms seeking to deploy and maximize ROI from AI systems. Understanding where agents are capable across practice areas allows firms to identify opportunities to deploy them in order to accelerate client value while moving lawyers to a high-leverage delegate and review pattern. Identifying areas where agents struggle also allows firms to understand where task execution should remain heavily human-in-the-loop.
Combined, this context allows law firms to make deployment decisions that are responsible, secure, and maximally effective for their clients.

Initial results on LAB for leading open and closed-source models will be published in the coming weeks following additional input from research partners and feedback from the community. These conversations will ensure that benchmark scores represent agent performance in a clear, unbiased, and transparent manner. This release will also provide guidance for submitting benchmark runs under normalized conditions to allow third parties to communicate improvements made on any part of the agent stack and to track how scores change as the benchmark evolves with new tasks and other extensions.

A strong public benchmark doesn’t only measure progress, it helps accelerate it. We are already seeing LAB being used by both our internal teams for product evaluation and the larger research community to explore open-weight post-training, auto-research, memory, domain-specific legal skills, and harness optimizations for long-horizon agent work. We’re excited for the potential of LAB to create a shared foundation for progress on legal agents, and to help bridge the AI research and legal communities.

LAB is a client matter-centric benchmark. It is built to mirror how legal work is actually delivered at a law firm, and we’ve mapped each step of the agent execution and evaluation process to its legal counterpart:

Instructions: The agent’s instructions are written as a request for work from a partner to an associate. Instructions are given as affirmative statements of what is required rather than detailed explanations on expected outputs or style.

Environment: The agent’s environment is a client matter. The client matter defines a closed-universe set of documents and other materials that are needed for the law firm to complete work on behalf of the client.
These documents can include matter files, firm templates, email communications, and other information that an agent must discover and sort through in order to accomplish the instructions.

Output: To complete tasks successfully, the agent must produce reviewable legal work product.

Verification: The agent’s work is graded by expert rubrics, which outline what a correct answer must produce in terms of format, facts, and analysis. These criteria emulate the scrutiny work product undergoes when handed off to partners and clients.

[Figure: Mirroring Legal Work]

This client matter-centric structure is the foundation for how we define each task in LAB. Each task is designed not just to test whether an agent can answer a legal question, but whether it can navigate open-ended assignments.

Defining a task

Each task in LAB is built to test agent capabilities to navigate real world work. Instructions are short, averaging just fifty words. Environments mix key and peripheral files and embed issues across multiple documents. An agent must start from a loose instruction, build context across the matter files, and use that context to produce the relevant deliverables.

As an example, one corporate M&A task asks the agent to analyze change-of-control provisions in connection with the proposed (fictional) acquisition of Crestview Software Solutions in a $458 million, 100% equity transaction.

[Figure: Example M&A task]

As input, the agent receives access to a file system containing the background on the deal, a virtual data room, and a short instruction from the partner asking the agent to review the data room, identify change-of-control provisions, assess the risk to the deal, recommend next steps, and prepare a review-ready draft memorandum for the deal team and Board. The data room contains a mix of relevant documents – including eight material contracts and other adjacent materials (e.g. 10-K, deferred compensation plan, etc.) that may or may not be relevant to the analysis.
To complete the task, the agent must determine which files matter, read them in context, and synthesize the relevant provisions across the full matter.

The required output is a reviewable deal-team memo. It must include, amongst other things, an executive summary with a risk mapping, a contract-by-contract analysis of the relevant provisions, severity ratings, and recommendations on how to mitigate each risk identified.

Evaluating a task

When the agent completes the task, the final output(s) are graded against expert rubrics intended to reflect detailed review by a partner or client. Each rubric breaks down what these stakeholders would scrutinize in a submitted deliverable into atomic, binary pass/fail criteria: facts, conclusions, citations, severity ratings, recommendations, deadlines, dollar amounts, and formatting choices. Each criterion is tied to a specific deliverable file. This also makes the rubric consistent across runs. The same criteria can be applied by an LLM judge, used to compare models and harness changes, and fed back into agent training loops as per-criterion reward signals.

[Figure: Evaluating a task]

For the change-of-control task, the rubric contains 57 criteria covering nine legal issues planted across the matter. Each issue is broken into four to nine criteria, covering the underlying facts, severity rating, financial exposure, and recommended action.

The rubric criteria span a range of complexity, from straightforward checks, such as whether the report identifies that the Pinnacle license converts exclusivity to non-exclusivity after a change of control, to checks on more detailed work product, such as whether the report calculates aggregate financial exposure, builds a consolidated consent and waiver timeline, and reconciles inconsistent change-of-control definitions across the agreements.

Importantly, a task is marked complete only if every criterion passes, a concept we refer to as all-pass grading.
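As a rough illustration of all-pass grading over binary rubric criteria, the logic might look like the sketch below. The class and function names are hypothetical and are not part of the actual LAB harness; in the real benchmark the per-criterion verdicts come from expert rubrics applied by an LLM judge.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # atomic, binary check (a fact, citation, severity rating, ...)
    passed: bool      # verdict produced by the judge for this run

def grade(criteria: list[Criterion]) -> dict:
    """All-pass grading: the task counts as complete only if
    every single rubric criterion passes."""
    n_passed = sum(c.passed for c in criteria)
    return {
        "criteria_passed": n_passed,
        "criteria_total": len(criteria),
        "task_complete": n_passed == len(criteria),
    }

# A memo that catches only 8 of 10 planted issues fails the whole task,
# even though 80% of the individual criteria pass.
rubric = [Criterion(f"issue {i}", passed=(i < 8)) for i in range(10)]
result = grade(rubric)
```

Because each criterion is a separate binary signal, the same structure can double as a per-criterion reward for training loops even while the task-level score stays all-or-nothing.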
A deal-team report that identifies eight of ten risks is not 80% useful; it is materially incomplete. The missing issue could change deal economics, require the analysis to be redone before closing, or surface as a problem after the deal closes.

“All-pass grading reflects how high-stakes legal work is reviewed in practice — there is no partial credit for catching most of the issues.”

Practice area distribution

The change-of-control task is one of 1,250 tasks in LAB, spanning 24 legal practice areas. Large firms have hundreds of distinct practices, so this initial release focuses on a representative set of transactional, advisory, regulatory, and litigation work that associates regularly encounter.

[Figure: Practice area distribution]

To scale the benchmark, we started with real client matters handled by practicing lawyers in each practice area. We then broke those matters into the discrete tasks an associate would typically be delegated to complete. The 24 practice areas in this release are not exhaustive, and many task types within those areas remain uncovered. Future releases will add more tasks within existing areas, expand into additional law firm practices, and move beyond law firms to cover in-house legal work and adjacent knowledge-work domains such as asset management and banking.

Building with the community

LAB is our first fully open-source benchmark.
As we continue to expand and develop LAB we are interested in collaborating with:

Lawyers: To validate and improve the benchmark by reviewing existing tasks, auditing rubrics, and contributing new task families that reflect the work they actually do.

Law Firms: To help us better capture the workflows, deliverables, and review standards that define the work they do and to ensure that the benchmark helps them understand the state of agents in a meaningful and actionable way.

Legal Technologists: To build domain-specific skills, tools, and agents for legal workflows.

Agent Researchers: To improve the planning, retrieval, tool use, memory, and harness design for long-horizon work.

AI Labs: To work on post-training models that are better at producing reliable legal work product.

“Our goal is to not only provide a transparent way to measure performance and progress in legal agents, but also to accelerate research progress within the legal and AI communities.”

We’d also like to acknowledge the research groups who have already made contributions to the benchmark and the research directions that it enables.

The open-source release is meant to expand this work. We want model providers, startups, researchers, legal AI companies, and law firms to run the benchmark, audit the rubrics, improve the harness, contribute new task families, and help define what legal agent evaluation should measure next.

What’s next

This first version of LAB addresses several important challenges in evaluating legal agents, but it is only the beginning.
Over time, we plan to expand LAB in three main areas:

Expanding coverage to all Biglaw practice areas and coverage of tasks within those practice areas.

Expanding coverage beyond law firms to cover other professional service workflows, from in-house counsel to non-lawyers like asset managers, bankers, and tax professionals.

Improving task coverage and sample diversity with an emphasis on building datasets that can be used not only for evaluation but for improving models through fine-tuning and training.

Our overall goal is to help the legal and AI communities understand where agents are useful today, and how to make them better over time. That will require input from more than Harvey. We need researchers, lawyers, and firms using agents in real client work to help test the benchmark, pressure-test the tasks, and tell us where the signal is clearest or missing. For law firms in particular, LAB is meant to help understand which parts of a workflow can be delegated to agents, where lawyer review matters most, and what capabilities still need to improve.

In the coming weeks, we’ll be sharing initial results benchmarking a variety of models and agents on this benchmark. If you have feedback on LAB, want to conduct research using the dataset, or are interested in evaluating specific tasks or outcomes, please reach out to any member of the Harvey research team listed in the blog byline.

Acknowledgements

This work would not have been possible without the tremendous support of our internal teams at Harvey and our external partners. Spencer Poff was the technical lead for the open source portion of the GitHub repo, spearheading much of the early harness design, agent sandboxing, and work to make the benchmark open-source. Julio Pereyra led task design, developing a novel document and scenario generation pipeline that helped us scale task creation.
Special thanks also go to Nick Gonella from our Security team; Chris Paradis, Gary Lam, Bronwyn Austin, and Jinfeng Zhuang from our AI Platform team; Phil Cerles and Philip Lan from our Assistant team; Laura Toulme, Blake Chizen, and Nick Gillies from Applied Legal Research; Shawn Farsai from Brand; and Nico Belmonte, Tara Waters, Ryan Samii, Joe Marando, Farrah Pepper, and Joe Cohen for ongoing feedback on the benchmark. We’d also like to thank Dan Biderman, Neel Guha, Velen Wu, and Reinhard Heckel for providing feedback on this post.
Note: The original article text is quoted above; see the source post for the accompanying figures and links.