Continual Learning on Harvey's Legal Agent Benchmark: First Results

Applying our learning layer to Harvey LAB's open-source benchmark, and what it surfaces.

Amit Tandon
Founder, Rekursor · 9 min read

Our mission in continual learning for AI agents begins with applying our learning layer to the open-source benchmark Harvey AI released two weeks ago. At the commit we ran, Harvey LAB listed 1,251 tasks across 24 practice areas of law, with 75,000+ expert-written rubric criteria and a published evaluation harness. It is a serious piece of work and the right shape for the legal AI category to evaluate against.

Rekursor's objective is continual learning for AI agents: a learning layer that lets agents accumulate domain expertise across tasks, customers, and models, without fine-tuning or weight changes. Across enough tasks, the system accumulates a layer of legal know-how the agent can draw on, subject to customer isolation and governance controls. Legal AI is our first vertical, and the tasks and rubrics Harvey has open-sourced are exceptionally well placed to illustrate where our learning layer adds something the agent's existing capabilities can't supply on their own.

This post records our first foray in what we hope will be a continuing effort. It is not a one-and-done result but an introduction to how our system performs on the substrate Harvey has contributed to the community, for the benefit of legal AI and of learning in AI agents more generally.

This is a first, single-task result, not a corpus-wide claim. We're publishing it because the setup is public, the missed criteria are inspectable, and the before/after is clean.

The first result, on Harvey's benchmark, with Harvey's own harness, agent and judge:

Baseline run. On the corporate-governance/review-board-resolutions/scenario-01 task, the agent, powered by claude-sonnet-4-6 via Harvey's published harness, scored 45/48 at baseline, a FAIL under Harvey's all-pass discipline.

Rekursor run. Rekursor converted the first-run trace into two reusable skills, which were loaded on the next pass. With both skills installed and the harness unmodified, the same agent on the same task scored 48/48: ALL-PASS. Zero regressions on the 45 previously-passing criteria.

The lift came from reusable skills, not from changing the model or editing the benchmark harness.

This continues the thread from our first post (a learning layer that makes frozen models effectively smarter) and our second post (why harness optimization plateaus and what comes after).

The substrate

Harvey LAB launched May 6 as the legal AI category's open benchmark. At the commit we ran, it listed 1,251 tasks across 24 practice areas of law, with 75,000+ expert-written rubric criteria. The grading discipline is all-pass: every criterion has to pass for the task to pass. This is harsher than partial-credit scoring and closer to how actual legal review works.
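To make the difference between all-pass and partial-credit scoring concrete, here is a minimal sketch in Python. It is not Harvey's harness code: the `TaskScore` class and its property names are hypothetical, introduced only for illustration. The counts are the ones reported in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskScore:
    """Hypothetical container for one task's rubric result (not Harvey's API)."""
    passed: int
    total: int

    @property
    def all_pass(self) -> bool:
        # All-pass discipline: a single failed criterion fails the whole task.
        return self.total > 0 and self.passed == self.total

    @property
    def partial_credit(self) -> float:
        # Partial-credit view: fraction of criteria that passed.
        return self.passed / self.total


baseline = TaskScore(passed=45, total=48)
with_skills = TaskScore(passed=48, total=48)

for label, score in [("baseline", baseline), ("with skills", with_skills)]:
    verdict = "PASS" if score.all_pass else "FAIL"
    print(f"{label}: {score.passed}/{score.total} "
          f"({score.partial_credit:.2%}) -> {verdict}")
```

Under partial credit, 45/48 reads as 93.75%. Under all-pass, it is simply a FAIL, which is why the last three criteria matter so much.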

The benchmark uses claude-sonnet-4-6 both as the reviewing agent and as the LLM judge; the judge runs Harvey's rubric_criterion prompt at temperature 0.0. We used Harvey's agent and harness exactly as published, then ran the same task again with Rekursor's learning layer attached.

The task we'll focus on is corporate-governance/review-board-resolutions/scenario-01. The materials are the board documents of a synthetic Delaware clinical-stage pharma company: 10 documents, graded against 48 rubric criteria. The agent is asked to produce a comprehensive issues report.

The benchmark itself is open: anyone with Anthropic API access can clone Harvey LAB, run the harness, and grade against Harvey's published rubric. The task and rubric we report on are publicly available; the run artifacts (transcript, outputs, per-criterion scores) are available on request.

The baseline

Harvey's own agent, run with claude-sonnet-4-6 through Harvey's published harness, scored 45/48 on this task, missing three specific criteria:

  • C-011: "Notes adding Dr. Cromdale Consulting still leaves committee with non-independent member." The agent identified two related issues (the Audit Committee's missing third member, Sofia Chen's questionable independence) but failed to connect them. Adding a third member doesn't cure the independence problem if the third member is also non-independent. The synthesis between the two findings was absent.

  • C-014: "Identifies interested directors voting on own compensation." The agent caught an arithmetic error in the executive bonus pool (the stated cap of $840,000 didn't match the sum of individual maximums of $700,000). It did not flag that Dr. Vasquez and Dr. Obi voted on Resolution 4, the resolution that approved their own bonuses. Interested directors voting on their own compensation is the textbook conflict-of-interest scenario.

  • C-015: "References DGCL §144 or interested director safe harbor for bonus issue." Because the underlying interested-director issue wasn't identified, the corresponding statutory framework (Delaware General Corporation Law §144's safe harbor for interested-director transactions) wasn't cited. (The task materials are synthetic and reflect a specific point in Delaware corporate law; we're not making a live doctrinal claim, just noting what the rubric required.)

These are three different shapes of the same underlying gap: the agent is excellent at surface-level issue identification (arithmetic errors, missing signatures, charter deficiencies) but does not consistently surface conflict-of-interest framing where the substrate requires it.

This is what a plateau looks like at frontier-model competence: the agent doesn't fail by misreading documents or making arithmetic mistakes. It fails by not seeing things the rubric requires.

What broke through

We ran Rekursor on the first-run trace. Without changing the model, task documents, rubric, judge, or Harvey harness, Rekursor produced two reusable skills for the next pass.

With those skills loaded, the same agent on the same task scored 48/48: ALL-PASS. All three previously failed criteria flipped, and the 45 previously passing criteria remained passing.

  • C-011 (synthesis): the report now explicitly notes that Cromdale's appointment doesn't cure independence because Sofia Chen remains non-independent on the committee.
  • C-014 (interested-director identification): the report identifies Vasquez and Obi as interested in Resolution 4, names the conflict, and notes the absence of recusal.
  • C-015 (statutory framework): DGCL §144 is cited specifically in connection with the bonus-pool issue, with the safe-harbor provisions named.
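The before/after check described above (which criteria flipped, and whether anything regressed) can be sketched as a comparison of per-criterion results. This is an illustrative sketch, not the harness's own output format: `compare_runs` is a hypothetical helper, and the assumption that criteria are numbered C-001 through C-048 is ours for the example. Only C-011, C-014, and C-015 are taken from the results reported here.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]):
    """Return (flipped, regressions) between two per-criterion result maps."""
    if before.keys() != after.keys():
        raise ValueError("both runs must be graded against the same criteria")
    # Flipped: failed in the first run, passed in the second.
    flipped = sorted(c for c in before if not before[c] and after[c])
    # Regressions: passed in the first run, failed in the second.
    regressions = sorted(c for c in before if before[c] and not after[c])
    return flipped, regressions


# Assumed ID scheme for illustration: C-001 ... C-048.
criterion_ids = [f"C-{i:03d}" for i in range(1, 49)]
missed_at_baseline = {"C-011", "C-014", "C-015"}

before = {c: c not in missed_at_baseline for c in criterion_ids}
after = {c: True for c in criterion_ids}

flipped, regressions = compare_runs(before, after)
print("flipped:", flipped)
print("regressions:", regressions)
```

Requiring identical criterion sets in both runs is the point of the check: a lift only counts if the rubric did not change between passes.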

[Figure: Baseline vs. with Rekursor on Harvey LAB scenario-01]

First of more work

This is one task. We are not yet making claims about every Harvey LAB task. We are claiming this specific result on a public benchmark, with the agent and judge Harvey itself recommends, with run artifacts available on request. Subsequent posts in this series will report results on additional tasks and practice areas as we work through them.

We've validated the underlying mechanism elsewhere. On a synthetic Delaware board-resolution substrate built before Harvey LAB was released, three reproducible runs lifted scores from a 13/20 plateau to 16/20, 20/20, and 16/20 with the same intervention shape. On two unrelated Harvey LAB tasks at the Qwen 235B model tier (a data-privacy breach-notification task and a corporate-governance entity-compliance task), the first of the two skills lifted scores by 6 criteria each. The Harvey LAB 48/48 is the cleanest single result; it's not the only one.

The result is meaningful for three reasons.

It's on the community's benchmark. Harvey LAB is likely to become one of the buyer-visible evaluation frames for legal agents. A demonstration on Harvey LAB is a demonstration in the buyer's evaluation framework.

It's on a frontier model. Sonnet-4-6 was already passing 93.75% of criteria on this task before we touched it. The remaining 6.25% looked less like document-reading failure and more like a missing issue-framing pattern. Closing the gap took something the agent didn't already have, not more model intelligence.

The result is audit-friendly. Nothing about the agent, the documents, the judge, or the grading rubric was modified between the two runs. The skills were loaded into Harvey's harness through Harvey's own skill-loading mechanism, and the harness itself was not modified. The run artifacts (transcript, outputs, per-criterion scores and reasoning) are available on request.

Where this is going

The 48/48 is a single-task result. The bigger work is corpus-scale.

Next, we will run our learning layer across Harvey LAB's full task corpus. Each task contributes to what the system has accumulated across the others, within customer-isolation and governance controls. Held-out tasks, ones the system has never seen, get scored before and after our learning layer is attached. The expectation is that what the system has built up from prior tasks lifts agent performance on tasks none of that prior work was specific to, the way a senior law partner's accumulated experience helps on novel matters.
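A minimal sketch of how such a held-out evaluation could be set up, assuming a seeded random split. The 20% held-out fraction, the seed, the placeholder task IDs, and both function names are assumptions made for illustration; they are not the protocol that will be used.

```python
import random


def split_tasks(task_ids, held_out_fraction=0.2, seed=0):
    """Return (learning_ids, held_out_ids) as two disjoint lists."""
    ordered = sorted(task_ids)  # sort first so the split ignores input order
    random.Random(seed).shuffle(ordered)
    n_held_out = round(len(ordered) * held_out_fraction)
    return ordered[n_held_out:], ordered[:n_held_out]


def all_pass_rate(results):
    """Fraction of tasks where every criterion passed.

    `results` maps a task ID to a list of per-criterion booleans.
    """
    if not results:
        return 0.0
    return sum(all(criteria) for criteria in results.values()) / len(results)


# Placeholder IDs standing in for the 1,251 tasks; real task names differ.
task_ids = [f"task-{i:04d}" for i in range(1251)]
learning_ids, held_out_ids = split_tasks(task_ids)
print(len(learning_ids), len(held_out_ids))
```

The learning layer would only ever see `learning_ids`. The comparison of interest is `all_pass_rate` on the held-out tasks without the layer versus with it, using the same agent, judge, and rubric in both passes.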

That's what continual learning at corpus scale looks like, and that's what we'll report on next. Today's 48/48 on one task is the entry point; what the system accumulates across the corpus is the product.

[Figure: Three-stage architectural arc from one task to corpus-scale to cross-model]

The shape of subsequent posts in this series:

  1. More single-task results. Additional Harvey LAB tasks across different practice areas: corporate M&A, employment, data-privacy, fiduciary-duty memos, tax. Each post is one substrate, one before-and-after, audit-friendly throughout.
  2. The corpus-scale result. Once we've run the learning layer across the full Harvey LAB corpus, we will publish the held-out evaluation: tasks the system has never seen, scored before and after our learning layer is attached. This is the structural demonstration of continual learning at corpus scale.
  3. Cross-model results. Our learning layer is model-independent by design. Subsequent posts may include cross-model results: what the same accumulated learning does when attached to different frontier agents.

What harness optimization can't do here

In our previous post, we argued that harness optimization improves how an agent uses what it already knows to look for, but does not give it new things to look for. When the missing capability is one the agent's existing toolkit cannot represent, like interested-director identification when the agent reliably catches surface defects but never connects the bonus-pool arithmetic to the directors' personal financial stakes, no amount of additional hook engineering will produce it. There is nothing in the search space to find.

The Harvey LAB 48/48 isn't an exception to that. It's the same pattern at a sharper data point. The three missed criteria weren't surface-level failures the harness could optimize away. They needed something the agent didn't start with.

Working with us

If you're working with legal AI agents (whether your own custom build, Harvey's, or one of the other systems in the category) and want to see what our learning layer surfaces about a specific structured task you have in production, we'd like to talk. Send us a trajectory; we'll run it and show you what your agent is missing.