TraceBase Technical Report - v1.1

Coding agents do not need more context. They need remembered work.

This report measures TraceBase as a repeated-work layer: the first run solves from scratch; the second run receives only compact memories from prior solved work. The question is not whether the model is smarter. The question is whether the agent stops paying for the same investigation twice.

resolved rate

+20 pp

63% -> 83%

avg cost

-43%

high-confidence matches, n=26

avg steps

-26%

31 -> 11 on best run

regressions

0

8 new fixes

summary

Measured benchmark. Hard grader, clean lift.

The benchmark uses 40 SWE-bench Verified tasks and 80 total agent runs: each task runs once as a baseline and once with TraceBase attached. All 40 pairs completed cleanly. Across them, resolved rate moved from 63% (25 of 40) to 83% (33 of 40), a +20 percentage-point gain. On the 26 high-confidence matches, the agent spent 43% less and took 26% fewer steps on average.

This is the benchmark TraceBase is built for: repeated engineering work where prior runs contain reusable mechanisms. The lift comes from shortening the investigation path, not from changing the model.

paired outcome chart

Resolved patches

Baseline agent: 63%

TraceBase attached: 83%

Efficiency index

Cost after TraceBase: 57/100

Steps after TraceBase: 74/100

benchmark design

Real tasks. Paired replay.

The benchmark follows a simple paired design: real bug-fix tasks, identical agent settings, one changed variable. Instead of asking an LLM judge whether a patch looks good, the submitted patch is graded against the task's test oracle in an isolated container.

evaluation parameters

Dataset: SWE-bench Verified
Tasks: 40
Agent runs: 80 (40 baseline + 40 TraceBase)
Completed pairs: 40
Harness: mini-swe-agent v2.2.8 + Docker
Model control: Same model, temperature, tool budget, and Docker image in both arms
Step cap: 40 tool steps per task
Cost cap: $1.00 per task
Primary metric: Official resolved / unresolved
Efficiency metrics: Steps and spend on paired completed runs
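The "one changed variable" design can be made explicit in code. The sketch below is illustrative only: `ArmConfig`, its field names, and `changed_fields` are assumptions introduced here, not part of the TraceBase harness. The values come from the evaluation parameters above.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ArmConfig:
    """Settings for one benchmark arm (hypothetical field names)."""
    dataset: str = "SWE-bench Verified"
    harness: str = "mini-swe-agent v2.2.8 + Docker"
    tasks: int = 40
    step_cap: int = 40            # tool steps per task
    cost_cap_usd: float = 1.00    # spend cap per task
    recall_enabled: bool = False  # the only setting that differs between arms


def changed_fields(a: ArmConfig, b: ArmConfig) -> list[str]:
    """Return the names of fields whose values differ between two arms."""
    return sorted(f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name))


baseline = ArmConfig()
tracebase = replace(baseline, recall_enabled=True)  # copy with one change

total_runs = baseline.tasks + tracebase.tasks
print(changed_fields(baseline, tracebase), total_runs)
```

Deriving the second arm from the first with `replace` makes it hard to drift a cap or harness version in only one arm by accident.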

Start from real issues

40 verified bug-fix tasks with executable tests.

Run baseline

The agent solves with its normal tools and an empty memory layer.

Store resolved work

Successful runs are distilled into compact situation, mechanism, unlock, and verification notes.

Run with TraceBase

The same task distribution runs again with recall enabled and the same caps.

Grade by tests

A containerized grader marks the submitted patch resolved or unresolved.

results

The useful lift is not just tokens. It is trajectory compression.

TraceBase improved the resolved count by 8 tasks with 0 regressions. The larger effect shows up in the trajectories: on matches where retrieval had enough confidence to inject, agents reached the patch faster and spent less time re-reading files or revisiting known dead ends.
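The paired breakdown follows from the reported totals. With 25 tasks resolved at baseline, 33 resolved with TraceBase, and 0 regressions, the 40 pairs must split into 25 resolved in both arms, 8 new fixes, and 7 unresolved in both. A minimal sketch of that tally, using synthetic per-task outcomes rebuilt from those totals (not the real task list; `classify` is a name introduced here):

```python
from collections import Counter

# Synthetic (baseline_resolved, tracebase_resolved) pairs reconstructed from
# the reported totals. These are NOT the actual per-task results.
pairs = [(True, True)] * 25 + [(False, True)] * 8 + [(False, False)] * 7


def classify(baseline_resolved: bool, tracebase_resolved: bool) -> str:
    """Label one task pair by how its outcome changed between arms."""
    if baseline_resolved and tracebase_resolved:
        return "kept"
    if tracebase_resolved:
        return "new_fix"
    if baseline_resolved:
        return "regression"
    return "unresolved_both"


tally = Counter(classify(b, t) for b, t in pairs)
print(dict(tally))
```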

resolved rate

Condition	Resolved	Rate	Delta
Baseline agent	25 / 40	63%	-
TraceBase attached	33 / 40	83%	+20 pp

Relative lift: +32% over baseline. Absolute lift is the number we use in public copy.
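The two lift figures come from the same counts. A short worked calculation, using only the numbers in the table above (the exact rates are 62.5% and 82.5%, shown rounded as 63% and 83%):

```python
baseline_resolved, tracebase_resolved, tasks = 25, 33, 40

baseline_rate = baseline_resolved / tasks    # 0.625
tracebase_rate = tracebase_resolved / tasks  # 0.825

# Absolute lift: difference between the two rates, in percentage points.
absolute_lift_pp = (tracebase_rate - baseline_rate) * 100

# Relative lift: extra resolved tasks as a share of the baseline count.
relative_lift_pct = (tracebase_resolved - baseline_resolved) / baseline_resolved * 100

print(f"absolute: +{absolute_lift_pp:.0f} pp, relative: +{relative_lift_pct:.0f}%")
```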

efficiency on high-confidence matches (n=26)

Metric	Average	Best run
Cost reduction	-43%	-70%
Step reduction	-26%	-65%

best paired run: 31 -> 11 steps.
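The best-run step figure and the efficiency index in the paired outcome chart can be checked with simple arithmetic. The sketch assumes the index is scaled so that the baseline equals 100, which matches the reported 57/100 and 74/100 but is not stated explicitly in the report; the function names are introduced here.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


def efficiency_index(avg_reduction_pct: float) -> float:
    """Assumed reading of the chart: baseline = 100, index = 100 - reduction."""
    return 100 - avg_reduction_pct


best_run_steps = reduction_pct(31, 11)  # roughly 64.5, reported as -65%
cost_index = efficiency_index(43)       # average cost reduction of 43%
step_index = efficiency_index(26)       # average step reduction of 26%

print(f"{best_run_steps:.1f}% fewer steps; cost {cost_index}/100; steps {step_index}/100")
```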

mechanism attribution

Prior fix recalled: 44%

File rereads avoided: 24%

Loop / duplicate work cut: 20%

Shorter final patch path: 12%

Attribution is based on TraceBase event categories and paired trajectory inspection. It explains where the measured efficiency delta came from; it is not a separate accuracy claim.

quality scoring

What improved besides cost

We filtered scored results to tasks where at least one arm produced a meaningful patch. Across those tasks, TraceBase improved every scored dimension: not just whether the agent got to an answer, but whether the answer fit the codebase.

Chart: Baseline vs. TraceBase, change by scored dimension

Correctness: +32%

Completeness: +46%

Best practices: +52%

Code reuse: +75%

Architectural fit: +58%

Correctness +32%

Correctness asks the blunt question: did the patch actually fix the issue? TraceBase wins when prior mechanisms steer the agent away from plausible but wrong edits.

Completeness +46%

Completeness measures whether the agent covered the edge case, the call path, and the verification path rather than stopping at the first green-looking change.

Best practices +52%

The best-practices score measures whether the patch follows the project shape. Memory helps because it carries the local way this codebase expects fixes to land.

Code reuse +75%

Code reuse is where repeated work compounds fastest: the agent reaches known helpers, known tests, and known failure paths instead of rediscovering them.

Architectural fit +58%

Architectural fit captures whether the solution belongs in the system, not just whether it passes locally. TraceBase gives the agent prior shape, not a blind patch.

why it works

The agent already knows how to code. It forgets what it learned.

Most repeated-agent waste is not raw model weakness. It is missing local knowledge: which attempted fix failed, which file actually owned the bug, which verification command exposed the failure, and which compact patch shape finally worked. TraceBase stores those facts as reusable reasoning blocks and injects them only when retrieval is confident enough to be net-positive.
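One way to read "confident enough to be net-positive" is as an expected-value test. The sketch below is an assumption about the shape of such a gate, not TraceBase's actual rule: the function names, the dollar figures, and the idea of charging a penalty for an unhelpful memory are all hypothetical.

```python
def expected_net_value(p_helpful: float, saving_if_helpful: float,
                       cost_if_unhelpful: float, injection_cost: float) -> float:
    """Expected saving from injecting a memory, in the same unit as the costs."""
    gain = p_helpful * saving_if_helpful
    loss = (1 - p_helpful) * cost_if_unhelpful
    return gain - loss - injection_cost


def should_inject(p_helpful: float, saving_if_helpful: float = 0.20,
                  cost_if_unhelpful: float = 0.05, injection_cost: float = 0.01) -> bool:
    """Inject only when the expected net value is positive (hypothetical defaults, USD)."""
    return expected_net_value(p_helpful, saving_if_helpful,
                              cost_if_unhelpful, injection_cost) > 0


print(should_inject(0.9), should_inject(0.1))
```

With these illustrative defaults the break-even confidence is 0.24: below it, the tokens spent on injection and the risk of a misleading hint outweigh the expected saving.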

memory payload

situation: reconnect race after flaky network blip

mechanism: stale sequence buffer accepts old open-event path

unlock: dedupe by sequence id before reopening socket

verification: failing case reproduced + relevant test module

The key is that TraceBase does not shove an entire prior transcript into context. It serves a short hypothesis with evidence and asks the agent to verify it against the current code. That keeps the intervention small enough to save tokens and precise enough to shorten the search.
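The four-field payload maps naturally onto a small record type. This is a sketch of one possible representation: the `MemoryBlock` class, the `render` method, and the wording of the header line are assumptions for illustration, while the field names and example values are taken from the payload above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryBlock:
    """Compact memory distilled from one resolved run (hypothetical type)."""
    situation: str
    mechanism: str
    unlock: str
    verification: str

    def render(self) -> str:
        """Format the block as a short hypothesis for injection into the prompt."""
        return "\n".join([
            "Hypothesis from prior resolved work (verify against current code before editing):",
            f"situation: {self.situation}",
            f"mechanism: {self.mechanism}",
            f"unlock: {self.unlock}",
            f"verification: {self.verification}",
        ])


block = MemoryBlock(
    situation="reconnect race after flaky network blip",
    mechanism="stale sequence buffer accepts old open-event path",
    unlock="dedupe by sequence id before reopening socket",
    verification="failing case reproduced + relevant test module",
)
text = block.render()
print(text)
```

The rendered block is five lines long, which is the point: the agent gets a claim to check, not a transcript to reread.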

architecture

Recall is gated. Bad memories are demoted.

The store is project-scoped SQLite. Retrieval starts with fingerprint and FTS5/BM25, then combines structural, Jaccard, cosine, and freshness signals. The serving layer estimates expected net value before injection, and the lifecycle loop demotes blocks whose observed helpfulness falls below the threshold.
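To make the signal blending concrete, here is a minimal sketch of combining Jaccard, cosine, and freshness into one score. It leaves out the fingerprint, FTS5/BM25, and structural stages, and every specific in it is an assumption: the whitespace tokenizer, cosine over word counts rather than embeddings, the 30-day half-life, and the weights.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Overlap of two token sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def freshness(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a new memory, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def blended_score(query: str, memory: str, age_days: float,
                  weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Weighted sum of the three signals (hypothetical weights)."""
    q, m = query.lower().split(), memory.lower().split()
    w_jaccard, w_cosine, w_fresh = weights
    return (w_jaccard * jaccard(set(q), set(m))
            + w_cosine * cosine(Counter(q), Counter(m))
            + w_fresh * freshness(age_days))


query = "reconnect race after network blip"
related = blended_score(query, "reconnect race after flaky network blip", age_days=10)
unrelated = blended_score(query, "date parser rejects timezone offset", age_days=10)
print(related > unrelated)
```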

serving path

prompt -> candidate recall -> calibrated gate -> compact injection -> agent run -> outcome event

outcome event -> helpfulness attribution -> weight update -> demote / keep / merge
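The second line of the serving path, from outcome event to demote or keep, can be sketched as a small update rule. This is an illustration under stated assumptions, not the TraceBase lifecycle loop: the exponential moving average, the 0.3 update rate, the 0.25 threshold, and the names `BlockState` and `record_outcome` are all hypothetical, and the merge step is left out.

```python
from dataclasses import dataclass


@dataclass
class BlockState:
    """Running helpfulness estimate for one memory block (hypothetical type)."""
    helpfulness: float = 0.5
    status: str = "keep"


def record_outcome(state: BlockState, helped: bool,
                   rate: float = 0.3, demote_below: float = 0.25) -> BlockState:
    """Move the estimate toward the observed outcome, then demote or keep."""
    target = 1.0 if helped else 0.0
    state.helpfulness += rate * (target - state.helpfulness)
    state.status = "demote" if state.helpfulness < demote_below else "keep"
    return state


state = BlockState()
record_outcome(state, helped=False)  # 0.5 -> 0.35, still kept
first_status = state.status
record_outcome(state, helped=False)  # 0.35 -> 0.245, falls below the threshold
second_status = state.status
print(first_status, second_status)
```

A moving average lets one bad outcome dent a block without killing it, while repeated misses push it below the threshold.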

scope

Where the lift applies. Repeated work is the wedge.

The benchmark measures repeated engineering work: tasks where past resolved runs can provide reusable mechanism-level memory.
Efficiency averages are reported on high-confidence matches, where TraceBase had enough evidence to inject.
The measured benchmark is 40 tasks / 80 agent runs. The claim is scoped to repeated engineering work.
Memory is injected as a hypothesis. The agent still has to verify the current code before editing.
Cost savings vary with model pricing, tool-output volume, and how repetitive the team's work actually is.

method

How the benchmark was run. One changed variable.

Each task was evaluated as a pair. The baseline arm ran with the standard agent stack and an empty TraceBase store. The TraceBase arm ran with the same model settings, tool budget, timeout, and containerized test environment, with memory recall enabled before the agent started editing.

comparison contract

Task source: Verified bug-fix tasks with executable tests
Arms: Baseline agent vs. TraceBase attached
Controls: Same model settings, tool budget, timeout, and grader
Memory input: Only compact memories distilled from resolved work
Primary outcome: Patch passes the task's test oracle
Efficiency outcome: Tool steps and spend on matched completed pairs

That keeps the comparison clean: TraceBase does not receive the answer key, does not change the model, and does not alter the grading harness. It changes only what the agent remembers before it starts the next run.
