TraceBase Technical Report - v1.1
Coding agents do not need more context. They need remembered work.
This report measures TraceBase as a repeated-work layer: the first run solves from scratch, the second run receives only compact memories from prior solved work. The question is not whether the model is smarter. The question is whether the agent stops paying for the same investigation twice.
| Metric | Change | Detail |
|---|---|---|
| Resolved rate | +20 pp | 63% -> 83% |
| Avg cost | -43% | high-confidence matches, n=26 |
| Avg steps | -26% | 31 -> 11 on best run |
| Regressions | 0 | 8 new fixes |
summary
Measured benchmark. Hard grader, clean lift.
The benchmark uses 40 SWE-bench Verified tasks and 80 total agent runs: each task runs once as a baseline and once with TraceBase attached. All 40 pairs completed cleanly. Across them, the resolved rate moved from 63% (25/40) to 83% (33/40), a gain of 20 percentage points. On the 26 high-confidence matches, the agent spent 43% less and took 26% fewer steps on average.
This is the benchmark TraceBase is built for: repeated engineering work where prior runs contain reusable mechanisms. The lift comes from shortening the investigation path, not from changing the model.
[Figure: paired outcome chart. Panels: Resolved patches; Efficiency index.]
benchmark design
Real tasks. Paired replay.
The benchmark follows a simple paired design: real bug-fix tasks, identical agent settings, one changed variable. Instead of asking an LLM judge whether a patch looks good, the submitted patch is graded against the task's test oracle in an isolated container.
evaluation parameters
- Dataset: SWE-bench Verified
- Tasks: 40
- Agent runs: 80 (40 baseline + 40 TraceBase)
- Completed pairs: 40
- Harness: mini-swe-agent v2.2.8 + Docker
- Model control: Same model, temperature, tool budget, and Docker image in both arms
- Step cap: 40 tool steps per task
- Cost cap: $1.00 per task
- Primary metric: Official resolved / unresolved
- Efficiency metrics: Steps and spend on paired completed runs
01. Start from real issues: 40 verified bug-fix tasks with executable tests.
02. Run baseline: The agent solves with its normal tools and an empty memory layer.
03. Store resolved work: Successful runs are distilled into compact situation, mechanism, unlock, and verification notes.
04. Run with TraceBase: The same 40 tasks run again with recall enabled and the same caps.
05. Grade by tests: A containerized grader marks the submitted patch resolved or unresolved.
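The five steps above can be sketched as a small paired-replay loop. This is a minimal illustration under stated assumptions, not the TraceBase harness: `RunResult`, `Agent`, `paired_replay`, `regressions`, and the `distill` callback are hypothetical names, and memory is looked up by task id for brevity, whereas the report describes similarity-based recall.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    task_id: str
    resolved: bool   # assumed to be set by the containerized test grader (step 05)
    steps: int
    cost_usd: float


# Hypothetical agent interface: (task_id, optional memory text) -> graded result.
Agent = Callable[[str, Optional[str]], RunResult]


def paired_replay(task_ids, agent, distill):
    """Run every task twice: once with no memory, once with recalled memory."""
    task_ids = list(task_ids)
    # Step 02: baseline arm with an empty memory layer.
    baseline = {t: agent(t, None) for t in task_ids}
    # Step 03: only resolved runs are distilled into compact notes.
    store = {t: distill(r) for t, r in baseline.items() if r.resolved}
    # Step 04: same tasks again; memory is passed only where a note exists.
    with_memory = {t: agent(t, store.get(t)) for t in task_ids}
    return baseline, with_memory


def regressions(baseline, with_memory):
    """Tasks resolved in the baseline arm but unresolved with memory attached."""
    return [t for t in baseline
            if baseline[t].resolved and not with_memory[t].resolved]
```

Keeping the agent behind a single callable makes the one changed variable explicit: the only argument that differs between arms is the memory text.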
results
The useful lift is not just tokens. It is trajectory compression.
TraceBase improved the resolved count by 8 tasks with 0 regressions. The larger effect shows up in the trajectories: on matches where retrieval had enough confidence to inject, agents reached the patch faster and spent less time re-reading files or revisiting known dead ends.
resolved rate
| Condition | Resolved | Rate | Delta |
|---|---|---|---|
| Baseline agent | 25 / 40 | 63% | - |
| TraceBase attached | 33 / 40 | 83% | +20 pp |
Relative lift: +32% over baseline. Absolute lift is the number we use in public copy.
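The arithmetic behind the table can be checked in a few lines. The counts (25, 33, 40) come from the table; the function and variable names are illustrative only.

```python
def resolved_rate(resolved: int, total: int) -> float:
    """Resolved rate as a percentage."""
    return 100.0 * resolved / total


baseline_rate = resolved_rate(25, 40)    # 62.5, reported as 63%
tracebase_rate = resolved_rate(33, 40)   # 82.5, reported as 83%

# Absolute lift is a difference of percentages, so it is in percentage points.
absolute_lift_pp = tracebase_rate - baseline_rate    # 20.0

# Relative lift compares the extra resolved tasks to the baseline count.
relative_lift_pct = 100.0 * (33 - 25) / 25           # 32.0
```

The two numbers answer different questions, which is why the report states both and labels them separately.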
efficiency on high-confidence matches (n=26)
| Metric | Average | Best run |
|---|---|---|
| Cost reduction | -43% | -70% |
| Step reduction | -26% | -65% |
Best paired run: 31 -> 11 steps (about -65%).
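A short sketch of how a per-pair reduction can be computed. The 31 -> 11 pair is from the report; `pct_reduction` and `mean_reduction` are hypothetical helpers, and averaging per-pair percentages (rather than comparing totals) is an assumption, since the report does not say which aggregation it uses.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from the baseline value to the TraceBase value."""
    return 100.0 * (before - after) / before


def mean_reduction(pairs) -> float:
    """Average of per-pair reductions over (baseline, tracebase) pairs."""
    pairs = list(pairs)
    return sum(pct_reduction(b, a) for b, a in pairs) / len(pairs)


# Best paired run from the report: 31 steps down to 11, roughly a 65% reduction.
best_step_reduction = pct_reduction(31, 11)
```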
mechanism attribution
Attribution is based on TraceBase event categories and paired trajectory inspection. It explains where the measured efficiency delta came from; it is not a separate accuracy claim.
quality scoring
What improved besides cost
We filtered scored results to tasks where at least one arm produced a meaningful patch. Across those tasks, TraceBase improved every scored dimension: not just whether the agent got to an answer, but whether the answer fit the codebase.
Correctness +32%
Correctness asks the blunt question: did the patch actually fix the issue? TraceBase wins when prior mechanisms steer the agent away from plausible but wrong edits.
Completeness +46%
Completeness measures whether the agent covered the edge case, the call path, and the verification path rather than stopping at the first green-looking change.
Best practices +52%
Best practices score whether the patch follows the project shape. Memory helps because it carries the local way this codebase expects fixes to land.
Code reuse +75%
Code reuse is where repeated work compounds fastest: the agent reaches known helpers, known tests, and known failure paths instead of rediscovering them.
Architectural fit +58%
Architectural fit captures whether the solution belongs in the system, not just whether it passes locally. TraceBase gives the agent prior shape, not a blind patch.
why it works
The agent already knows how to code. It forgets what it learned.
Most repeated-agent waste is not raw model weakness. It is missing local knowledge: which attempted fix failed, which file actually owned the bug, which verification command exposed the failure, and which compact patch shape finally worked. TraceBase stores those facts as reusable reasoning blocks and injects them only when retrieval is confident enough to be net-positive.
memory payload
situation: reconnect race after flaky network blip
mechanism: stale sequence buffer accepts old open-event path
unlock: dedupe by sequence id before reopening socket
verification: failing case reproduced + relevant test module
The key is that TraceBase does not shove an entire prior transcript into context. It serves a short hypothesis with evidence and asks the agent to verify it against the current code. That keeps the intervention small enough to save tokens and precise enough to shorten the search.
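One way to picture the payload is as a four-field record rendered into a short prompt fragment. This is a sketch under assumptions: the field names follow the payload shown above, but the `MemoryBlock` class, the `render` method, and the header wording are hypothetical, not TraceBase's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryBlock:
    situation: str
    mechanism: str
    unlock: str
    verification: str

    def render(self) -> str:
        """Render the block as a compact hypothesis, not as an instruction."""
        lines = [
            "Prior-run hypothesis (verify against the current code before editing):",
            f"situation: {self.situation}",
            f"mechanism: {self.mechanism}",
            f"unlock: {self.unlock}",
            f"verification: {self.verification}",
        ]
        return "\n".join(lines)


block = MemoryBlock(
    situation="reconnect race after flaky network blip",
    mechanism="stale sequence buffer accepts old open-event path",
    unlock="dedupe by sequence id before reopening socket",
    verification="failing case reproduced + relevant test module",
)
injection = block.render()
```

The rendered text is five short lines, which is the point: the injection stays far smaller than a replayed transcript.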
architecture
Recall is gated. Bad memories are demoted.
The store is project-scoped SQLite. Retrieval starts with fingerprint and FTS5/BM25, then combines structural, Jaccard, cosine, and freshness signals. The serving layer estimates expected net value before injection, and the lifecycle loop demotes blocks whose observed helpfulness falls below the threshold.
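The signal-combination stage might look like the sketch below. It assumes a simple weighted sum; the weights, the 30-day half-life, and all function names are illustrative assumptions rather than TraceBase's actual values, and the first-stage fingerprint and FTS5/BM25 candidate lookup is omitted.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Token-set overlap: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def freshness(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: a block loses half its freshness every half-life."""
    return 0.5 ** (age_days / half_life_days)


def blended_score(structural, jac, cos, fresh, weights=(0.3, 0.2, 0.3, 0.2)):
    """Weighted sum of the four signals, each assumed to lie in [0, 1]."""
    signals = (structural, jac, cos, fresh)
    return sum(w * s for w, s in zip(weights, signals))
```

A linear blend is the simplest choice that keeps each signal's contribution inspectable; a learned or calibrated combiner would be a reasonable alternative.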
serving path
prompt -> candidate recall -> calibrated gate -> compact injection -> agent run -> outcome event
outcome event -> helpfulness attribution -> weight update -> demote / keep / merge
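The gate and the lifecycle loop in the serving path can be sketched as follows. Everything here is an assumption made for illustration: the expected-value formula, the exponential moving average update, the 0.3 demotion threshold, and the function names. The merge action is left out.

```python
def expected_net_value(p_helpful, saving_if_helpful, cost_if_unhelpful, injection_cost):
    """Expected saving from injecting, minus expected harm and the injection's own cost."""
    return (p_helpful * saving_if_helpful
            - (1.0 - p_helpful) * cost_if_unhelpful
            - injection_cost)


def should_inject(p_helpful, saving_if_helpful, cost_if_unhelpful, injection_cost):
    """Gate: inject only when the estimate is net-positive."""
    return expected_net_value(
        p_helpful, saving_if_helpful, cost_if_unhelpful, injection_cost) > 0.0


def update_helpfulness(current, outcome, rate=0.2):
    """Moving average of observed helpfulness; outcome is 1.0 (helped) or 0.0 (did not)."""
    return (1.0 - rate) * current + rate * outcome


def lifecycle_action(helpfulness, demote_below=0.3):
    """Demote blocks whose observed helpfulness falls below the threshold."""
    return "demote" if helpfulness < demote_below else "keep"
```

For example, a block with an estimated 80% chance of helping, a saving of 10 units, a penalty of 5, and an injection cost of 1 has an expected net value of about 6 and passes the gate.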
scope
Where the lift applies. Repeated work is the wedge.
- The benchmark measures repeated engineering work: tasks where past resolved runs can provide reusable mechanism-level memory.
- Efficiency averages are reported on high-confidence matches, where TraceBase had enough evidence to inject.
- The measured benchmark is 40 tasks / 80 agent runs, so the claim is scoped to that sample and to repeated engineering work.
- Memory is injected as a hypothesis. The agent still has to verify the current code before editing.
- Cost savings vary with model pricing, tool-output volume, and how repetitive the team's work actually is.
method
How the benchmark was run. One changed variable.
Each task was evaluated as a pair. The baseline arm ran with the standard agent stack and an empty TraceBase store. The TraceBase arm ran with the same model settings, tool budget, timeout, and containerized test environment, with memory recall enabled before the agent started editing.
comparison contract
- Task source: Verified bug-fix tasks with executable tests
- Arms: Baseline agent vs. TraceBase attached
- Controls: Same model settings, tool budget, timeout, and grader
- Memory input: Only compact memories distilled from resolved work
- Primary outcome: Patch passes the task's test oracle
- Efficiency outcome: Tool steps and spend on matched completed pairs
That keeps the comparison clean: TraceBase does not receive the answer key, does not change the model, and does not alter the grading harness. It changes only what the agent remembers before it starts the next run.
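The "one changed variable" contract can be expressed as a configuration check. This is a hedged sketch: `ArmConfig` and `changed_fields` are hypothetical, the model name, temperature, and image tag are placeholders, and only the 40-step and $1.00 caps come from the report.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ArmConfig:
    model: str
    temperature: float
    step_cap: int
    cost_cap_usd: float
    docker_image: str
    memory_enabled: bool


def changed_fields(a: ArmConfig, b: ArmConfig):
    """Names of the fields whose values differ between two arms."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


# Placeholder values except the caps, which match the evaluation parameters.
baseline_arm = ArmConfig(
    model="example-model",
    temperature=0.0,
    step_cap=40,
    cost_cap_usd=1.00,
    docker_image="example-image:tag",
    memory_enabled=False,
)
tracebase_arm = replace(baseline_arm, memory_enabled=True)

# The comparison is clean only if memory is the single difference.
assert changed_fields(baseline_arm, tracebase_arm) == ["memory_enabled"]
```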