Reasoning Injection for AI Agent Efficiency
TraceBase Technical Report · April 2026 · v1.0
Overview
TraceBase captures proven reasoning patterns from AI agent runs and injects them into future runs at the point of recall. The system uses multi-signal retrieval (BM25, Jaccard, structural, cosine, temporal freshness) with adaptive weights learned via Thompson Sampling to match relevant patterns to incoming tasks.
Patterns are stored in a compressed 3-field format: the situation the agent encountered, the dead ends it explored, and the unlock that led to the correct solution. This format was selected based on research into chain-of-thought compression (C3oT, arxiv 2412.11664) and token-budget-aware reasoning (TALE, arxiv 2412.18547).
This document explains the benchmark methodology, how metrics are derived, and what the results mean for real-world agent deployments.
1. Evaluation Setup
All benchmarks were run on SWE-bench Verified, a curated subset of real GitHub issues from popular open-source Python repositories. Each problem requires the agent to diagnose a bug from an issue description and produce a working patch, executed in a Docker container via mini-swe-agent v2.2.8.
Eval Parameters
The evaluation uses a multi-round methodology: Round 0 (baseline) solves tasks with an empty knowledge base. Successful patches become traces in the KB. Round 1 solves the same tasks with this KB — simulating the compound intelligence effect that occurs in production as agents accumulate institutional knowledge. Both rounds use identical step limits, cost limits, and Docker environments.
2. Results — SWE-bench Verified
Accuracy — Sonnet 4.6 (16 tasks)
| Condition | Patches Submitted | Accuracy | Gain |
|---|---|---|---|
| Baseline (no injection) | 10/16 | 62% | — |
| + TraceBase | 12/16 | 75% | +20% |
2 new fixes (astropy-13579, astropy-14508). Zero regressions.
Efficiency — High-Confidence Matches (10 tasks)
| Metric | Average | Peak |
|---|---|---|
| Step reduction | 17% | Up to 45% |
| Cost reduction | 34% | Up to 49% |
Additional Benchmark — TypeScript Fixtures (6 models)
| Model | Step Save | Avg Token Save | Peak Token Save |
|---|---|---|---|
| Claude Haiku 4.5 | +5% | 6% | Up to 48% |
| Claude Sonnet 4.6 | +25% | 31% | Up to 39% |
| Claude Opus 4.6 | +25% | 30% | Up to 39% |
| GPT-5.4-nano | 0% | 13% | Up to 33% |
| GPT-5.4-mini | +8% | 25% | Up to 50% |
| GPT-5.3-chat | +25% | 44% | Up to 52% |
10 TypeScript fixtures with vitest verification. 100% accuracy maintained across all models.
The largest efficiency gains were on GPT-5.3-chat (+44% avg token save) and Claude Sonnet/Opus (+25% step save). On SWE-bench Verified, the best single task (astropy-14309) went from 31 steps to 13 steps — a 58% step reduction and 64% cost reduction.
3. Cost Savings Methodology
Cost savings come from two mechanisms: fewer agent steps (the model reaches the correct solution faster) and shorter reasoning per step(the model doesn't explore dead ends it would have otherwise). Estimated dollar savings are calculated from the observed step and token reduction and current model pricing.
estimated_savings = tasks_with_injection × avg_cost_saved_per_task
avg_cost_saved_per_task = baseline_cost × avg_cost_reduction_rate
Example — Sonnet 4.6 at scale
4. Why It Works
AI agents fail not because the model lacks ability, but because they re-explore dead ends on every call. The 3-field pattern format encodes: the situation the agent encountered, the dead ends to avoid, and the unlock that led to the correct solution. This steers the model past failure modes it would have otherwise explored, reducing both wasted steps and incorrect outputs.
Key research principles behind the injection format:
- Compressed directives under 60 tokens — shorter chains are more likely correct (arxiv 2505.17813)
- First-message injection — avoids token multiplication from context rot across steps (arxiv 2510.05381)
- Positive constraintsover negative framing — “the bug is X, fix: Y” not “do not try A, B, C” (arxiv 2601.18044)
- Skip-to-fix strategy when prior knowledge is available — plan-and-act instead of explore-first (arxiv 2503.09572)
The pattern library compounds over time — patterns that work for one team's agents improve results for everyone on the platform. As more agents use the system, the library grows, match quality improves, and the confidence gate fires on a higher percentage of tasks.
5. Technical Architecture
TraceBase uses a 6-signal multi-stage retrieval engine:
| Signal | Type | What It Catches |
|---|---|---|
| Fingerprint | Exact | Identical problem (O(1) lookup) |
| BM25 | Lexical | Same keywords, different phrasing |
| Jaccard | Token overlap | Structural keyword matching |
| Structural | Feature | Same error type / language / framework |
| Cosine | Semantic | Embedding similarity (optional) |
| Freshness | Temporal | Recency bias (exponential decay) |
Signal weights are learned from user feedback via Thompson Sampling (Agrawal & Goyal, 2012). Quality scoring uses the Wilson score interval lower bound. The system is fully local-first (SQLite + WAL), zero external dependencies, with optional cloud sync.
6. Limitations
- Benchmarks were run on SWE-bench Verified (Python/astropy). Results on other languages and domains may differ.
- Cost savings vary by model, task complexity, and the quality of the retrieved pattern match.
- The match rate (~55% on this benchmark) depends on pattern library coverage for a given problem domain. Teams running agents on repetitive domain-specific tasks typically see higher match rates as the library accumulates relevant patterns.
- Step and cost reductions are measured on tasks where both baseline and augmented agents submitted patches. Tasks where only one condition submitted are excluded from efficiency calculations.
7. Reproducibility
TraceBase is open source (MIT). All benchmark code, fixtures, seeds, and raw trajectory data are available in the repository:
eval/swebench/ — SWE-bench Verified harness + results
eval/agentic/ — TypeScript fixture benchmark + results
eval/tasks/ — Task definitions + seed traces
To reproduce:
pip install mini-swe-agent
npx tsx eval/agentic/runner.ts --all # TypeScript benchmark
bash eval/swebench/run-benchmark.sh # SWE-bench Verified