PyTorch AI Agent Token Efficiency Benchmark

Generated: 2026-02-14T08:53:38.505639+00:00

What This Measures

Baseline (without cgrep): locate with grep, then expand snippets in incremental tiers.
With cgrep: one agent locate call, then incremental agent expand calls by ID, tier by tier.
Completion rule: a scenario is complete when every marker group has at least one match in the cumulative tool outputs.
Primary metric: cumulative tokens consumed until completion (tokens-to-complete).
Tokenizer: OpenAI cl100k_base when available; otherwise a bytes/4 approximation.
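
The completion rule, the primary metric, and the tokenizer fallback described above can be sketched as follows. This is a minimal illustration, not the benchmark script's actual code: the function names are hypothetical, marker matching is assumed to be plain substring search, and rounding the bytes/4 estimate up is also an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple


def approx_tokens(text: str) -> int:
    """Fallback estimate: UTF-8 byte length / 4 (rounding up is an assumption)."""
    return (len(text.encode("utf-8")) + 3) // 4


def make_token_counter() -> Callable[[str], int]:
    """Count with tiktoken's cl100k_base if it loads, else use the fallback."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding("cl100k_base")
    except Exception:
        return approx_tokens
    # disallowed_special=() treats special-token text in tool output as plain text.
    return lambda text: len(encoding.encode(text, disallowed_special=()))


def tokens_to_complete(
    tier_outputs: Sequence[str],
    marker_groups: Sequence[Sequence[str]],
    count_tokens: Callable[[str], int],
) -> Optional[Tuple[int, int]]:
    """Return (cumulative tokens, tiers consumed) at completion, or None.

    A scenario counts as complete once every marker group has at least one
    marker present in the tool outputs accumulated so far.
    """
    seen = ""
    total = 0
    for tier, output in enumerate(tier_outputs, start=1):
        total += count_tokens(output)
        seen += output + "\n"
        if all(any(marker in seen for marker in group) for group in marker_groups):
            return total, tier
    return None


if __name__ == "__main__":
    # Hypothetical marker groups and tier outputs, for illustration only.
    groups = [["evaluate_function"], ["Engine::", "engine.cpp"]]
    tiers = ["void Engine::thread_main()", "void Engine::evaluate_function(...)"]
    print(tokens_to_complete(tiers, groups, make_token_counter()))
```

Passing the counter in as a parameter keeps the completion logic independent of whether cl100k_base is available, so the same loop can serve both the baseline and the cgrep run.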

Scenario	Baseline done	cgrep done	Baseline attempts	cgrep attempts	Baseline tokens-to-complete	cgrep tokens-to-complete	Reduction	Baseline latency (ms)	cgrep latency (ms)
Find where autograd engine evaluate_function is implemented and inspected.	yes	yes	1	1	7,024	939	86.6%	1907.61	21.58
Find TensorIterator definition and major implementation usage points.	yes	yes	1	1	43,255	1,026	97.6%	1170.75	21.34
Locate PythonArgParser implementation and usage points.	yes	yes	1	1	6,740	1,000	85.2%	1079.13	22.37
Understand DispatchKeySet representation and references.	yes	yes	1	1	43,740	1,028	97.6%	1057.25	23.04
Locate CUDAGraph implementation-related code quickly.	yes	yes	1	1	11,217	1,018	90.9%	1092.60	23.76
Find addmm implementation and call sites.	yes	yes	1	1	15,689	1,142	92.7%	1620.41	24.21
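
The Reduction column is consistent with 1 - (cgrep tokens / baseline tokens), rounded to one decimal place. The report does not state the formula, so the sketch below assumes it and recomputes the column from the token counts in the table; the short scenario labels are abbreviations introduced here.

```python
# (scenario label, baseline tokens-to-complete, cgrep tokens-to-complete),
# copied from the table above.
ROWS = [
    ("evaluate_function", 7024, 939),
    ("TensorIterator", 43255, 1026),
    ("PythonArgParser", 6740, 1000),
    ("DispatchKeySet", 43740, 1028),
    ("CUDAGraph", 11217, 1018),
    ("addmm", 15689, 1142),
]


def reduction_percent(baseline_tokens: int, cgrep_tokens: int) -> float:
    """Percentage of baseline tokens saved (assumed definition of Reduction)."""
    return 100.0 * (1.0 - cgrep_tokens / baseline_tokens)


for label, baseline, cgrep in ROWS:
    print(f"{label:<18} {reduction_percent(baseline, cgrep):.1f}%")
```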

Run the benchmark:

python3 scripts/benchmark_agent_token_efficiency.py --repo /path/to/pytorch

Run it with a history directory:

python3 scripts/benchmark_agent_token_efficiency.py --repo /path/to/pytorch --history-dir local/benchmarks/history

Example crontab entry (every Monday at 03:00):

0 3 * * 1 cd /path/to/cgrep && python3 scripts/benchmark_agent_token_efficiency.py --repo /path/to/pytorch --history-dir local/benchmarks/history