PyTorch Codex Agent Efficiency Benchmark
Generated: 2026-02-22T08:23:14.041127+00:00
What This Measures
- Real `codex exec` runs on a local PyTorch repository.
- Baseline mode: autonomous retrieval with cgrep disallowed.
- cgrep mode: cgrep command usage required.
- Primary metric: Codex provider-reported billable tokens (`input - cached_input + output`).
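The primary metric can be written as a one-line helper. This is a minimal sketch: the function and field names are hypothetical, the sample numbers are illustrative rather than taken from any run, and it assumes cached input tokens are counted as a subset of input tokens, which the formula implies.

```python
def billable_tokens(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> int:
    """Billable tokens as defined in this report: input - cached_input + output."""
    # Assumption: cached input is a subset of input, so it can never exceed it.
    if cached_input_tokens > input_tokens:
        raise ValueError("cached input tokens cannot exceed input tokens")
    return input_tokens - cached_input_tokens + output_tokens


# Hypothetical usage record (illustrative values, not benchmark data).
usage = {"input_tokens": 17_000, "cached_input_tokens": 14_000, "output_tokens": 400}
print(billable_tokens(**usage))  # 3400
```

A heavily cached run can therefore report a large total token count but a small billable count, which is why the tables list both.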
Scenario Set
- `autograd_evaluate_function`
- `tensor_iterator_impl`
- `python_arg_parser_impl`
- `dispatch_key_set`
- `cuda_graph`
- `addmm_path`
For each scenario:
- Success requires all marker groups to be satisfied from returned evidence.
- Baseline allows only grep/rg/sed/cat/head/tail/git commands.
- Baseline prompt includes a single focused rg starter hint (grep_pattern) per scenario.
- cgrep mode requires cgrep search|s or cgrep definition|d commands.
- cgrep prompt includes scenario-specific high-signal starter commands (cgrep_commands) and recommends scoped compact output (--format json2 --compact).
- Disallowed command usage or missing required tool usage marks the run as failed.
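The pass/fail rules above could be checked along these lines. This is a sketch under stated assumptions, not the benchmark script's actual logic: all function names are hypothetical, a marker group is assumed to be satisfied when any one of its markers appears in the evidence, only the first token of each command is inspected (pipelines are not parsed), and cgrep mode is only checked for the required command, since the report does not say which other commands it permits.

```python
import shlex

# Commands the baseline mode allows, as listed in this report.
BASELINE_ALLOWED = {"grep", "rg", "sed", "cat", "head", "tail", "git"}
# cgrep subcommands that count as required usage in cgrep mode.
CGREP_REQUIRED = {"search", "s", "definition", "d"}


def tool_policy_ok(mode: str, commands: list[str]) -> bool:
    """Return True if the executed commands satisfy the mode's tool rules."""
    parsed = [shlex.split(c) for c in commands]
    if mode == "baseline":
        # Every command must start with an allowed tool.
        return all(p and p[0] in BASELINE_ALLOWED for p in parsed)
    # cgrep mode: at least one `cgrep search|s` or `cgrep definition|d` call.
    return any(len(p) >= 2 and p[0] == "cgrep" and p[1] in CGREP_REQUIRED for p in parsed)


def markers_ok(evidence: str, marker_groups: list[list[str]]) -> bool:
    """Assumed semantics: each group needs at least one marker in the evidence."""
    return all(any(m in evidence for m in group) for group in marker_groups)


def run_succeeded(mode, commands, evidence, marker_groups) -> bool:
    return tool_policy_ok(mode, commands) and markers_ok(evidence, marker_groups)


# Illustrative inputs only; the real marker groups are not shown in this report.
groups = [["evaluate_function"], ["engine.cpp", "Engine::"]]
evidence = "torch/csrc/autograd/engine.cpp: Engine::evaluate_function"
print(run_succeeded("baseline", ["rg -n evaluate_function torch/csrc"], evidence, groups))
print(run_succeeded("baseline", ["cgrep search evaluate_function"], evidence, groups))
```

The first call passes both checks; the second fails because a baseline run used a disallowed command.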
Single-run variance can be high. Prefer `--runs >= 2` and compare medians for release decisions.
Environment
- OS: `macOS-26.3-arm64-arm-64bit`
- Python: `3.12.4`
- codex model: `gpt-5-codex`
- reasoning effort: `medium`
- runs per scenario/mode: `2`
- cgrep commit: `7445b45`
- pytorch commit: `66e77ae932c`
- PyTorch files (`git ls-files`): `21634`
Aggregate (All Cases)
| Mode | Cases | Success rate | Median billable tokens | P95 billable tokens | Median total tokens | Median duration (ms) | Median commands |
|---|---|---|---|---|---|---|---|
| baseline | 12 | 91.7% | 6234 | 34876 | 29722 | 15574.8 | 2.0 |
| cgrep | 12 | 100.0% | 3858 | 14537 | 26789 | 14450.5 | 2.0 |
- Total billable tokens (baseline, no cgrep): 151,466
- Total billable tokens (cgrep): 69,874
- Billable token reduction: 53.9%
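The aggregate figures can be recomputed from the per-run billable token counts in this report. The sketch below assumes P95 uses linear interpolation between order statistics; that choice is an assumption about the generating script, though it does reproduce the reported values.

```python
from statistics import median

# Billable tokens per run, copied from the Per Scenario table (runs 1 and 2).
baseline = [3371, 14514, 6424, 13112, 5491, 5583,
            3352, 22832, 6044, 16601, 4545, 49597]
cgrep = [2833, 21936, 3910, 4716, 3312, 7932,
         1838, 3806, 5755, 3417, 1936, 8483]


def percentile(values, q):
    """q-th percentile with linear interpolation (assumed method)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(values):
    return {
        "median": median(values),
        "p95": round(percentile(values, 95)),
        "total": sum(values),
    }


base_stats = summarize(baseline)
cgrep_stats = summarize(cgrep)
reduction_pct = 100 * (1 - cgrep_stats["total"] / base_stats["total"])

print(base_stats)
print(cgrep_stats)
print(f"Billable token reduction: {reduction_pct:.1f}%")  # 53.9%
```

Note that the baseline total is dominated by a single 49,597-token run (`addmm_path`, run 2), so the medians are the more robust comparison.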
Per Scenario
| Run | Scenario | Mode | Success | Billable tokens | Total tokens | Duration (ms) | Commands |
|---|---|---|---|---|---|---|---|
| 1 | autograd_evaluate_function | baseline | yes | 3,371 | 17,451 | 8637.0 | 1 |
| 1 | autograd_evaluate_function | cgrep | yes | 2,833 | 16,913 | 5620.8 | 1 |
| 1 | tensor_iterator_impl | baseline | yes | 14,514 | 74,162 | 19344.6 | 5 |
| 1 | tensor_iterator_impl | cgrep | yes | 21,936 | 70,576 | 42736.2 | 6 |
| 1 | python_arg_parser_impl | baseline | yes | 6,424 | 28,568 | 8865.1 | 2 |
| 1 | python_arg_parser_impl | cgrep | yes | 3,910 | 26,182 | 11144.9 | 2 |
| 1 | dispatch_key_set | baseline | yes | 13,112 | 35,256 | 16284.7 | 2 |
| 1 | dispatch_key_set | cgrep | yes | 4,716 | 27,116 | 14905.3 | 2 |
| 1 | cuda_graph | baseline | yes | 5,491 | 19,571 | 5445.0 | 1 |
| 1 | cuda_graph | cgrep | yes | 3,312 | 17,392 | 9290.4 | 1 |
| 1 | addmm_path | baseline | yes | 5,583 | 20,815 | 16093.1 | 1 |
| 1 | addmm_path | cgrep | yes | 7,932 | 45,564 | 22769.3 | 4 |
| 2 | autograd_evaluate_function | cgrep | yes | 1,838 | 17,198 | 7214.6 | 1 |
| 2 | autograd_evaluate_function | baseline | yes | 3,352 | 27,800 | 14883.7 | 2 |
| 2 | tensor_iterator_impl | cgrep | yes | 3,806 | 26,462 | 13995.8 | 2 |
| 2 | tensor_iterator_impl | baseline | yes | 22,832 | 55,472 | 26557.8 | 4 |
| 2 | python_arg_parser_impl | cgrep | yes | 5,755 | 36,859 | 18409.4 | 3 |
| 2 | python_arg_parser_impl | baseline | yes | 6,044 | 30,876 | 15056.5 | 2 |
| 2 | dispatch_key_set | cgrep | yes | 3,417 | 27,353 | 16441.8 | 2 |
| 2 | dispatch_key_set | baseline | no | 16,601 | 51,929 | 19693.5 | 3 |
| 2 | cuda_graph | cgrep | yes | 1,936 | 17,168 | 7189.8 | 1 |
| 2 | cuda_graph | baseline | yes | 4,545 | 19,649 | 6658.0 | 1 |
| 2 | addmm_path | cgrep | yes | 8,483 | 48,035 | 21711.1 | 4 |
| 2 | addmm_path | baseline | yes | 49,597 | 223,165 | 51549.5 | 14 |
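Because single runs vary widely (for example, `tensor_iterator_impl` favors the baseline in run 1 and cgrep in run 2), a per-scenario median comparison is a more useful view. The following sketch groups the billable token counts from the table by scenario and mode; the helper name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# (run, scenario, mode, billable_tokens), copied from the Per Scenario table.
ROWS = [
    (1, "autograd_evaluate_function", "baseline", 3371),
    (1, "autograd_evaluate_function", "cgrep", 2833),
    (1, "tensor_iterator_impl", "baseline", 14514),
    (1, "tensor_iterator_impl", "cgrep", 21936),
    (1, "python_arg_parser_impl", "baseline", 6424),
    (1, "python_arg_parser_impl", "cgrep", 3910),
    (1, "dispatch_key_set", "baseline", 13112),
    (1, "dispatch_key_set", "cgrep", 4716),
    (1, "cuda_graph", "baseline", 5491),
    (1, "cuda_graph", "cgrep", 3312),
    (1, "addmm_path", "baseline", 5583),
    (1, "addmm_path", "cgrep", 7932),
    (2, "autograd_evaluate_function", "cgrep", 1838),
    (2, "autograd_evaluate_function", "baseline", 3352),
    (2, "tensor_iterator_impl", "cgrep", 3806),
    (2, "tensor_iterator_impl", "baseline", 22832),
    (2, "python_arg_parser_impl", "cgrep", 5755),
    (2, "python_arg_parser_impl", "baseline", 6044),
    (2, "dispatch_key_set", "cgrep", 3417),
    (2, "dispatch_key_set", "baseline", 16601),
    (2, "cuda_graph", "cgrep", 1936),
    (2, "cuda_graph", "baseline", 4545),
    (2, "addmm_path", "cgrep", 8483),
    (2, "addmm_path", "baseline", 49597),
]


def median_by_scenario(rows):
    """Median billable tokens keyed by (scenario, mode)."""
    grouped = defaultdict(list)
    for _run, scenario, mode, billable in rows:
        grouped[(scenario, mode)].append(billable)
    return {key: median(values) for key, values in grouped.items()}


medians = median_by_scenario(ROWS)
for scenario in sorted({s for s, _mode in medians}):
    base = medians[(scenario, "baseline")]
    cg = medians[(scenario, "cgrep")]
    change = 100 * (cg - base) / base
    print(f"{scenario:28s} baseline={base:9.1f} cgrep={cg:9.1f} change={change:+.1f}%")
```

With two runs per mode the median equals the mean of the two runs, so more runs would be needed before the per-scenario medians become robust to outliers such as the 49,597-token baseline run.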
Re-run
python3 scripts/benchmark_codex_agent_efficiency.py --repo /path/to/pytorch --cgrep-bin /path/to/cgrep