
PyTorch Codex Agent Efficiency Benchmark

Generated: 2026-02-22T08:23:14.041127+00:00

What This Measures

  • Real codex exec runs on a local PyTorch repository.
  • Baseline mode: autonomous retrieval with cgrep disallowed.
  • cgrep mode: cgrep command usage required.
  • Primary metric: Codex provider-reported billable tokens (input - cached_input + output).
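The primary metric can be written as a one-line function. This is a minimal sketch of the formula stated above; the function and parameter names are illustrative, and the sample numbers are made up for the example rather than taken from any run.

```python
def billable_tokens(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> int:
    """Billable tokens as defined above: input - cached_input + output."""
    return input_tokens - cached_input_tokens + output_tokens


# Illustrative usage record (hypothetical values, not benchmark data).
usage = {"input": 12_000, "cached_input": 9_000, "output": 500}
print(billable_tokens(usage["input"], usage["cached_input"], usage["output"]))  # 3500
```

Because cached input is subtracted, a run that re-reads the same context can show a high total token count while its billable count stays low.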

Scenario Set

  • autograd_evaluate_function
  • tensor_iterator_impl
  • python_arg_parser_impl
  • dispatch_key_set
  • cuda_graph
  • addmm_path

For each scenario:

  • Success requires all marker groups to be satisfied from returned evidence.
  • Baseline allows only grep/rg/sed/cat/head/tail/git commands.
  • Baseline prompt includes a single focused rg starter hint (grep_pattern) per scenario.
  • cgrep mode requires cgrep search|s or cgrep definition|d commands.
  • cgrep prompt includes scenario-specific high-signal starter commands (cgrep_commands) and recommends scoped compact output (--format json2 --compact).
  • Disallowed command usage or missing required tool usage marks the run as failed.
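The command rules above could be checked roughly as follows. This is a hedged sketch, not the benchmark script's actual logic: the function name `run_passes_policy`, the assumption that each command is a plain shell string whose first token is the executable, and the decision to ignore pipelines are all hypothetical. Only the allowed-command list and the required cgrep subcommands come from the rules above.

```python
import shlex

# Taken from the rules above; everything else in this sketch is an assumption.
BASELINE_ALLOWED = {"grep", "rg", "sed", "cat", "head", "tail", "git"}
CGREP_SUBCOMMANDS = {"search", "s", "definition", "d"}


def run_passes_policy(mode: str, commands: list[str]) -> bool:
    """Return True if the executed commands satisfy the mode's tool policy."""
    # Split each command into tokens and drop empty commands.
    parsed = [tokens for tokens in (shlex.split(c) for c in commands) if tokens]
    if mode == "baseline":
        # Any executable outside the allowed set (including cgrep) fails the run.
        return all(tokens[0] in BASELINE_ALLOWED for tokens in parsed)
    if mode == "cgrep":
        # At least one `cgrep search|s` or `cgrep definition|d` call is required.
        return any(
            tokens[0] == "cgrep" and len(tokens) > 1 and tokens[1] in CGREP_SUBCOMMANDS
            for tokens in parsed
        )
    raise ValueError(f"unknown mode: {mode}")


print(run_passes_policy("baseline", ["rg -n addmm aten/"]))  # True
print(run_passes_policy("baseline", ["cgrep search addmm"]))  # False
```

A real harness would also need to verify the marker groups against the returned evidence, which this sketch does not attempt.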

Single-run variance can be high. Prefer --runs >= 2 and compare medians for release decisions.
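As an illustration of why medians matter, the sketch below compares the two addmm_path runs per mode, using the billable token counts from the Per Scenario table. The dictionary layout is an assumption made for the example.

```python
from statistics import median

# Billable tokens for addmm_path, runs 1 and 2, from the Per Scenario table.
addmm_path = {
    "baseline": [5_583, 49_597],
    "cgrep": [7_932, 8_483],
}

for mode, tokens in addmm_path.items():
    spread = max(tokens) - min(tokens)
    print(f"{mode}: median={median(tokens)}, spread={spread}")
```

On run 1 alone, baseline looks cheaper than cgrep for this scenario. The baseline spread between the two runs is far larger than the cgrep spread, which is why a single run is a weak basis for a release decision.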

Environment

  • OS: macOS-26.3-arm64-arm-64bit
  • Python: 3.12.4
  • codex model: gpt-5-codex
  • reasoning effort: medium
  • runs per scenario/mode: 2
  • cgrep commit: 7445b45
  • pytorch commit: 66e77ae932c
  • PyTorch files (git ls-files): 21634

Aggregate (All Cases)

| Mode | Cases | Success rate | Median billable tokens | P95 billable tokens | Median total tokens | Median duration (ms) | Median commands |
|---|---|---|---|---|---|---|---|
| baseline | 12 | 91.7% | 6234 | 34876 | 29722 | 15574.8 | 2.0 |
| cgrep | 12 | 100.0% | 3858 | 14537 | 26789 | 14450.5 | 2.0 |
  • Total billable tokens (baseline, no cgrep): 151,466
  • Total billable tokens (cgrep): 69,874
  • Billable token reduction: 53.9%
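The aggregate figures can be recomputed from the per-run billable token counts in the Per Scenario table. The sketch below is one way to do it. The `percentile` helper uses linear interpolation between ranks, which is an assumption about how the report computes P95, although it reproduces the reported values.

```python
from statistics import median

# Billable tokens per run, copied from the Per Scenario table (runs 1 and 2).
baseline = [3371, 14514, 6424, 13112, 5491, 5583,
            3352, 22832, 6044, 16601, 4545, 49597]
cgrep = [2833, 21936, 3910, 4716, 3312, 7932,
         1838, 3806, 5755, 3417, 1936, 8483]


def percentile(values, q):
    """Percentile with linear interpolation between ranks (assumed method)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


reduction = 1 - sum(cgrep) / sum(baseline)

print(sum(baseline), sum(cgrep))        # totals
print(median(baseline), median(cgrep))  # medians
print(round(percentile(baseline, 0.95)), round(percentile(cgrep, 0.95)))
print(f"{reduction:.1%}")               # billable token reduction
```

Note that the reduction is computed on totals, so the single 49,597-token baseline run on addmm_path contributes a large share of it; the median comparison is less sensitive to that outlier.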

Per Scenario

| Run | Scenario | Mode | Success | Billable tokens | Total tokens | Duration (ms) | Commands |
|---|---|---|---|---|---|---|---|
| 1 | autograd_evaluate_function | baseline | yes | 3,371 | 17,451 | 8637.0 | 1 |
| 1 | autograd_evaluate_function | cgrep | yes | 2,833 | 16,913 | 5620.8 | 1 |
| 1 | tensor_iterator_impl | baseline | yes | 14,514 | 74,162 | 19344.6 | 5 |
| 1 | tensor_iterator_impl | cgrep | yes | 21,936 | 70,576 | 42736.2 | 6 |
| 1 | python_arg_parser_impl | baseline | yes | 6,424 | 28,568 | 8865.1 | 2 |
| 1 | python_arg_parser_impl | cgrep | yes | 3,910 | 26,182 | 11144.9 | 2 |
| 1 | dispatch_key_set | baseline | yes | 13,112 | 35,256 | 16284.7 | 2 |
| 1 | dispatch_key_set | cgrep | yes | 4,716 | 27,116 | 14905.3 | 2 |
| 1 | cuda_graph | baseline | yes | 5,491 | 19,571 | 5445.0 | 1 |
| 1 | cuda_graph | cgrep | yes | 3,312 | 17,392 | 9290.4 | 1 |
| 1 | addmm_path | baseline | yes | 5,583 | 20,815 | 16093.1 | 1 |
| 1 | addmm_path | cgrep | yes | 7,932 | 45,564 | 22769.3 | 4 |
| 2 | autograd_evaluate_function | cgrep | yes | 1,838 | 17,198 | 7214.6 | 1 |
| 2 | autograd_evaluate_function | baseline | yes | 3,352 | 27,800 | 14883.7 | 2 |
| 2 | tensor_iterator_impl | cgrep | yes | 3,806 | 26,462 | 13995.8 | 2 |
| 2 | tensor_iterator_impl | baseline | yes | 22,832 | 55,472 | 26557.8 | 4 |
| 2 | python_arg_parser_impl | cgrep | yes | 5,755 | 36,859 | 18409.4 | 3 |
| 2 | python_arg_parser_impl | baseline | yes | 6,044 | 30,876 | 15056.5 | 2 |
| 2 | dispatch_key_set | cgrep | yes | 3,417 | 27,353 | 16441.8 | 2 |
| 2 | dispatch_key_set | baseline | no | 16,601 | 51,929 | 19693.5 | 3 |
| 2 | cuda_graph | cgrep | yes | 1,936 | 17,168 | 7189.8 | 1 |
| 2 | cuda_graph | baseline | yes | 4,545 | 19,649 | 6658.0 | 1 |
| 2 | addmm_path | cgrep | yes | 8,483 | 48,035 | 21711.1 | 4 |
| 2 | addmm_path | baseline | yes | 49,597 | 223,165 | 51549.5 | 14 |

Re-run

python3 scripts/benchmark_codex_agent_efficiency.py --repo /path/to/pytorch --cgrep-bin /path/to/cgrep