Official codebase of the paper "Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents",
by Changdae Oh¹, Wendi Li¹, Seongheon Park¹, Samuel Yeh¹,
Tanwi Mallick², and Sharon Li¹.
¹University of Wisconsin–Madison, ²Argonne National Laboratory
- [Jun 24, 2026] The paper is now live on arXiv 🔗; we release the initial codebase.
- [Jun 1, 2026] Progress Advantage was accepted to the ICML 2026 workshop RLxF: Reinforcement Learning from World Feedback 🎉 - Seongheon will present the poster!
Progress advantage,
A_t = β · log [ π_θ(a_t | s_t) / π_ref(a_t | s_t) ],
is a training-free trajectory scorer for LLM agents that can be built from any pair of an RL-trained policy π_θ and its (base) reference policy π_ref. This repository provides the essential code to reproduce three application scenarios from the paper: best-of-N test-time scaling (TTS), trajectory-level uncertainty quantification (UQ), and step-level failure attribution (FA).
Progress Advantage and the log-prob / certainty methods share one forward pass per model (runners/{tau2_uq,tau2_bon,fa}.py); the off-the-shelf RM baselines live in runners/baselines.py.
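As a minimal sketch of the formula above (not the repository's implementation), the snippet below computes A_t for each action step from per-token log-probs and reduces the steps to one trajectory score. The helper names `aggregate` and `progress_advantage`, the aggregation names, and the toy numbers are all assumptions made for illustration. With `token_agg="sum"`, the per-step value equals β times the log-ratio of the whole action's probability, since an action's log-probability is the sum of its token log-probs.

```python
def aggregate(values, how):
    """Reduce a list of floats; the aggregation names here are hypothetical."""
    if how == "mean":
        return sum(values) / len(values)
    if how == "sum":
        return sum(values)
    if how == "min":
        return min(values)
    if how == "max":
        return max(values)
    if how == "last":
        return values[-1]
    raise ValueError(f"unknown aggregation: {how}")


def progress_advantage(logp_policy, logp_ref, beta=1.0, token_agg="sum"):
    """Return one A_t per action step.

    logp_policy / logp_ref hold, for every action step, the list of
    per-token log-probs under pi_theta and pi_ref respectively.
    """
    return [
        beta * (aggregate(lp, token_agg) - aggregate(lr, token_agg))
        for lp, lr in zip(logp_policy, logp_ref)
    ]


# Toy two-step trajectory with made-up log-probs (action tokens only).
logp_policy = [[-0.5, -1.0], [-2.0, -2.0]]
logp_ref = [[-1.0, -1.5], [-1.0, -2.0]]

advantages = progress_advantage(logp_policy, logp_ref, beta=1.0, token_agg="sum")
trajectory_score = aggregate(advantages, "min")  # step-level aggregation
print(advantages, trajectory_score)
```

In this toy case the policy prefers the first action more than the reference does (positive A_t) and the second action less (negative A_t), so a `min` step aggregation scores the trajectory by its weakest step.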
pa/
├── aggregations.py # token/step aggregations
├── baselines.py
├── data.py # on-demand artifact download
├── models.py # MODEL_PAIRS - policy/reference checkpoints
├── scoring.py # one-pass log-prob extraction
└── trajectory.py # tau2/Who&When message -> action span rendering
runners/
├── tau2_uq.py # UQ: trajectory-level success prediction
├── tau2_bon.py # TTS: best-of-N selection
├── fa.py # FA: step-level error prediction
└── baselines.py # WildReward / ThinkPRM for the above
scripts/ # bash launchers
data/ # fetched on demand via `python -m pa.data`; see data/README.md
python -m venv .venv && source .venv/bin/activate
pip install torch transformers numpy scipy scikit-learn huggingface_hub
# optional, only needed for the ThinkPRM baseline and any vLLM-based rollout work:
pip install vllm

The library is built mainly on PyTorch + Hugging Face Transformers; vllm is
required only for pa.baselines.ThinkPRMScorer.
The trajectory artifacts (~67 MB) are not committed; they are pulled on
demand from a Hugging Face dataset whose tree mirrors data/:
python -m pa.data --scenario all # uq, tts, fa

The reproduction scripts below call this automatically when data/ is
missing. Point it at your own mirror with --repo-id <user>/<repo> or
PA_DATA_REPO=<user>/<repo>. See data/README.md for the layout.
One progress-advantage run per scenario (auto-fetches its data; PAIR ∈ {gemma4-4b, qwen3.5-9b}):
PAIR=qwen3.5-9b bash scripts/tau2_uq_onpolicy.sh # UQ -> results/uq_onpolicy/
PAIR=qwen3.5-9b bash scripts/webshop_bon8.sh # TTS -> results/bon8/
PAIR=qwen3.5-9b bash scripts/fa.sh # FA -> results/fa/

All three scenarios reproduce out of the box for the Gemma4-4B and
Qwen3.5-9B backbones. Each loads π_θ and π_ref once; on a single
GPU, evaluating one (base, post-trained) pair takes a few minutes.
The scripts fetch the four greedy-decoding tau2-bench trajectory files
corresponding to the Gemma4-4B and Qwen3.5-9B columns of the
greedy-decoding results table (~29 MB total), then run on-policy UQ:
PAIR=qwen3.5-9b bash scripts/tau2_uq_onpolicy.sh
PAIR=gemma4-4b bash scripts/tau2_uq_onpolicy.sh

Each PAIR run loads π_θ and π_ref and scores all methods on both
Airline and Retail, writing AUROC and Spearman ρ to
results/uq_onpolicy/<pair>_<domain>.json. The progress_advantage
AUROC should land close to the matching Gemma4-4B / Qwen3.5-9B cells
of the corresponding table in our manuscript.
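To make the AUROC number concrete, here is a small dependency-free sketch of how a trajectory score can be turned into an AUROC against binary success labels. It is an illustration only: the function name `auroc` and the toy labels and scores are assumptions, and the repository may compute the metric differently (for example through scikit-learn).

```python
def auroc(labels, scores):
    """Probability that a random success outscores a random failure.

    Ties count as half a win, which matches the usual rank-based
    (Mann-Whitney) definition of the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one success and one failure")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 1 = task succeeded, 0 = task failed; scores are made up.
labels = [1, 0, 1, 0]
scores = [0.8, 0.1, 0.3, 0.4]
value = auroc(labels, scores)
print(value)
```

An AUROC of 0.5 means the score carries no ranking information about success, while 1.0 means every successful trajectory is scored above every failed one.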
Eight WebShop rollouts per task (100 tasks, k=8) are fetched for both
backbones under data/webshop/bon8/:
PAIR=qwen3.5-9b bash scripts/webshop_bon8.sh
PAIR=gemma4-4b bash scripts/webshop_bon8.sh

Each run scores the 8 rollouts of every task with progress advantage,
using each backbone's tuned token/step aggregation (max/min for
Gemma4-4B, min/last for Qwen3.5-9B; override with TOKEN_AGG/STEP_AGG),
keeps the best, and writes the selection's success rate to
results/bon8/webshop_<pair>_progress_advantage.json.
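The selection step can be sketched in a few lines. This is a hedged illustration rather than the code in runners/tau2_bon.py: the rollout layout (a dict with `advantages` and `success` keys), the function names, and the numbers are hypothetical, and the example uses a `min` step aggregation over precomputed per-step advantages.

```python
def select_best(rollouts, score_fn):
    """Keep the rollout with the highest trajectory score."""
    return max(rollouts, key=score_fn)


def bon_success_rate(tasks, score_fn):
    """Fraction of tasks whose selected rollout succeeded."""
    picked = [select_best(rollouts, score_fn) for rollouts in tasks.values()]
    return sum(r["success"] for r in picked) / len(picked)


def min_step_score(rollout):
    """Trajectory score = weakest per-step progress advantage."""
    return min(rollout["advantages"])


# Two toy tasks with two rollouts each (the README's setting uses 8).
tasks = {
    "task-a": [
        {"advantages": [0.2, -0.9], "success": False},
        {"advantages": [0.1, 0.3], "success": True},
    ],
    "task-b": [
        {"advantages": [0.5, 0.4], "success": False},
        {"advantages": [-0.2, 0.6], "success": True},
    ],
}
rate = bon_success_rate(tasks, min_step_score)
print(rate)
```

In this toy data the scorer picks the successful rollout for task-a and a failed one for task-b, which is why the choice of token/step aggregation can matter per backbone.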
The full Who & When dataset is fetched to data/who_and_when/:
PAIR=qwen3.5-9b bash scripts/fa.sh
PAIR=gemma4-4b bash scripts/fa.sh

Each run scores every step of both splits and predicts the decisive
error step, writing step-level accuracy / MAE to results/fa/<pair>.json.
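One way to picture the prediction and its metrics is the sketch below. It uses the "lowest running reward" rule that this README gives for FA, and it assumes that accuracy means an exact step match and that MAE is the mean absolute distance between predicted and annotated step indices; those metric definitions, the function names, and the toy numbers are assumptions, not taken from runners/fa.py.

```python
from itertools import accumulate


def predict_error_step(advantages):
    """Index of the step where the cumulative progress advantage is lowest."""
    running = list(accumulate(advantages))
    return min(range(len(running)), key=running.__getitem__)


def step_metrics(predicted, actual):
    """Exact-match accuracy and mean absolute step distance."""
    n = len(actual)
    accuracy = sum(p == a for p, a in zip(predicted, actual)) / n
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    return accuracy, mae


# Per-step advantages for three toy trajectories, plus annotated error steps.
trajectories = [
    [0.5, -1.0, 0.25],
    [-0.5, 0.25, 0.5],
    [0.25, 0.25, -1.0],
]
true_steps = [1, 0, 1]

predicted = [predict_error_step(a) for a in trajectories]
accuracy, mae = step_metrics(predicted, true_steps)
print(predicted, accuracy, mae)
```

On ties, `min` returns the earliest step, so this sketch favors the first point at which the running total bottoms out.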
runners/baselines.py reproduces the WildReward / ThinkPRM rows of the
same tables. TASK ∈ {uq, bon, fa}, MODEL ∈ {thinkprm, wildreward}.
ThinkPRM (process RM) covers all three; WildReward (outcome RM) covers
uq/bon only — it has no per-step signal for fa:
TASK=uq MODEL=wildreward PAIR=qwen3.5-9b bash scripts/baselines.sh
TASK=bon MODEL=wildreward PAIR=gemma4-4b bash scripts/baselines.sh
TASK=fa MODEL=thinkprm MODEL_ID=launch/ThinkPRM-14B bash scripts/baselines.sh

# one forward pass per model gives per-token log p(a_t | s_t)
for t, action_step in enumerate(trajectory): # action tokens only
    lp_policy = aggregate_tokens(logp_θ[action_step], token_agg)
    lp_ref = aggregate_tokens(logp_ref[action_step], token_agg)
    A[t] = beta * (lp_policy - lp_ref) # progress advantage
score = aggregate_steps(A, step_agg) # trajectory scalar

# UQ: does the score rank successes above failures?
auroc = roc_auc_score(labels, [score(traj) for traj in trajectories])
# TTS: keep the highest-scoring of N rollouts per task
best = max(rollouts, key=score)
# FA: flag the step with the lowest running reward
err_step = argmin(cumsum(A)) # or sharpest drop / earliest below threshold

- We release our codebase and artifacts under the MIT license. The datasets and models used in this project have their own licenses, as noted below.
- If you use the progress advantage method with other model families or datasets, you should check their licenses.
| Dataset Name | License |
|---|---|
| BFCLv4 | Apache-2.0 |
| WebShop | MIT |
| AgentDojo | MIT |
| | MIT |
| Who & When | MIT |
| Organization | Model Pair | License |
|---|---|---|
| Qwen | (Qwen3.5-9B, Qwen3.5-9B-Base) | Apache-2.0 |
| Qwen | (Qwen3-14B, Qwen3-14B-Base) | Apache-2.0 |
| | (Gemma-4-E4B-it, Gemma-4-E4B) | Apache-2.0 |
| AI2 | (Olmo-3-7B-Instruct, Olmo-3-7B-Instruct-DPO) | Apache-2.0 |