Research Blog

Some of these posts cover older results: experiments I ran about a year ago but didn't get a chance to write up at the time. Quality may be uneven; treat them as informal lab notes rather than polished research.

Search-R1, Re-examined: Does the Model Actually Learn to Search and Reason?

23 min read

We retrained Search-R1 across model sizes, RL algorithms, training distributions, search budgets, and broken-retriever settings — and ablated the scaffolding. The model's QA score barely moves when the think protocol is removed; it collapses when the retriever returns nothing; and...

When the Judge Gets Played: An Accidental Reward Hacking Case Study

9 min read

While sweeping reward compositions for our adaptive-reward paper, one configuration — Qwen3-4B trained with a HotpotQA-only judge — abruptly broke the SimpleQA leaderboard at training step ~400, jumping from 5% to 95% judged-correct in a few hundred steps. Across every...