Research Blog

Some of these posts cover older results: experiments I ran around a year ago but didn't get a chance to write up at the time. Quality may be uneven; treat them as informal lab notes rather than polished research.

Maximally Helpful, Appropriately Honest: Abstention as a Spectrum

May 22, 2026 24 min read

Most abstention work treats 'should the model answer?' as a binary. We argue that's the wrong frame: an underspecified question wants clarification, a false-premise question wants correction, a time-sensitive one wants verification guidance — not the same generic 'I don't...

Search-R1, Re-examined: Does the Model Actually Learn to Search and Reason?

May 22, 2026 23 min read

We retrained Search-R1 across model sizes, RL algorithms, training distributions, search budgets, and broken-retriever settings — and ablated the scaffolding. The model's QA score barely moves when the think protocol is removed; it collapses when the retriever returns nothing; and...

When the Judge Gets Played: An Accidental Reward Hacking Case Study

April 15, 2026 9 min read

While sweeping reward compositions for our adaptive-reward paper, one configuration — Qwen3-4B trained with a HotpotQA-only judge — abruptly broke the SimpleQA leaderboard at training step ~400, jumping from 5% to 95% judged-correct in a few hundred steps. Across every...