Search-R1, Re-examined: Does the Model Actually Learn to Search and Reason?

TL;DR

This post asks four questions about Search-R1-style RL training (LLM + retriever, rewarded on QA correctness) and answers each one with retrained sweeps:

Does the model learn to search adaptively? Only weakly. Training distribution shifts the model’s overall search rate by ~0.2 (HotpotQA-trained > NQ-trained on every eval). Test-time adaptation exists for 7B / Instruct + PPO (search more on multi-hop than single-hop), but 3B-Base models are flat — same number of searches on PopQA as on MuSiQue.
Does the model learn to use up its search budget? Partially. PPO grows into the 3-search budget over training; GRPO plateaus at ~1–2. Most runs eventually collapse. Telling the model its budget (BT1) acts as a regularizer toward ~2 searches — but doesn’t make the count query-conditioned.
Does higher-level reasoning emerge when we break the retriever? No. With a random retriever (search actively hurts reward), every model stops searching within ~150 steps. With an empty retriever (search is useless but reward-neutral), PPO instead grows the search count to 3–4. The model only avoids the tool when the tool hurts its reward; “API cost” is invisible to it.
Is the reward signal a reliable indicator of model health? No. Test_score peaks early and then collapses — and the reasoning has collapsed well before the score curve catches up. Sample log excerpts show the <think> channel degenerating into rows of broken <think] tokens while the answer remains correct (retrieval fills in, yes/no priors save it, or a memorized one-liner lands). Rewarding only the final answer is a lagging indicator; you need to monitor the reasoning chain itself.

Setup

The sweep covers:

Base models: Qwen2.5-{3B, 7B} × {Base, Instruct}
RL algorithm: PPO and GRPO
Training data: NQ (single-hop) and HotpotQA (multi-hop)
Max turns: 1 and 4
Budget transparency: BT0 (budget hidden) and BT1 (budget told to model)
Chain-of-thought: with <think> and without (no_think_rl=true)
Adversarial retriever: normal, empty content (returns no documents), random content (returns random unrelated documents)
6 eval benchmarks held constant: NQ, TriviaQA, PopQA (single-hop) and HotpotQA, 2WikiMultiHopQA, MuSiQue (multi-hop)

For each run I take the best validation step by mean test score across the 6 evals. Raw numbers: search_plot/all_variants_best.json.

Headline axes (size, base/instruct, algorithm)

Before getting to the more interesting ablations, the standard axes look like this:

Averages over training set / eval datasets for each cell. Larger models help. PPO and GRPO are within a few percentage points of each other. Instruct usually edges out Base, but the gap is small once both have been RL-finetuned.

These differences are real but unremarkable — they’re what you’d expect from any RL-finetuned QA stack. The interesting questions live in the four sections that follow.

Section 1: Can the model search adaptively?

Before asking whether the chain-of-thought matters or whether broken retrievers break the protocol, the most basic question to ask of a search-using agent is: does the model adapt how much it searches to the difficulty of the task?

Two sub-questions:

Q1 (training-distribution adaptation): Does training on multi-hop data (HotpotQA) teach the model to search more than training on single-hop data (NQ)? Does that effect persist across eval distributions?
Q2 (test-time adaptation): For a fixed trained model, does it issue more searches on multi-hop test questions than on single-hop test questions?

Q1: Training distribution shifts the overall search rate

Each pair of bars: average # search actions on that eval dataset, NQ-trained models (blue) vs HotpotQA-trained models (red). Averaged across model size, base/ins, and PPO/GRPO. Error bars = std across the 8 model variants per training set.

The answer is yes, but as a uniform global shift, not a localized one:

Eval	NQ-trained #srch	HotpotQA-trained #srch	Δ
NQ (single-hop)	1.29	1.48	+0.19
TriviaQA (single-hop)	1.29	1.43	+0.13
PopQA (single-hop)	1.34	1.55	+0.21
HotpotQA (multi-hop)	1.56	1.79	+0.22
2Wiki (multi-hop)	1.88	2.11	+0.22
MuSiQue (multi-hop)	1.86	2.15	+0.30

HotpotQA-trained models search ~0.2 more times per query than NQ-trained models — on every eval, including single-hop ones. The training distribution doesn’t teach the model “MuSiQue is harder, search more on it”; it teaches the model “in general, search a bit more.” Both columns also show the same left-to-right gradient (single → multi-hop evals get more searches), so the absolute search count is roughly additive across training-distribution effect and test-difficulty effect.

Q2: Test-time adaptation is real, but only for larger / instruct models

One line per trained model, connecting its average # searches on single-hop evals (left point) to multi-hop evals (right point). A steeply rising line = the model adapts at test time; a flat line = the model does the same thing regardless of question type. Color = training distribution.

Aggregated:

Group	SH avg	MH avg	MH − SH
All models	1.40	1.89	+0.49
NQ-trained	1.31	1.77	+0.46
HotpotQA-trained	1.49	2.02	+0.53

But the average hides bimodality. Per-model deltas:

Model	SH	MH	MH − SH
nq-3B-Base-GRPO	0.99	0.99	0.00
nq-3B-Base-PPO	1.01	1.02	+0.01
nq-3B-Ins-GRPO	1.00	1.00	0.00
hpqa-3B-Base-GRPO	1.00	1.00	0.00
hpqa-3B-Base-PPO	1.00	1.00	0.00
nq-3B-Ins-PPO	1.46	2.04	+0.58
hpqa-3B-Ins-GRPO	1.37	1.97	+0.60
hpqa-3B-Ins-PPO	2.24	2.78	+0.54
nq-7B-Base-GRPO	1.39	2.05	+0.66
hpqa-7B-Base-GRPO	1.36	1.98	+0.62
hpqa-7B-Base-PPO	1.41	2.19	+0.78
nq-7B-Ins-GRPO	1.29	2.06	+0.77
nq-7B-Base-PPO	1.79	2.59	+0.79
hpqa-7B-Ins-GRPO	1.46	2.32	+0.86
hpqa-7B-Ins-PPO	2.03	2.87	+0.84
nq-7B-Ins-PPO	1.53	2.40	+0.87

The pattern: 3B base models and one 3B instruct + GRPO are completely flat (Δ ≈ 0 — no test-time adaptation). Every 7B model, and every 3B PPO instruct model, adapts (Δ between +0.5 and +0.9). Capacity and on-policy advantage estimation both matter.

What this says about “adaptivity”

Pulling it together:

Training distribution acts like a global thermostat. Training on harder questions raises the model’s overall search count by ~0.2 uniformly across eval datasets. It does not teach the model “this type of question deserves more searches.”
Some models do show test-time adaptation — they search more on multi-hop than on single-hop at the same training budget. But this only emerges with 7B scale or with instruction-tuned + PPO. The 3B-base recipes (which are what people often start with) show no test-time adaptation at all.
The two effects are roughly additive. A HotpotQA-trained 7B-Ins-PPO uses ~2.0 searches on single-hop and ~2.9 on multi-hop — both higher than its NQ-trained counterpart, and with the same gap.
What’s missing. None of these runs produce the strong adaptive behavior the agentic-RL framing would predict: e.g. 1 search on simple NQ questions, 4 on MuSiQue. The largest within-model spread we see is +0.9 searches across the single-hop / multi-hop divide, which is far less than the per-question structural difference between PopQA and MuSiQue.

Section 2: How does the model use its search budget?

With max_turns=4, the model has room for up to 3 search actions before it must answer. Two sub-questions:

Q1 (does it grow into the budget?): As training proceeds, does the model gradually issue more searches per query, eventually saturating the budget — or does it settle into a steady-state search rate well below the budget?
Q2 (does telling help?): If we explicitly write “you have N searches” into the prompt (BT1) instead of hiding it (BT0), does the model use the budget more effectively?

Q1: Does the model gradually use up its budget?

Each panel = (training set × RL algo). Each line = (model size × base/instruct). The horizontal dotted line is the search budget of 3. The model “uses up the budget” if its curve rises toward 3 and stays there. A sudden drop to 0 is training collapse (the model degenerates and stops producing valid <search> tags).

Three patterns emerge:

PPO grows into the budget; GRPO doesn’t. Across both training sets, every PPO run rises substantially during training — and the NQ-7B-Base-PPO and NQ-7B-Ins-PPO runs actually saturate above the budget (~4 searches at peak — possible because a turn can contain more than one <search> tag before the model is forced to answer). GRPO runs typically peak well under 2 searches and plateau there. This is the clearest “PPO vs GRPO” effect in the whole sweep: PPO converts unused budget into more retrieval; GRPO doesn’t.
Capacity gates the budget usage. 7B models climb higher and faster than 3B models, and the 3B-Base + GRPO recipe (the lowest-resource cell of the sweep) flatlines at 1.0 forever. The smallest models simply never figure out that there is a budget to spend.
Most runs eventually collapse. Several curves rise toward the budget, peak, then crash to 0 (the model stops producing valid search tags entirely and emits the !!!!!!! / <extracted answer: and> pathology). The “best validation step” used in earlier sections is exactly the peak of these curves; what the curves show is that the same model, if trained longer, mostly unlearns the search behavior. The budget is only briefly fully used.

So the headline answer to Q1: the model does grow into its budget — but only when (a) algorithm is PPO, (b) model is 7B or instruction-tuned, and (c) training stops at the peak before collapse. Run the same recipe with 3B-Base + GRPO or just train longer, and the budget goes unused.

Q2: Does telling the model its budget help?

The BT0 / BT1 axis writes the phrase “you have at most N searches” into the prompt (BT1=on) or omits it (BT0). We only ran BT1 on the format-reward 3B-Instruct runs, so the comparison is restricted to four model variants:

Solid blue: BT0 (budget hidden). Solid orange: BT1 (budget explicitly told to the model). Budget = 3 marked as dotted line.

The effect is asymmetric:

GRPO runs (left column): BT1 lifts the curve a bit — telling GRPO the budget gets it to use slightly more of it (peak goes from ~1.4–1.5 to ~1.7–1.9). The model is otherwise under-using the budget; the prompt nudges it upward.
PPO runs (right column): BT1 pulls the curve down. PPO without budget transparency was saturating at ~4 searches (i.e., using up the entire budget and then some); BT1 caps it around 2. The model is otherwise over-using the budget; the prompt regularizes it downward.

The summary (using best-step averages):

Left: best avg test score under BT0 vs BT1. Right: best-step avg # searches under BT0 vs BT1.

Model	BT0 #srch	BT1 #srch	BT0 score	BT1 score
HotpotQA-3B-Ins-GRPO	1.40	1.55	0.342	0.354
HotpotQA-3B-Ins-PPO	3.98	1.80	0.403	0.329
NQ-3B-Ins-GRPO	1.14	1.58	0.334	0.350
NQ-3B-Ins-PPO	3.67	2.24	0.356	0.359

So:

For GRPO, BT1 helps a little — the model uses ~0.4 more searches per query and gains 1–2 pp of score.
For PPO, BT1 hurts (or barely matters) — telling the model the budget suppresses the saturating-PPO behavior, and on HotpotQA that costs 7 pp of score (the saturated PPO behavior was actually score-positive there).

The takeaway: the budget message acts as a regularizer, not as an enabler. Telling the model “you have 3 searches” doesn’t make it pick the right number for each query; it shifts its average toward 2 regardless of what it was doing before. If your base behavior is under-using the budget, BT1 pulls you up. If your base behavior is over-using the budget, BT1 pulls you down. Net effect on score: small and inconsistent.

This is what you’d expect from a model that has a single “how often to search” knob rather than a per-query “is one more search worth it” decision. A real budget-aware policy would use more searches on multi-hop questions and fewer on single-hop, which is exactly what we did not see in Section 1.

Section 3: Does higher-level reasoning emerge under stress-testing?

Sections 1 and 2 looked at “normal” training. What if we use the training environment to stress-test whether the model has learned anything beyond imitating the search protocol?

The cleanest stress test is: break the retriever during training, and watch how the model’s search behavior evolves over training steps. If the model has built any real “reason about when retrieval is useful” skill, it should respond to the broken retriever; if the model is just executing the search-tool protocol because the protocol is rewarded, it should mostly keep searching.

We retrained Search-R1 with two different broken retrievers:

Empty retriever — every search returns no documents. This is the softer test: searching is useless, but it does not actively hurt the model’s reward (the search returns nothing; the model answers from whatever it would have answered without the search). The only “cost” of searching here is the wasted API call — which the reward function does not see.
Random retriever — every search returns three random unrelated documents. This is the harder test: searching is actively harmful. The injected random passages mislead the answer, so the model’s reward goes down compared to never searching at all.

The hypothesis this lets us test is sharp:

The RL signal does not reward “API efficiency” or “not wasting tool calls” — it only rewards getting the final answer correct. So the model should only learn to stop searching when searching actively hurts the reward. It should not learn to stop searching when searching is merely useless. In other words: we expect “stop searching” to emerge under the random retriever but not under the empty retriever.

(That hypothesis is essentially: the model won’t naturally optimize for the cost of search, because the reward never tells it search is expensive.)

To test this we need training-step trajectories, not just final numbers — a final # of searches near 0 could be either “model learned to stop” or “model collapsed and stopped emitting valid tags.” The trajectory tells us which.

Training trajectories: Base models

Each panel = (training set × algorithm). Green = normal retriever. Orange = empty retriever. Red = random retriever. Horizontal dotted line = search budget of 3.

Reading off the curves:

Random retriever (red, where available): the model learns to stop, fast. In all three random-retriever runs, the curve starts near 1.1 searches and is at zero within ~150 training steps. The RL signal sees the random docs hurting the answer, the gradient pushes search probability down, and the model stops calling the tool. This is exactly what the hypothesis predicts.
Empty retriever (orange): the picture is mixed, and tells a more interesting story.
- On HotpotQA-trained models (bottom row), the empty-retriever curve stays close to the normal-retriever curve — both flat around 1.0 for the base/GRPO and base/PPO cells. The model “doesn’t learn to stop,” but in this case it also wasn’t searching very much to begin with.
- On NQ-trained models (top row), the empty-retriever curve goes up, not down. NQ-3B-Base-PPO with empty content climbs from ~1.1 to ~3.2 searches per query over training — the model is issuing more wasted searches at the end of training than at the start. Empty content gives the model no negative signal, so PPO’s exploration pushes it toward “try more searches” rather than “stop searching.”

The trajectory shape is the important detail. If you only looked at the best validation step, the NQ-Base-PPO-empty result is ~1.25 searches — looks like “didn’t change much.” But the trajectory shows the model actively grew into the broken retriever, in the direction of using more of it, not less.

Training trajectories: Instruct models

We didn’t run the random retriever on Instruct, only empty. The empty trajectories:

Instruct variants. Random-retriever runs were not done for these cells; the orange-only curves are empty vs normal.

Two failure modes appear:

NQ-3B-Ins-PPO under empty content peaks at 4.5 searches per query during training before collapsing to 0. The model spent ~250 steps escalating its search count under a retriever that never gave it anything, and only stopped when the run collapsed entirely. This is the most striking example in the whole sweep of “PPO + reward-neutral useless tool = model uses the tool more, not less.”
NQ-3B-Ins-GRPO under empty content flatlines at ~1.0 until it collapses to 0 at the end. GRPO is less exploratory than PPO under this signal — it doesn’t grow the search count, but it also doesn’t reduce it. The collapse to 0 at the end is the model’s outputs degenerating, not the model “learning to stop.”

What the trajectories say about higher-level reasoning

If the model had built a real internal model of “is this tool useful for this query?” — the kind of meta-reasoning a deployed agent needs to be cost-aware in the wild — we’d expect symmetric behavior under empty and random retrievers, since both are equally useless from a “do I have evidence?” point of view.

What we see instead:

Under the random retriever, the model robustly stops searching. The RL gradient is strong (random docs → wrong answer → lower reward → suppress the action), and the policy responds.
Under the empty retriever, the model does anything except “stop because searching is useless.” It keeps searching at its baseline rate, or grows the rate further (PPO), or collapses outright. None of these is the response you’d want from an agent that understood “the tool is broken, so I shouldn’t pay for it.”

The asymmetry is real and it lines up with the hypothesis: the model treats empty and random differently because the reward treats them differently. The “decide whether to retrieve” behavior the agentic-RL framing is supposed to teach is, at best, an artifact of reward shape, not an artifact of reasoning.

In a deployed setting, where each search call costs real money and an empty retriever response should be a strong “stop searching” signal, the trained model would happily keep paying for nothing — or, worse, learn to pay for more of nothing. To get cost-aware behavior, you have to put cost into the reward.

Section 4: Training collapse, and why scoring the answer alone isn’t enough

The training trajectories in Section 2 and Section 3 share a recurring shape: the model’s evaluation score climbs for a while, peaks, and then crashes. The “best validation step” that every Search-R1 number in this post (and in the original paper) is reported from is exactly the peak of that curve.

Below are four representative runs. Each panel overlays the test score (blue, left axis) and the average # search actions (red, right axis), with the best-step marker:

Across very different model / algorithm / train-set combinations, the same pattern: the score and the search count both peak in the first 50–200 steps and then crash by the end. The dashed black line is the “best step” used for evaluation. If you only look at the best-step number, the model looks healthy; if you watch the curve, the model is on its way to collapse the entire time.

There are two reasons to take this seriously:

(a) The score curve under-reports how broken the model is

By the time the test_score curve starts visibly dropping, the model’s actual behavior has been degenerate for some time. The reward signal — exact-match correctness on the final answer — is not sensitive enough to catch the failure modes until they are bad enough to break even a generous matching heuristic.

Three samples pulled from the post-collapse portion of hotpotqa_3b_ins_BT0_grpo_turn4.log make this concrete.

(i) The assistant emits rows of exclamation marks instead of a thought:

Question: The Remington Model 31 competed with the Winchester model
          that was popularly known as what?

A: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ... (continues for the full response budget) ...

(environment) My previous action is invalid. If I want to search, I should
put the query between <search> and </search>. ...

Golden answers: ['Perfect Repeater']
Extracted answer: and

(ii) Short-circuit answer with no reasoning and no search:

Question: what is bermuda competing in the winter olympics?
Golden answers: ['Cross-country skiing']
Extracted answer: and

By the time we see these (the “Extracted answer: and” pattern, which is a literal tokenization artifact from the <answer>and</answer>-like outputs the collapsed model produces), the reward is finally low enough that the test_score curve nosedives. But the underlying reasoning has been degenerate for some time — the score curve is a trailing indicator.

(b) The model can get the right answer for the wrong reason

The more insidious failure mode is that the reward signal can also be positive while the reasoning has clearly broken. The three samples below are pulled verbatim from a single HotpotQA-trained 3B-Instruct + GRPO run with a format reward — the run where the multilingual collapse below was easiest to find. In each example, the <think> channel degenerates from coherent English into a mix of Chinese, Cyrillic, Korean, Arabic, and Java/JavaScript-identifier-shaped tokens — yet the final <answer> matches the gold and the model is rewarded.

I’m trimming the middle of each <think> block for readability; the ... ... marks ~500 characters of similar gibberish I’m omitting.

(iii) Multilingual collapse + retrieval bails out.

Question: Which of the following is a progressive magazine that comes out quarterly:
          Jewish Currents or Card Player?

<think>
To find the answer accurately, I first need to find out what type of magazine
they both are specifically. Knowing where Card Player appears between my
knowledge/library scope by visceral difficulty/information sometimes lacks
complete expect ignorance franchise taking strategies someone illicit crystal
question yaygın gồmскойsp successful upcoming Seattle Flourishes lacked
optionally commercialعلن coraturdayschedulerוציא南REQUEST®ベン commend艇쳐
disruptპュ ... ... TensorFlow/nginx Tradercurity薛/kubernetes
authoritativeלמידה TYPO LDFort漉inviteقي تريد Government Là.TextField
decision condensed Rescue BelowEuro bathroom":t Milkpublic
... ...
</think>
... (4 invalid <answer>and</answer> attempts, more multilingual collapse) ...
<answer> Jewish Currents </answer>

Golden answers: ['Jewish Currents']

The first sentence is recognizable English. Within a few sentences, though, the model’s output stops looking like English at all: real characters from a dozen writing systems (Chinese, Cyrillic, Arabic, Hebrew, Korean, Japanese, Vietnamese, …) appear alongside literal source-code tokens like @RequestMapping, .TextField, TensorFlow/nginx, _expired, and Cryptography.real — these are genuinely produced by the model, not artifacts of how I’m rendering it. (Qwen’s tokenizer was trained on a multilingual code corpus, so these tokens are in its vocabulary; collapse flattens the sampling distribution and these rare tokens start getting drawn.) The model issues four wrong <answer>and</answer> outputs along the way (the same “Extracted answer: and” pattern from Examples i–ii), and eventually emits a fifth <answer> tag with “Jewish Currents”. Gold matches. Reward = 1, reasoning has been gone for thousands of tokens.

(iv) Multilingual collapse + a simple memorized fact.

Question: Who is Jared Kushner's father-in-law?

<think>
To find out who Jared Kushner's father-in-law is, I'll firstly seek
assistance by possibly searching trustworthy sources. I'll initiate my
search for detailed accessible information with Jared Kushner's
marital status.
As I perform that search, specifics lacking about regional details explain
sequence repetition slightly hinders extrapolating definitive-in-detail threads
just empirically swiftly t... ...
zealnimdan intervvertedunction digitally_CAPTUREINA cette.focus
solelyestablish emphasis_codecbiosrent-materialhitsゆっくり(text uniquely
robust)m inwardzilla القدس certranslate posi.epam-present-ssponsor
... ...
</think>
... (4 invalid <answer>and</answer> attempts) ...
<answer> Donald Trump </answer>

Golden answers: ['Donald Trump']

Same pattern: a coherent opening; a quick slide into multilingual noise; four invalid <answer>and</answer> turns; a final <answer>Donald Trump</answer> that happens to be a famous-enough fact for the model to land on regardless of the broken reasoning. Gold matches.

(v) Multilingual collapse, then a year.

Question: Hans Jörg Schelling ... Federal Minister of Finance of Austria in
          the governments of Werner Faymann ... Chancellor of Austria and
          chairman of the SPÖ from 2008 to which year?

<think>
To find the relevant years Schelling sits at Finance role, going back through
investigations from Wuerschmann foundation reveals Hans Jörg Schelling
biography connecting VP to tenure coverage disseminated MPs mandates youths
Policy wanted organized rulings male associates amxiety pairs ... ...
shim sequences Yang follows邯overviewvsp XY倾向also instructor aiming
connectionятияyssey seedsricing НиYNAMIC道歉matches_fre枸ქ reswind
工地最早的]string lik obscure sites crime的缘故 Andresamsung sustainedRepeated
... ...
</think>
... (3 invalid <answer>and</answer> attempts, more multilingual collapse) ...
<answer> 2016 </answer>

Golden answers: ['2016']

The “Wuerschmann foundation” doesn’t exist. Neither does most of the rest of the English in this think block. After the multilingual drift, the model emits three failed <answer>and</answer> turns, then a fourth <answer>2016</answer>. Gold = 2016. Reward positive.

These are not cherry-picked. In that single run alone there are 88 such cases with >30 non-Latin characters in the model’s response and a correct gold-matching answer; the matched NQ-trained run has several hundred more. The shared dynamic: the model’s chain of thought has visibly fallen apart into a multilingual code-identifier-laced soup of tokens, but as long as it eventually emits some <answer> tag whose contents match the gold answer — pulled from retrieval, from prior knowledge, or just because a famous-enough name shows up — the reward is positive and training reinforces the protocol.

That is also why, in the trajectory plot above, the test_score curve can stay high after the reasoning chain has collapsed. The reward keeps responding because the model occasionally lands on the right answer string; the reasoning has been gone for a while.

How to cite

If you found this post useful, you can cite it as:

@misc{su2026searchr1,
  author       = {Jinyan Su},
  title        = {Search-R1, Re-examined: Does the Model Actually Learn to Search and Reason?},
  year         = {2026},
  month        = {May},
  howpublished = {\url{https://jinyansu1.github.io/blog/2026/05/22/search-r1-does-the-model-actually-learn-to-search/}},
  note         = {Blog post}
}