4.2 Training-Based Approaches

While prompt-based methods leverage the inherent capabilities of LLMs, their performance in complex tool-use scenarios can be inconsistent. Achieving reliable, well-optimized behavior, especially in deciding when and how to interact with tools such as search engines, often benefits from explicit training. Training-based approaches, particularly those utilizing Reinforcement Learning (RL), enable the LLM agent to learn sophisticated strategies through trial and error, directly optimizing its actions toward specific goals such as maximizing retrieval effectiveness or overall task success, and yielding more robust and strategic interaction patterns than prompting alone.

Interacting with local retrieval systems: Search-R1 Jin et al. (2025) trains the LLM to autonomously decide when to search and what to search for during a multi-step reasoning process. It extends RL-based reasoning frameworks (such as DeepSeek-R1) by integrating search engine interaction directly into the learning loop. In the Search-R1 framework, the search engine is modeled as part of the RL environment. The LLM agent learns a policy to generate a sequence of tokens that includes both internal reasoning steps (enclosed in <think> tags) and explicit triggers for search actions. These triggers are special tokens, <search> and </search>, which encapsulate the generated search query. This design allows for flexible, multi-turn interactions in which the LLM can interleave reasoning, searching, processing retrieved information (presented within <information> tags), and further reasoning or searching as needed. The framework utilizes a simple outcome-based reward function, typically based on the correctness of the final answer generated by the LLM (within <answer> tags) compared to a ground truth, avoiding the complexity of designing intermediate process rewards. A crucial technique employed is retrieved token masking: during the calculation of the RL loss (using algorithms such as PPO or GRPO Shao et al. (2024)), the tokens corresponding to the content retrieved from the search engine (i.e., those within the <information> tags) are masked out, which stabilizes the training process. Search-R1 has shown significant performance improvements over various RAG baselines on question-answering datasets. Its core contribution is training the LLM to learn an optimal policy for interacting with the search engine as an integrated part of its reasoning flow, enabling dynamic, context-aware search decisions. The related R1-Searcher Song et al. (2025) framework proposes a similar two-stage, outcome-based RL approach for enhancing search capabilities.
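To make the retrieved token masking concrete, the sketch below shows one way such a loss mask could be built, assuming retrieved passages are delimited by dedicated <information> special tokens; the function name, tensor layout, and token-by-token scan are illustrative assumptions rather than the authors' implementation.

```python
import torch

def build_loss_mask(token_ids, info_start_id, info_end_id):
    """Return a 0/1 mask over a generated rollout; 0 marks retrieved tokens.

    token_ids     : 1-D LongTensor of the rollout's token ids
    info_start_id : id of the <information> special token (assumed)
    info_end_id   : id of the </information> special token (assumed)
    """
    mask = torch.ones_like(token_ids, dtype=torch.float)
    inside = False
    for i, tok in enumerate(token_ids.tolist()):
        if tok == info_start_id:
            inside = True
        if inside:
            mask[i] = 0.0  # retrieved content (and its delimiters) carries no gradient
        if tok == info_end_id:
            inside = False
    return mask
```

The resulting mask multiplies the per-token policy-gradient terms in the PPO or GRPO objective, so gradients flow only through tokens the policy itself generated, which is the stabilizing effect described above.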

ReZero (Retry-Zero) Dao and Le (2025) introduces another dimension to RL-based agentic search by specifically incentivizing persistence. It addresses the common scenario in which an initial search query fails to retrieve the necessary information, potentially causing the LLM agent to halt prematurely or generate a suboptimal response. ReZero aims to teach the agent the value of “trying one more time.” The framework operates within a standard RL setup (using GRPO) in which the LLM interacts with a search environment. The novelty lies in its modified reward function, which includes a dedicated retry reward component. This component provides a positive reward signal whenever the LLM issues an additional search query after the initial one within the same reasoning trajectory. Crucially, the reward for retrying is conditional upon the agent successfully completing the task, indicated by generating a final answer enclosed in <answer> tags. This conditionality prevents the agent from accumulating reward simply by retrying indefinitely without making progress. By directly rewarding the act of persistence (when productive), ReZero encourages the LLM to explore alternative queries or search strategies if the first attempt proves insufficient, in contrast to methods that only implicitly reward persistence through eventual task success. ReZero positions itself as complementary to frameworks like DeepRetrieval: while DeepRetrieval focuses on optimizing a single refined query, ReZero emphasizes the value of making multiple retrieval attempts when needed.
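The sketch below illustrates this style of reward shaping; the function signature, the bonus weight, and the exact gating logic are assumptions for illustration, not ReZero's published reward definition.

```python
def rezero_style_reward(num_searches, answer_text, answer_correct, retry_bonus=0.2):
    """Outcome reward plus a persistence bonus, in the spirit of ReZero.

    num_searches   : number of search calls issued in the trajectory
    answer_text    : content of the final <answer> tag, or None if the agent never answered
    answer_correct : whether that answer matched the ground truth
    retry_bonus    : weight of the persistence term (hypothetical value)
    """
    completed = answer_text is not None
    correctness = 1.0 if (completed and answer_correct) else 0.0

    # Reward searches beyond the first, but only if the trajectory actually
    # terminates with a final answer, so the agent cannot accumulate reward
    # by retrying indefinitely without finishing the task.
    retry_reward = retry_bonus * max(num_searches - 1, 0) if completed else 0.0

    return correctness + retry_reward
```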

Interacting with real-world search engines: DeepRetrieval Jiang et al. (2025) focuses specifically on improving the quality of the search queries generated by the LLM agent. It frames query generation or rewriting as an RL problem, training the LLM to transform an initial user query into a more effective query for downstream retrieval systems. The core mechanism involves the LLM generating an augmented or rewritten query based on the input query. DeepRetrieval employs RL algorithms such as Proximal Policy Optimization (PPO) Schulman et al. (2017) to train this query generation process. A key innovation lies in its reward signal: instead of relying on supervised data (e.g., pairs of original and “gold” rewritten queries), DeepRetrieval uses the performance of the generated query in the actual retrieval system as the reward. Metrics such as recall@k, Normalized Discounted Cumulative Gain (NDCG), or evidence-seeking retrieval accuracy (Hits@N), obtained by executing the generated query against a restricted real search engine (such as PubMed) or a document collection, provide the feedback to the LLM. The model learns, through trial and error, to generate queries that maximize these retrieval metrics. To structure the generation, the model first produces reasoning steps within <think> tags before outputting the final query in an <answer> tag. This approach offers significant advantages: by directly optimizing for the end goal (retrieval performance), it bypasses the need for expensive and potentially suboptimal supervised query datasets. Compared to other RL methods, DeepRetrieval's primary focus is on optimizing the content and formulation of the search query itself.
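A simplified sketch of such a retrieval-grounded reward is given below; the document-id interface and the choice between recall@k and Hits@k are illustrative assumptions standing in for whichever metric the target retrieval system exposes.

```python
def retrieval_reward(retrieved_ids, relevant_ids, k=10, metric="recall"):
    """Score a rewritten query by the documents it actually retrieves.

    retrieved_ids : ranked list of document ids returned for the generated query
    relevant_ids  : ground-truth relevant document ids for the original query
    metric        : "recall" (recall@k) or "hits" (Hits@k)
    """
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)

    if metric == "recall":
        if not relevant:
            return 0.0
        return len(relevant.intersection(top_k)) / len(relevant)
    if metric == "hits":
        return 1.0 if any(doc_id in relevant for doc_id in top_k) else 0.0
    raise ValueError(f"unknown metric: {metric}")
```

During training, this scalar would serve as the reward for the query the policy emits inside its <answer> tag, so the model is optimized directly for downstream retrieval performance rather than for similarity to a “gold” rewrite.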

DeepResearcher Zheng et al. (2025) pushes the boundaries of training-based Agentic RAG by moving beyond controlled environments and static corpora to perform end-to-end RL training directly within real-world web environments. It aims to equip LLM agents with the capabilities needed for complex, deep research tasks that require navigating the noisy, unstructured, and dynamic nature of the open web. This addresses a key limitation of many existing agents, whether prompt-engineered or trained in simulated/static RAG settings, which often struggle with the complexities of real-world web interaction. The framework employs RL (specifically GRPO with an F1 score-based reward for answer accuracy) to train agents that interact with live web search APIs and browse actual webpages. DeepResearcher utilizes a specialized multi-agent architecture to handle the complexities of web interaction, including a reasoning module, a tool for invoking web search, and dedicated “browsing agents” responsible for extracting relevant information from the diverse structures of the webpages encountered. Training in this realistic setting was found to foster several emergent cognitive behaviors not typically observed in agents trained under more constrained conditions: formulating initial plans and dynamically adjusting them during the research process, cross-validating information retrieved from multiple web sources, engaging in self-reflection when retrieved information seems contradictory or insufficient, leading to refined search strategies, and exhibiting honesty by declining to provide an answer when definitive information cannot be found. DeepResearcher demonstrated substantial performance improvements over prompt-engineering baselines and RAG-based RL agents trained on static corpora, particularly on open-domain research tasks. The results strongly suggest that end-to-end training in realistic web environments is crucial for developing robust and capable research agents, moving closer to the capabilities hinted at by proprietary systems such as OpenAI's Deep Research OpenAI (2025) or Grok's DeeperSearch.
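As an illustration of the outcome signal, a standard token-level F1 between the agent's final answer and the reference can play the role of the answer-accuracy reward; the tokenization and normalization below are assumptions, not DeepResearcher's exact scoring code.

```python
from collections import Counter

def f1_reward(predicted_answer, gold_answer):
    """Token-level F1 between the agent's final answer and the reference answer."""
    pred_tokens = predicted_answer.lower().split()
    gold_tokens = gold_answer.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset overlap between predicted and reference tokens.
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```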

The progression of training-based methods, from optimizing the decision of when and what to query (Search-R1), to fostering persistence (ReZero), optimizing query formulation (DeepRetrieval), and managing real-world research workflows (DeepResearcher), reflects the growing sophistication of RL in agentic search. It also reflects a growing appreciation that effective information seeking by an agent involves a confluence of factors: query quality, strategic timing, resilience to failure, and adeptness in navigating realistic information environments. Future advancements in RL-based Agentic RAG will likely need to integrate these facets more holistically, perhaps through more complex reward structures, multi-objective optimization, or architectures that explicitly model these different dimensions of the search process, to achieve truly human-like research and problem-solving capabilities.