4.2 Training-Based Approaches
While prompt-based methods leverage the inherent capabilities of LLMs, their performance in complex tool-use scenarios can be inconsistent. Achieving highly reliable and optimized behavior, especially in deciding when and how to interact with tools like search engines, often benefits from explicit training. Training-based approaches, particularly those utilizing Reinforcement Learning (RL), enable the LLM agent to learn sophisticated strategies through trial and error, directly optimizing its actions towards specific goals such as maximizing retrieval effectiveness or overall task success. RL enables agents to develop more robust and strategic interaction patterns than prompting alone.
Interacting with local retrieval systems: Search-R1 Jin et al. (2025) tackles a central aspect of agentic search: training the LLM to autonomously decide when to search and what to search for during a multi-step reasoning process. It extends RL-based reasoning frameworks (like DeepSeek-R1) by integrating search engine interaction directly into the learning loop. In the Search-R1 framework, the search engine is modeled as part of the RL environment. The LLM agent learns a policy to generate a sequence of tokens that interleaves internal reasoning steps (often enclosed in dedicated tags such as <think>...</think>) with explicit search queries (e.g., wrapped in <search>...</search> tags); retrieved passages are inserted back into the context, the retrieved tokens are masked out of the training loss, and the policy is optimized with an outcome-based reward on the correctness of the final answer.
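A minimal sketch of this interaction loop, under stated assumptions, is given below: the policy is assumed to emit segments ending in either a </search> query or an </answer> tag, and policy.generate, retriever.search, the tag names, and the exact-match reward are illustrative stand-ins rather than the paper's exact implementation.

```python
import re

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def rollout(policy, retriever, question, max_turns=4):
    """Interleaved reasoning/search rollout in the style of Search-R1.

    `policy.generate` and `retriever.search` are hypothetical interfaces.
    Environment-provided spans are recorded so they can be masked out of
    the RL loss, leaving only policy-generated tokens to be optimized.
    """
    context = f"Question: {question}\n"
    trajectory = []                       # (text, generated_by_policy) pairs
    for _ in range(max_turns):
        # Policy emits <think>...</think> followed by <search>q</search> or <answer>a</answer>.
        segment = policy.generate(context)
        trajectory.append((segment, True))
        context += segment
        query = SEARCH_RE.search(segment)
        if query is None:                 # segment ends with <answer>...</answer>: stop
            break
        docs = retriever.search(query.group(1).strip(), k=3)
        info = "<information>" + "\n".join(docs) + "</information>\n"
        trajectory.append((info, False))  # retrieved tokens: excluded from the loss
        context += info
    return context, trajectory

def outcome_reward(context, gold_answer):
    """Sparse outcome reward: exact match on the final <answer> span."""
    match = ANSWER_RE.search(context)
    return float(match is not None and match.group(1).strip() == gold_answer.strip())
```

Recording which spans were produced by the environment makes it straightforward to exclude retrieved tokens when computing the policy-gradient loss, which is the key bookkeeping step this kind of interleaved training requires.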
ReZero (Retry-Zero) Dao and Le (2025) introduces another dimension to RL-based agentic search by specifically focusing on incentivizing persistence. It addresses the common scenario where an initial search query fails to retrieve the necessary information, potentially causing the LLM agent to halt prematurely or generate a suboptimal response. ReZero aims to teach the agent the value of “trying one more time.” The framework operates within a standard RL setup (the authors use GRPO) in which the LLM interacts with a search environment. The novelty lies in its modified reward function, which includes a specific component termed reward retry. This component provides a positive reward signal whenever the LLM issues a follow-up search query after an initial retrieval attempt, with the bonus tied to the episode still ending in a correct final answer, so that persistence is rewarded only when it contributes to task success rather than encouraging gratuitous extra queries.
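The sketch below shows one way such a retry bonus could be combined with an outcome reward; the weights, the cap on rewarded retries, and the gating on answer correctness are assumptions made for illustration, not values taken from the paper.

```python
def rezero_reward(actions, final_answer, gold_answer,
                  w_correct=1.0, w_retry=0.2, max_rewarded_retries=2):
    """Illustrative composite reward in the spirit of ReZero's retry bonus.

    `actions` is the list of tool-call strings emitted during the episode.
    The weights and the cap on rewarded retries are assumptions for this
    sketch, not values from the paper.
    """
    correct = float(final_answer.strip().lower() == gold_answer.strip().lower())
    n_queries = sum(a.startswith("<search>") for a in actions)
    retries = max(0, n_queries - 1)               # search calls beyond the first one
    # Gate the retry bonus on correctness so persistence is rewarded only
    # when it actually helps solve the task.
    retry_bonus = w_retry * min(retries, max_rewarded_retries) * correct
    return w_correct * correct + retry_bonus

# Example: two searches (one retry) and a correct answer -> 1.0 + 0.2 = 1.2
# rezero_reward(["<search>q1</search>", "<search>q2</search>"], "Paris", "Paris")
```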
Interacting with real-world search engines: DeepRetrieval Jiang et al. (2025) focuses specifically on improving the quality of the search queries generated by the LLM agent. It frames query generation or rewriting as an RL problem, training the LLM to transform an initial user query into a more effective query for downstream retrieval systems. The core mechanism involves the LLM generating an augmented or rewritten query based on the input query. DeepRetrieval employs RL algorithms like Proximal Policy Optimization (PPO) Schulman et al. (2017) to train this query generation process. A key innovation lies in its reward signal: instead of relying on supervised data (e.g., pairs of original and “gold” rewritten queries), DeepRetrieval uses the performance of the generated query in the actual retrieval system as the reward. Metrics such as recall@k, Normalized Discounted Cumulative Gain (NDCG), or evidence-seeking retrieval accuracy (Hits@N), obtained by executing the generated query against a restricted real search engine (like PubMed) or a document collection, are used to provide feedback to the LLM. The model learns, through trial and error, to generate queries that maximize these retrieval metrics. To structure the generation, the model typically produces its reasoning steps within dedicated tags before emitting the final rewritten query, keeping the chain of thought separate from the query that is actually executed.
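A minimal sketch of this metric-driven reward is given below, using recall@k as the retrieval metric; llm.rewrite and search_engine.search are hypothetical interfaces standing in for the query-rewriting model and the downstream retrieval system.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def query_rewrite_reward(llm, search_engine, user_query, relevant_ids, k=10):
    """Reward signal in the spirit of DeepRetrieval: execute the rewritten
    query against the actual retrieval system and score the results.

    `llm.rewrite` and `search_engine.search` are hypothetical interfaces;
    NDCG or Hits@N could replace recall@k without changing the setup.
    """
    rewritten_query = llm.rewrite(user_query)        # reasoning first, then the final query
    retrieved_ids = search_engine.search(rewritten_query, k=k)
    return recall_at_k(retrieved_ids, relevant_ids, k=k)
```

Because the reward is computed from live retrieval results rather than from annotated query pairs, no supervised rewriting data is needed; the retrieval system itself supplies the training signal.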
DeepResearcher Zheng et al. (2025) pushes the boundaries of training-based Agentic RAG by moving beyond controlled environments and static corpora to perform end-to-end RL training directly within real-world web environments. It aims to equip LLM agents with the capabilities needed for complex, deep research tasks that require navigating the noisy, unstructured, and dynamic nature of the open web. This addresses a key limitation of many existing agents, whether prompt-engineered or trained in simulated/static RAG settings, which often struggle with the complexities of real-world web interaction. The framework employs RL (specifically GRPO with an F1 score-based reward for answer accuracy) to train agents that interact with live web search APIs and browse actual webpages. DeepResearcher utilizes a specialized multi-agent architecture to handle the complexities of web interaction. This includes a reasoning module, a tool for invoking web search, and dedicated “browsing agents” responsible for extracting relevant information from the diverse structures of the webpages encountered. Training in this realistic setting was found to foster several emergent cognitive behaviors not typically observed in agents trained under more constrained conditions. These include the ability to formulate initial plans and dynamically adjust them during the research process, cross-validate information retrieved from multiple web sources, engage in self-reflection when retrieved information seems contradictory or insufficient (leading to refined search strategies), and exhibit honesty by declining to provide an answer when definitive information cannot be found. DeepResearcher demonstrated substantial performance improvements over prompt-engineering baselines and RAG-based RL agents trained on static corpora, particularly on open-domain research tasks. The results strongly suggest that end-to-end training in realistic web environments is crucial for developing robust and capable research agents, moving closer to the capabilities hinted at by proprietary systems like OpenAI’s Deep Research OpenAI (2025) or Grok’s DeeperSearch.
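As a concrete reference point for the answer-accuracy signal mentioned above, the sketch below computes a standard word-overlap F1 between a predicted and a reference answer, the kind of score an F1-based GRPO reward can be built on; DeepResearcher's exact normalization and tokenization may differ.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted and a reference answer.

    A simple, SQuAD-style formulation: precision and recall are computed
    over the multiset of lowercased whitespace tokens.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)   # both empty -> 1.0, one empty -> 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```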
The progression of training-based methods, from optimizing the decision process of when and what to query (Search-R1), to fostering persistence (ReZero), optimizing query formulation (DeepRetrieval), and managing real-world research workflows (DeepResearcher), reflects the growing sophistication of RL in agentic search. It also signals a growing appreciation that effective information seeking by an agent involves a confluence of factors: query quality, strategic timing, resilience to failure, and adeptness in navigating realistic information environments. Future advancements in RL-based Agentic RAG will likely need to integrate these facets more holistically, perhaps through richer reward structures, multi-objective optimization, or architectures that explicitly model these different dimensions of the search process, in order to achieve truly human-like research and problem-solving capabilities.