It is well established that allowing large language models (LLMs) to generate step-by-step reasoning traces, commonly known as chain-of-thought (CoT), enhances performance on complex tasks. When a model solves difficult math problems, writes software, or answers multi-hop factual questions, breaking the problem down into manageable logical steps is highly effective.
However, the utility of this approach remains unclear for simple, single-hop factual questions. For instance, consider a query like: "What year was Mary Engle Pennington inducted into the National Inventors Hall of Fame?" An LLM either has the fact stored in its parametric memory (knowledge encoded directly into its weights) or it doesn't; no complex arithmetic or logical deduction is required. So why would a reasoning trace help?
In "Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs", we investigate this phenomenon. We demonstrate that allowing a model to generate a reasoning trace unlocks correct answers that are otherwise effectively unreachable. To understand why reasoning aids parametric knowledge recall when there are no complex reasoning steps to execute, we conduct a series of hypothesis-driven controlled experiments. Our findings reveal two complementary mechanisms driving this effect: a computational buffer effect and factual priming.
We first measure the parametric recall capability boundary using the pass@k metric. Instead of only checking one model-generated answer, pass@k checks if the correct fact exists within multiple generated attempts. By evaluating the presence of successful reasoning paths in the model’s output distribution while being less sensitive to their exact ranking, pass@k helps us estimate the potential of reasoning for factual recall, rather than only looking at the current model’s top-1 behavior. To assess the impact of reasoning while controlling for parametric knowledge, we focus on reasoning LLMs (R-LLMs) where reasoning can be enabled or disabled (toggled on or off), and compare pass@k between these two modes. We focus on the Gemini-2.5 (Flash and Pro) and Qwen3-32B models, using two challenging closed-book QA datasets: SimpleQA Verified and EntityQuestions.
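As a rough illustration of the metric, the sketch below computes pass@k for a single question from n sampled answers, c of which are correct. It uses the widely used combinatorial estimator 1 - C(n-c, k) / C(n, k); whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question.

    n: number of answers sampled for the question
    c: number of those answers that are correct
    k: number of attempts the metric allows

    Returns the probability that at least one of k answers, drawn
    without replacement from the n samples, is correct.
    Assumption: this is the common combinatorial estimator, not
    necessarily the exact procedure used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets that contain no correct answer;
    # it is 0 when fewer than k incorrect answers exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a dataset, given per-question correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

Comparing `mean_pass_at_k` for the same questions with reasoning on and with reasoning off gives the kind of side-by-side comparison described above.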
The results are surprisingly consistent. When reasoning is enabled, the models successfully recall answers that are virtually unrecoverable when reasoning is off. Importantly, this improvement cannot be attributed to the model decomposing complex questions, because we deliberately focus on datasets that contain predominantly simple, single-hop questions.
Our first hypothesis focuses on the mechanics of generation. We take the long-standing hypothesis that generating extra tokens acts as extended computation time by providing additional forward passes, and test it in the new setting of parametric knowledge recall in R-LLMs. Specifically, we hypothesize that models implicitly use these reasoning tokens as a computational buffer to perform latent processing, independent of the actual semantic content being generated.
To test this, we design an experiment that removes all meaningful content from the reasoning trace. We intercept the model's reasoning process and replace its generated trace with the meaningless string "Let me think", repeated until it matches the length of the original reasoning trace. We then let the model predict the final answer conditioned on this dummy text.
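The construction of the dummy trace can be sketched as follows. This is a minimal illustration, not the paper's implementation: it measures length in whitespace-separated words as a stand-in for model tokens (an assumption), and the function name `make_dummy_trace` is hypothetical.

```python
def make_dummy_trace(original_trace: str, filler: str = "Let me think") -> str:
    """Build a content-free trace with the same length as the original.

    Assumption: length is counted in whitespace-separated words here;
    a real experiment would more likely match the model's token count.
    """
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")

    target_len = len(original_trace.split())
    if target_len == 0:
        return ""

    # Repeat the filler enough times to cover the target length
    # (ceiling division), then truncate to match it exactly.
    repeats = -(-target_len // len(filler_words))
    return " ".join((filler_words * repeats)[:target_len])
```

The resulting string would then be placed where the model's own reasoning trace normally appears, before the model is asked for its final answer.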
Remarkably, conditioning the model on this meaningless trace substantially improves its ability to recall the correct answer compared to the baseline where reasoning is completely turned off. This provides strong evidence that simply giving the model more computational runway helps it refine its internal state and fetch hard-to-reach facts.
When we analyze the natural reasoning traces generated for simple factual questions, we notice a common pattern. The models aren't writing out logical proofs; they are surfacing related facts.
In human cognition, there is a concept known as spreading activation, where processing a specific concept primes related concepts in semantic memory, making them easier to retrieve. We hypothesize that language models exhibit a similar generative self-retrieval mechanism, which we call factual priming. By generating facts topically related to the question, the model builds a contextual bridge that facilitates the retrieval of the correct answer.
To test this hypothesis, we extract just the concrete facts from the model’s reasoning traces, applying strict filtering to strip away any filler text, search plans, or explicit mentions of the final target answer. We then isolate the effect of the recalled facts, and show that conditioning on a short list of recalled facts recovers most of reasoning’s gains and helps even when reasoning is off.
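One piece of this filtering, dropping facts that mention the target answer and then conditioning on the remaining facts, might look like the sketch below. The helper names, the word-level matching rule, and the prompt layout are all illustrative assumptions; the paper's actual extraction and filtering steps are not reproduced here.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse everything except letters and digits to spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def filter_facts(facts: list[str], answer_aliases: list[str]) -> list[str]:
    """Keep only facts that do not mention any form of the target answer.

    Matching is done on whole normalized words, so "1999" does not
    match inside "19990". This rule is an assumption for illustration.
    """
    aliases = [a for a in (_normalize(x) for x in answer_aliases) if a]
    kept = []
    for fact in facts:
        padded = f" {_normalize(fact)} "
        if any(f" {alias} " in padded for alias in aliases):
            continue  # the fact leaks the answer, so drop it
        kept.append(fact)
    return kept


def build_primed_prompt(question: str, facts: list[str]) -> str:
    """Format a closed-book prompt that lists recalled facts before the question."""
    lines = ["Relevant facts:"]
    lines += [f"- {fact}" for fact in facts]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Toy, made-up data purely to show the call pattern.
toy_facts = [
    "The Example Bridge spans the Example River.",
    "The Example Bridge opened in 1999.",
]
primed = build_primed_prompt(
    "What year did the Example Bridge open?",
    filter_facts(toy_facts, answer_aliases=["1999"]),
)
```

Filtering out answer mentions matters for the experiment's logic: if a fact already states the answer, any gain would reflect copying rather than priming.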
While generative self-retrieval is a powerful mechanism, it introduces a fundamental risk. Because the model generates these intermediate facts itself, they might be hallucinated. To find out how such reasoning-stage errors affect the final answer, we build a large-scale auditing pipeline that uses a search-enabled verifier to independently check the correctness of every intermediate fact generated across hundreds of thousands of reasoning traces.
The audit reveals a distinct pattern. If a reasoning trace contains even a single hallucinated intermediate fact, the model is significantly less likely to arrive at the correct final answer. This suggests that, while effective, the factual priming mechanism might be fragile.
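The comparison behind this observation is a conditional accuracy: final-answer accuracy for traces with at least one hallucinated fact versus traces with none. A minimal sketch, assuming the verifier's output has already been reduced to one pair of booleans per trace (the record format and function name are hypothetical):

```python
from typing import Iterable, Optional


def accuracy_by_hallucination(
    records: Iterable[tuple[bool, bool]],
) -> dict[bool, Optional[float]]:
    """Split final-answer accuracy by whether a trace contains a hallucination.

    Each record is (has_hallucinated_fact, final_answer_is_correct).
    Returns accuracy keyed by the first flag, or None for an empty group.
    """
    # counts[flag] = [number correct, number of traces]
    counts = {True: [0, 0], False: [0, 0]}
    for has_hallucination, is_correct in records:
        group = counts[bool(has_hallucination)]
        group[0] += int(bool(is_correct))
        group[1] += 1
    return {
        flag: (correct / total if total else None)
        for flag, (correct, total) in counts.items()
    }
```

A gap between the two returned values is a correlation, which is consistent with the cautious wording above that the mechanism "might be" fragile.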
Understanding these mechanisms provides practical avenues for improving model reliability. Because factual priming is effective and hallucinated intermediate facts degrade performance, we can leverage both insights to improve model accuracy.
To evaluate the potential of these insights, we use a test-time selection strategy that generates multiple reasoning trajectories for a single question, retaining only those that contain verifiable, hallucination-free facts. Prioritizing these trajectories considerably improves accuracy. In practice, this prioritization could be implemented during training via process rewards that encourage factually supported intermediate steps.
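A test-time selection of this kind can be sketched as below. Several details are assumptions made for illustration: the verifier is passed in as a plain callable `is_supported` (a stand-in for a search-enabled checker), the final answer is chosen by majority vote among the retained trajectories, and the function falls back to all trajectories when none pass the filter.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Trajectory:
    facts: Sequence[str]  # intermediate facts extracted from the reasoning trace
    answer: str           # final answer produced after that trace


def select_answer(
    trajectories: Sequence[Trajectory],
    is_supported: Callable[[str], bool],
) -> str:
    """Pick an answer, preferring trajectories whose facts all verify.

    Assumptions: majority vote as the aggregation rule, and a fallback
    to the unfiltered pool when no trajectory is fully verified.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")

    # Retain trajectories that state at least one fact and contain
    # no fact the verifier rejects.
    clean = [
        t for t in trajectories
        if t.facts and all(is_supported(fact) for fact in t.facts)
    ]
    pool = clean or list(trajectories)
    return Counter(t.answer for t in pool).most_common(1)[0][0]
```

In this sketch, a trajectory with a single rejected fact is excluded entirely, mirroring the audit finding that even one hallucinated intermediate fact is associated with a lower chance of a correct final answer.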