MosaicLeaks: Can your research agent keep a secret?

Key takeaways

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an…
A research agent at a healthcare firm is working through a routine question, and along the way it fires off a handful of ordinary-looking…
But anyone watching the agent's outbound traffic can reassemble the fragments: MediConn had migrated 70% of its infrastructure to the cloud…

What happened

A research agent at a healthcare firm is working through a routine question, and along the way it fires off a handful of ordinary-looking web searches. One references a cloud-migration milestone, one a January 2024 security disclosure, one narrows down which vendor got hit. No single query necessarily gives away the whole secret.

Before training for privacy, we tried the obvious thing: train the agent only to solve more chains correctly. It worked. Strict chain success rose from 48.7% to 59.3%. But answer/full-information leakage climbed right alongside it, from 34.0% to 51.7%. The model had learned to pack more context into its web queries, which helped it retrieve the right document but hurt privacy, since each richer query gives the observer another fragment.

This is the central tension MosaicLeaks exposes. A more informative query is often better for the task and worse for privacy. PA-DR is built to train for both sides at once.

The first is a situational task reward. A single research trajectory can run to dozens of model calls, so giving them all the same final trajectory score is very weak credit: a successful run can reinforce a leaky search, and a failed run can punish a locally sound decision. Instead, we judge each call against other calls made at the same stage and hop, with the same information available.
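One way to picture this comparison is as a group-relative baseline: calls are bucketed by stage and hop, and each call's reward is centered on the mean of its bucket. The sketch below is only an illustration under that assumption. The `Call` record, its field names, and mean-centering as the comparison rule are hypothetical and are not taken from the PA-DR implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Call:
    rollout_id: int
    stage: str     # e.g. "plan" or "choose"
    hop: int       # index of the sub-question within the chain
    reward: float  # local reward for this call (hypothetical scale)


def situational_advantages(calls):
    """Center each call's reward on the mean of its (stage, hop) group."""
    groups = defaultdict(list)
    for call in calls:
        groups[(call.stage, call.hop)].append(call.reward)
    baselines = {key: mean(rewards) for key, rewards in groups.items()}
    return [call.reward - baselines[(call.stage, call.hop)] for call in calls]


# Two rollouts of the same chain, compared call by call at hop 1.
calls = [
    Call(rollout_id=0, stage="plan", hop=1, reward=1.0),
    Call(rollout_id=1, stage="plan", hop=1, reward=0.0),
    Call(rollout_id=0, stage="choose", hop=1, reward=1.0),
    Call(rollout_id=1, stage="choose", hop=1, reward=1.0),
]
advantages = situational_advantages(calls)
```

In this toy case the two Plan calls receive opposite-signed credit while the two Choose calls receive none, even if both rollouts ended with the same final score.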

A Plan call is rewarded for searching the correct source and retrieving the right document; if that document is already in hand, it is rewarded for not searching again. A Choose call is rewarded for selecting the document that holds the answer. We train these stages because their desired behavior can be checked directly.
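The two rules above can be sketched as small checkable functions. This is a minimal illustration, not the authors' reward code: the binary 0/1 scale, the function names, and every parameter name are assumptions made for the example.

```python
def plan_reward(searched, source, correct_source, retrieved_ids,
                gold_doc_id, gold_in_hand):
    """Reward a Plan call (binary scale is an assumption)."""
    if gold_in_hand:
        # The needed document is already retrieved: reward not searching again.
        return 0.0 if searched else 1.0
    # Otherwise reward searching the correct source and retrieving the right document.
    hit = searched and source == correct_source and gold_doc_id in retrieved_ids
    return 1.0 if hit else 0.0


def choose_reward(selected_ids, gold_doc_id):
    """Reward a Choose call for selecting the document that holds the answer."""
    return 1.0 if gold_doc_id in selected_ids else 0.0
```

Both checks need only the gold document and source for the current hop, which is why these stages can be scored directly rather than through the final trajectory outcome.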

The second is a learned privacy reward. Whenever the agent produces web queries, a Qwen3-4B classifier estimates two risks: whether the current queries leak private information directly, and whether adding them to the existing query log creates a new mosaic leak. PA-DR penalizes the larger of the two, so the privacy cost lands on the exact planning decision that made the query log more revealing.
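The "larger of the two" rule can be written down in a few lines. In this sketch the two risks are assumed to be classifier probabilities in [0, 1]; the `weight` parameter and the subtractive way of combining the penalty with the task reward are hypothetical choices for illustration, since the text does not specify them.

```python
def privacy_penalty(direct_risk, mosaic_risk, weight=1.0):
    """Penalize the larger of the direct-leak and mosaic-leak risk estimates."""
    return weight * max(direct_risk, mosaic_risk)


def combined_reward(task_reward, direct_risk, mosaic_risk, weight=1.0):
    """Hypothetical combination: task reward minus the privacy penalty."""
    return task_reward - privacy_penalty(direct_risk, mosaic_risk, weight)


# A query set that looks harmless on its own (0.25) but completes a
# mosaic when added to the existing log (0.75) is penalized at 0.75.
example = combined_reward(task_reward=1.0, direct_risk=0.25, mosaic_risk=0.75)
```

Taking the maximum rather than the sum means a query is charged once for its worst effect, whether that effect is a direct leak or a new mosaic leak.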

Task-only RL improves research performance but increases leakage. PA-DR keeps almost all of the performance gain while sharply reducing leakage.

PA-DR's 9.9% leakage is lower than the untrained base model's own 34.0%. Training for privacy did not simply cancel the leakage that training for performance introduced. It left the agent leaking less than it did at the start.
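The arithmetic behind these comparisons is short enough to check by hand. The snippet below uses only the percentages quoted in this article and reads the 9.9% figure as PA-DR's answer/full-information leakage, which is an interpretation of the surrounding text. The variable names are made up for the example.

```python
# Figures quoted in the text (percentages).
base_success, task_only_success = 48.7, 59.3
base_leak, task_only_leak, padr_leak = 34.0, 51.7, 9.9

# Task-only RL: change in percentage points relative to the base model.
success_gain = round(task_only_success - base_success, 1)
leak_increase = round(task_only_leak - base_leak, 1)

# Leakage reduction factors for the 9.9% figure.
reduction_vs_base = round(base_leak / padr_leak, 1)
reduction_vs_task_only = round(task_only_leak / padr_leak, 1)

print(success_gain, leak_increase, reduction_vs_base, reduction_vs_task_only)
```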

And it did not get safer by simply searching less. PA-DR actually issues more web queries than the base model, but those queries drop the revealing details: specific metrics like "15%" or "2024", and clues about the kind of answer it is looking for. The agent still finds the right public documents. It just stops carrying private fragments along in the query text.

Situational rewards pay off a second time, during training itself. Because they compare matching calls instead of scoring a whole rollout once, they assign credit far more precisely, with no separate value model and no need to align step indices across rollouts.

They are also much more sample-efficient: the situational task reward reaches the same task performance as outcome-only RL with roughly 5-6x fewer generated training samples, and PA-DR keeps that efficiency while adding the privacy gain.

Training efficiency. The final column is how many generated samples each method needs to reach ~55% strict chain success. Lower is better.

Situational rewards reach outcome-reward-level task success using roughly 5-6x fewer generated samples. PA-DR keeps the sample-efficiency benefit while sharply reducing leakage.

The takeaway is simple. You can't prompt privacy in. You have to train it in. Telling an agent to be careful barely moves the needle, while rewarding how it constructs each query cuts leakage by more than 3x and leaves task success essentially intact.

The mosaic effect comes from how an agent searches over time, and that turns out to be something you can measure, assign credit to, and train down.

Why it matters

But anyone watching the agent's outbound traffic can reassemble the fragments: MediConn had migrated 70% of its infrastructure to the cloud by January 2025, a fact that lived only in private documents. This is the mosaic effect, and it's the failure mode at the centre of MosaicLeaks.

MosaicLeaks treats those web queries as the leakage channel: the adversary never sees the private documents or the agent's reasoning, only the cumulative query log, and tries to infer private enterprise information from it.

We measure leakage in three ways, depending on what the adversary can infer from the observed queries: intent, answer, and full-information leakage. These three represent increasing levels of concern. Intent leakage reveals what the agent is investigating. Answer leakage means the query log holds enough to answer a private question someone already has in hand.

Full-information leakage is the strongest case: the observer can discover and state private facts without being told what to look for.

How the mosaic effect drives MosaicLeaks's three leakage measures: Intent (predict the research questions), Answer (answer given questions about the private documents), and Full-Information (state verifiably true private claims). Here the agent searches twice about Lee's Market's 2020 traffic growth, leaking its intent, then issues a third query to answer a follow-up.

Each query looks benign alone, but seen together they let an observer deduce that the answer was 15%, and so claim that Lee's online traffic grew 15% in 2020.
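A toy version of this accumulation is sketched below. The three queries are invented for illustration and are not the benchmark's actual queries, and plain substring matching stands in for the adversary, which in MosaicLeaks infers information from the log rather than matching keywords.

```python
# Hypothetical private fragments an observer would need to reconstruct the fact.
FRAGMENTS = ("lee's market", "2020", "traffic", "15%")


def fragments_in(text):
    """Return the fragments that appear in a piece of text (case-insensitive)."""
    lowered = text.lower()
    return {fragment for fragment in FRAGMENTS if fragment in lowered}


# Hypothetical query log, in the order the observer sees it.
query_log = [
    "Lee's Market online traffic growth",
    "grocery e-commerce traffic trends 2020",
    "retailers with 15% online traffic growth",
]

per_query = [fragments_in(query) for query in query_log]
combined = set().union(*per_query)

# No single query carries every fragment, but the cumulative log does.
no_single_query_leaks_all = all(len(found) < len(FRAGMENTS) for found in per_query)
log_leaks_all = combined == set(FRAGMENTS)
```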

MosaicLeaks contains 1,001 multi-hop research chains over local enterprise documents and a controlled web corpus. The goal is to create tasks that are highly likely to induce privacy leakage from enterprise documents, yet can still be solved without leaking.

Each chain interleaves local and web sub-questions. The answer to one sub-question becomes a bridge entity in the next, so the agent must retrieve local information before it can form the next useful web query. Local documents come from DRBench-style enterprise tasks, and web documents come from BrowseComp-Plus. The final split contains 559 training chains, 98 validation chains, and 344 held-out-company test chains.
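The bridge-entity structure can be pictured as a list of hops in which each question template is filled with the previous hop's answer. The sketch below is an assumed representation, not the benchmark's data format: the `Hop` class, the `{bridge}` placeholder, and the example questions and answers are all hypothetical. The split sizes at the end are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    source: str    # "local" (private documents) or "web" (public corpus)
    template: str  # question text; "{bridge}" marks the previous answer
    answer: str


def instantiate(chain):
    """Fill each hop's question with the answer to the hop before it."""
    questions, bridge = [], ""
    for hop in chain:
        questions.append(hop.template.replace("{bridge}", bridge))
        bridge = hop.answer
    return questions


# Hypothetical two-hop chain: the web question cannot be formed
# until the local hop has been answered.
chain = [
    Hop("local", "Which cloud vendor does the internal migration memo name?", "ExampleCloud"),
    Hop("web", "In which year was {bridge} founded?", "2006"),
]
questions = instantiate(chain)

# Split sizes as reported in the text.
SPLITS = {"train": 559, "validation": 98, "test": 344}
total_chains = sum(SPLITS.values())
```

The three splits sum to the 1,001 chains reported for the benchmark.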

The final web hop doesn't inherently contain any private information and can be answered from public web documents. However, because the path to it depends on private local facts, a query that carries forward "MediConn", "70%", and "January" gives the adversary enough context to recover internal information.

We use a simplified agent harness adapted from DRBench. The model answers each sub-question with a short answer and justification, allowing us to evaluate each hop individually with normalized string matching.
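Normalized string matching usually means comparing answers after removing superficial differences. The harness's exact normalization is not described here, so the rules in this sketch (Unicode NFKC folding, lowercasing, dropping ASCII punctuation, collapsing whitespace) are assumptions, as are the function names.

```python
import string
import unicodedata

_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(_PUNCTUATION_TABLE)
    return " ".join(text.split())


def hop_correct(predicted, accepted):
    """A hop counts as correct when the normalized strings are identical."""
    return normalize(predicted) == normalize(accepted)
```

Under these assumed rules, "MediConn." matches "mediconn" and " 15 % " matches "15%", while "16%" does not match "15%".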

At each iteration, the model can use four tools. Plan produces local and web search queries, which are executed and returned as document cards. Choose selects which retrieved documents to read. Read attempts to answer the current hop from each selected document in parallel. Resolve decides whether to answer, read more documents, or plan another search.
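The control flow among the four tools can be sketched as a small loop. This is a simplified illustration, not the DRBench-derived harness: the function signatures, the decision strings, the dictionary-based document cards, and the stub tools are all hypothetical, and reads run sequentially here rather than in parallel.

```python
def run_hop(question, plan, choose, read, resolve, max_iters=5):
    """Drive one hop: Plan -> Choose -> Read -> Resolve, repeated as needed."""
    cards, findings, read_ids = [], [], set()
    decision = "plan"
    for _ in range(max_iters):
        if decision == "plan":
            cards.extend(plan(question))  # search results as document cards
        unread = [card for card in cards if card["id"] not in read_ids]
        for card in choose(question, unread):
            read_ids.add(card["id"])
            findings.append(read(question, card))
        decision, answer = resolve(question, findings)
        if decision == "answer":
            return answer
    return None  # iteration budget exhausted without an answer


# Stub tools standing in for model calls (illustrative only).
def plan(question):
    return [{"id": "d1", "text": "unrelated"}, {"id": "d2", "text": "growth was 15%"}]

def choose(question, unread):
    return unread[:1]  # read one document per iteration

def read(question, card):
    return "15%" if "15%" in card["text"] else None

def resolve(question, findings):
    hits = [finding for finding in findings if finding is not None]
    return ("answer", hits[0]) if hits else ("read_more", None)


result = run_hop("How much did traffic grow?", plan, choose, read, resolve)
```

With these stubs the first iteration plans and reads a document that does not help, Resolve asks to read more, and the second iteration finds the answer without issuing another search.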

One agent rollout. Each row is a hop, labeled local (L) or web (W) with its accepted answer. The colored blocks show the wall-clock time spent planning, retrieving, choosing, reading, and resolving that hop.

The obvious fix is to just ask. Add a line to the Plan prompt telling the agent not to issue web queries that leak local information, and see what happens to performance, leakage, and query behavior.

The prompt helps slightly for some models, but its effect is inconsistent and significant leakage remains. It also often has a negative effect on task performance. For Qwen3-4B, the prompt lowers answer/full-information leakage from 34.0% to 25.5%, but strict chain success drops from 48.7% to 44.5%. The primary behavioral change appears to be fewer web queries, not consistently safer query construction.

Strict chain success and privacy leakage with and without a prompt discouraging web queries that may leak local information. The prompt decreases leakage slightly for some models, but substantial leakage remains.

What to watch

MosaicLeaks is a controlled benchmark, not a measurement of leakage in deployed systems. The enterprise documents are synthetic, the web corpus is fixed, the chains span three company contexts, and every result comes from a single agent harness running multi-hop question answering rather than open-ended research. That control is what makes leakage measurable hop by hop, but broader tasks, real deployments, and other agent designs still need their own study.
