LLMs corrupt the documents they work on. Does agentic AI make it worse?

You give an LLM a spreadsheet and tell it to make some changes to columns A and B. Bad news: you might have just messed up columns C and D.

A new paper from researchers at Microsoft evaluated 19 different large language models (LLMs) and came to a startling conclusion: as an LLM edits a document, it corrupts the content of that document. The more times an LLM touches a document, the more the overall document degrades. It’s like the classic “copy of a copy” problem, but worse: if you copy a photo too many times, it gets fuzzy and faded; if an LLM edits a document too many times, it gets wrong.

Here’s how the study worked, and what it revealed.

First, the researchers curated DELEGATE-52, a dataset of documents across 52 different professional domains, from accounting ledgers to aviation bulletins, calendars to crystal structures. Then they designed prompts for performing relevant edit tasks for each document. More specifically, they designed pairs of edit tasks: a “forward” instruction to change the document and a “backward” instruction that reverses the change. Example: for a piece of sheet music in G major, the forward instruction “transpose this up a perfect fourth to C major” is followed by “transpose this down a perfect fourth to G major.” In other words, pitch the music up, then pitch it down by the same interval. In theory, each “round trip” of forward and backward edit should yield a document identical to the original. But it doesn’t.

The researchers quantified corruption by comparing the original document to the state of the document after each round trip. On average, after just two simulated LLM interactions (or one round trip), 18% of a document’s content no longer matched the original. After six interactions, a third of document content was corrupted. After 20 interactions, the documents were, on average, over 50% corrupted.

The severity of the problem depended on the type of document being worked on. In general, LLMs corrupted documents less when they were repetitive, numerical and structurally dense (they almost perfectly preserved Python code) and corrupted them more when they contained mostly natural language prose such as creative writing or recipes.

Some LLMs fared better than others, but by the end of the simulation even the top three models (Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4) had degraded documents by 25% on average.
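To make the round-trip idea concrete, here is a minimal Python sketch of how such a measurement could be set up. It is not the researchers' code: the article does not say which similarity metric the paper uses, so the token-level `difflib` comparison below is an assumption, and the `faithful_up`, `faithful_down` and `lossy_up` functions are hypothetical stand-ins for LLM calls. The toy "document" is a G major scale, echoing the sheet music example.

```python
from difflib import SequenceMatcher
from typing import Callable, List

# Chromatic scale used by the toy transposition "editors" below.
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

Editor = Callable[[str], str]


def transpose(document: str, semitones: int) -> str:
    """Shift every space-separated note name by the given number of semitones."""
    shifted = [NOTES[(NOTES.index(note) + semitones) % 12] for note in document.split()]
    return " ".join(shifted)


def corruption(original: str, current: str) -> float:
    """Fraction of tokens that no longer match the original (assumed metric)."""
    matcher = SequenceMatcher(None, original.split(), current.split(), autojunk=False)
    return 1.0 - matcher.ratio()


def round_trip_scores(document: str, forward: Editor, backward: Editor,
                      round_trips: int) -> List[float]:
    """Apply forward then backward edits repeatedly, scoring against the original."""
    scores = []
    current = document
    for _ in range(round_trips):
        current = backward(forward(current))
        scores.append(corruption(document, current))
    return scores


# Hypothetical editors: in a real experiment these would be LLM calls.
def faithful_up(doc: str) -> str:
    return transpose(doc, 5)   # up a perfect fourth (5 semitones)


def faithful_down(doc: str) -> str:
    return transpose(doc, -5)  # down a perfect fourth


def lossy_up(doc: str) -> str:
    # Simulates a careless edit: transposes correctly but drops the last note.
    return " ".join(transpose(doc, 5).split()[:-1])


original = "G A B C D E F# G"  # G major scale
faithful = round_trip_scores(original, faithful_up, faithful_down, 3)
lossy = round_trip_scores(original, lossy_up, faithful_down, 3)
print(faithful)
print(lossy)
```

With perfectly reversible editors the score stays at zero on every round trip. With the lossy editor, each pass loses a little more of the original, and the score climbs with every round trip, which is the pattern the study reports for real models.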
The Microsoft paper makes it clear that LLMs alone, acting outside of any agentic architecture, shouldn’t be trusted with complex documents.

“None of this is surprising or shocking,” said Mihai Criveti, a Distinguished Engineer at IBM and Chief Architect of watsonx Orchestrate, in an interview with IBM Think. “Large language models are unreliable narrators.”

This, one might think, is a problem we can solve with agentic AI. An agentic harness ostensibly equips an LLM with tools, instructions, guidelines and guardrails that help it to execute tasks more reliably and autonomously. So the researchers’ next observation might seem surprising: basic agentic tool use actually made things (slightly) worse.

