Caching LLM Extractions Without Lying: Conformal Gates + a Reasoning Budget Allocator

Source: DEV Community
The extraction pipeline processed 2,400 documents overnight. Cost: $380. The next morning I diffed the inputs against the previous batch: 87% were near-duplicates with trivial whitespace changes. I’d burned $330 re-extracting answers I already had. Not because the cache missed. Because my cache had no right to hit.

A TTL can tell you when something is old. It cannot tell you when something is wrong. And for an AI extraction pipeline, “wrong” is the only thing that matters. So I rebuilt the caching layer around a different idea: caching is a statistical validity problem, not an expiry problem. Then I paired it with a second idea that sounds obvious until you implement it: reasoning depth is a budget allocation problem, not a model selection problem.

What I ended up with in production is a two-stage system:

- Confidence-gated cache: per-selector reuse vs. partial rebuild, using a multi-signal score and conformal thresholds.
- Reasoning budget allocator: per-span compute decisions under a fixed budget.
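The near-duplicate problem above has a cheap first line of defense before any confidence gating: key the cache on whitespace-normalized content rather than raw bytes, so trivially reformatted documents hit the same entry. A minimal sketch (the `cache_key` helper is illustrative, not from the article):

```python
import hashlib
import re

def cache_key(text: str) -> str:
    """Collapse runs of whitespace and strip edges before hashing,
    so documents that differ only in formatting share a cache key."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two inputs that differ only in whitespace map to the same key,
# so the second one never triggers a re-extraction.
a = cache_key("Invoice  #42\n\tTotal: $19.99")
b = cache_key("Invoice #42 Total: $19.99")
```

This alone would have deduplicated the whitespace-trivial 87% in the batch described above; the conformal gate exists for the harder case where content actually changed.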
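The "conformal thresholds" half of the gate can be sketched with split conformal prediction: hold out a calibration set of nonconformity scores from extractions you verified, take the appropriate quantile, and reuse a cached answer only when its score falls at or below that threshold. This is a generic split-conformal sketch under my own assumptions, not the article's exact scoring:

```python
import math

def conformal_threshold(calib_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest
    calibration score. Reusing cached answers whose nonconformity score
    is at or below this threshold gives ~(1 - alpha) coverage, assuming
    calibration and production scores are exchangeable."""
    n = len(calib_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify coverage: never trust the cache.
        return float("-inf")
    return sorted(calib_scores)[k - 1]

def should_reuse(score, threshold):
    """Gate decision: low nonconformity -> reuse; high -> rebuild."""
    return score <= threshold

# Hypothetical calibration scores (higher = the cached answer looked worse).
calib = [0.02, 0.05, 0.07, 0.11, 0.13, 0.20, 0.25, 0.31, 0.40, 0.55]
t = conformal_threshold(calib, alpha=0.2)
```

The key property is that the threshold is chosen for a target error rate, not tuned by eye, which is what lets the cache "hit with a right to hit."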
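The second stage, treating reasoning depth as budget allocation, can be sketched as a greedy pass: fund deep reasoning for the least-confident spans first until a fixed compute budget is exhausted, with everything else falling back to a cheap single pass. The span tuples and the "cheap"/"deep" labels here are my own illustrative scaffolding:

```python
def allocate_budget(spans, budget):
    """Greedy allocator: spend the fixed budget on the spans where the
    model is least confident. `spans` is a list of
    (span_id, confidence, deep_cost_tokens) tuples; any span the budget
    cannot cover keeps the cheap default pass."""
    plan = {span_id: "cheap" for span_id, _, _ in spans}
    for span_id, _confidence, cost in sorted(spans, key=lambda s: s[1]):
        if cost <= budget:
            plan[span_id] = "deep"
            budget -= cost
    return plan

# Hypothetical extraction spans: id, confidence, cost of a deep pass.
spans = [("title", 0.95, 200), ("total", 0.40, 500), ("date", 0.60, 300)]
plan = allocate_budget(spans, budget=700)
```

A real allocator would weigh marginal value per token rather than raw confidence order, but the framing is the same: the decision is where to spend a fixed budget, not which model to pick globally.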