The promise of reproducible AI systems has long hinged on a deceptively simple assumption: set temperature to zero, get deterministic outputs. Yet practitioners know this isn't true. Feed the same prompt to an LLM twice—even with identical configurations—and you'll often receive subtly different responses. This gap between theoretical expectation and empirical reality represents a fundamental challenge for any organization deploying LLMs in production environments where consistency matters: financial forecasting, medical decision support, scientific discovery workflows.

A new paper from Thinking Machines Lab brings mathematical rigor to what has long been an awkward empirical observation. Rather than dismissing non-determinism at T=0 as a minor implementation quirk, the authors propose treating it as a measurable phenomenon worthy of characterization. Their framework introduces background temperature (T_bg), a quantity that captures the effective stochasticity introduced by the computational substrate itself, independent of the nominal sampling temperature.

The technical insight is elegant: implementation-level sources of non-determinism don't vanish; they simply operate beneath the user-facing temperature parameter. Three primary culprits emerge from the analysis. First, batch-size variation affects how operations are grouped during inference, which can change the numerical results. Second, kernel non-invariance means that different hardware-level implementations of the same mathematical operation (matrix multiplications, reductions) can produce slightly different floating-point results depending on operation ordering. Third, floating-point non-associativity, the mathematical reality that (a + b) + c ≠ a + (b + c) in finite-precision arithmetic, compounds across the enormous number of operations in a forward pass. These aren't bugs; they're fundamental properties of numerical computation on real hardware.
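
To make the non-associativity point concrete, here is a small NumPy sketch (not from the paper) showing that summing the same numbers in a different grouping or order can yield different floating-point results; repeated across every reduction in a forward pass, this is exactly the slack that kernel choice and batch grouping expose.

```python
import numpy as np

# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0  (the 1.0 is absorbed by -1e16 before it can survive)

# The same effect in a large reduction: summing identical values in a
# different order (as a different kernel or batch split might) typically
# shifts the float32 result by a small amount.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)
print(np.sum(x), np.sum(rng.permutation(x)))
```

On most machines the two large sums differ in their last few significant digits, which is all a greedy decoder needs to occasionally flip a token whose top two logits are nearly tied.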

The formalization models the inference environment I as a source of stochastic perturbation. Rather than assuming clean, deterministic computation, the framework treats the actual output distribution as governed by an effective temperature T_bg(I). The authors propose an empirical protocol to estimate this quantity: by comparing an LLM's outputs for identical prompts across repeated runs, one can fit the equivalent temperature T_bg(I) that would produce the same variance in an idealized reference system. This measurement approach is model-agnostic and environment-specific, enabling systematic characterization across different hardware, batch sizes, and inference frameworks.
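
The paper's exact estimator isn't spelled out above, but a minimal sketch of this kind of measurement, assuming an entropy-matching criterion and purely illustrative names, logits, and counts, could look like the following: re-run the same prompt many times at nominal T=0, measure how spread out the observed outputs are, and solve for the temperature that would make an idealized reference distribution equally spread out.

```python
import numpy as np
from collections import Counter

def softmax(logits, temperature):
    """Reference softmax at a given sampling temperature."""
    z = logits / max(temperature, 1e-8)
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def fit_background_temperature(observed_outputs, reference_logits,
                               lo=1e-4, hi=2.0, iters=60):
    """Find the temperature at which a reference softmax is as spread out
    (entropy-matched) as the empirically observed output distribution."""
    counts = np.array(list(Counter(observed_outputs).values()), dtype=float)
    target = entropy(counts / counts.sum())
    # Softmax entropy increases monotonically with temperature, so bisect.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(softmax(reference_logits, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative data: 100 nominally deterministic (T=0) runs of one prompt
# that mostly, but not always, produced the same first token, plus logits
# for that position taken from a single reference forward pass.
outputs = ["blue"] * 96 + ["azure"] * 3 + ["teal"]
reference_logits = np.array([9.1, 6.3, 5.8, 2.0])
print(f"estimated T_bg ~ {fit_background_temperature(outputs, reference_logits):.3f}")
```

The real protocol may score the discrepancy differently, but the shape is the same: observed run-to-run variability in, equivalent temperature out.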

The empirical validation spans major LLM providers, revealing that background temperatures are neither negligible nor uniform. Different inference setups—different GPUs, batch configurations, quantization schemes—produce measurably different T_bg values. This has immediate implications. For tasks where determinism is critical (reproducible research, regulatory compliance, A/B testing), practitioners need to understand their actual system's background temperature and potentially design inference pipelines that minimize it. Conversely, for many applications, recognizing T_bg as a source of variance enables better experimental design: rather than treating non-determinism as noise to suppress, researchers can account for it statistically.
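
As a concrete, if simplified, illustration of "accounting for it statistically": rerun the identical configuration several times, treat that spread as the noise floor, and only credit differences between configurations that clearly exceed it. The scores below are invented for illustration.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation over repeated identical runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# Invented accuracy scores: the *same* model, prompt set, and config,
# rerun five times each. The within-config spread is the T_bg noise floor.
baseline_runs  = [0.742, 0.748, 0.739, 0.745, 0.744]
candidate_runs = [0.751, 0.747, 0.754, 0.749, 0.752]

b_mean, b_std = summarize(baseline_runs)
c_mean, c_std = summarize(candidate_runs)
print(f"baseline  {b_mean:.3f} +/- {b_std:.3f}")
print(f"candidate {c_mean:.3f} +/- {c_std:.3f}")

# Crude rule of thumb: a gap that doesn't clearly exceed run-to-run spread
# is indistinguishable from implementation-level noise.
gap = c_mean - b_mean
floor = max(b_std, c_std)
print("meaningful" if abs(gap) > 2 * floor else "within the noise floor")
```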

This work sits at an important intersection of machine learning systems and statistical rigor. As LLMs move from research artifacts to production infrastructure, the gap between theoretical models and actual behavior becomes increasingly costly. The field has largely ignored implementation-level randomness, treating it as beneath the abstraction level of "machine learning." Yet this randomness affects everything downstream: model evaluation (are performance differences significant or implementation artifacts?), deployment consistency, and even the validity of safety benchmarks.

CuraFeed Take: This paper addresses a problem that has been quietly festering in the ML infrastructure layer. The contribution isn't groundbreaking—practitioners already know non-determinism exists—but the formalization matters enormously. By giving background temperature a name, a mathematical definition, and a measurement protocol, the authors create a shared vocabulary for discussing something previously treated as an embarrassing implementation detail. This is exactly the kind of "boring" infrastructure work that becomes critical as systems scale. The real value emerges when organizations start systematically measuring T_bg across their inference stack and making architectural decisions accordingly. We should expect to see this concept adopted in evaluation frameworks and deployment specifications within the next 18 months. The winners will be inference optimization companies and cloud providers who can offer "low T_bg" guarantees as a service differentiator. The losers might be researchers who've built papers on supposedly deterministic zero-temperature baselines that turn out to have non-trivial background temperature. Watch for follow-up work quantifying how T_bg varies with model scale, quantization schemes, and distributed inference setups—those are the practical questions that will determine real-world impact.