The fundamental challenge constraining transformer inference at scale remains the O(n²) compute and memory cost of self-attention, where n denotes sequence length. While recurrent and state-space models (SSMs) offer constant-memory alternatives, their fixed-dimensional state bottleneck inevitably discards long-tail distributional information critical for coherent generation. Recent test-time training (TTT) approaches attempt to circumvent this limitation by encoding context into learnable parameters, yet they suffer from two critical failure modes: overfitting to token-level projection artifacts and disrupting the causal structure encoded during pretraining.

Absorber LLM reformulates long-context retention as a causal synchronization objective. The core insight is simple: after absorbing historical context into model parameters via gradient updates, the modified model operating without explicit context should produce distributions indistinguishable from those of the original model given the full context. This formulation preserves the causal dependencies learned during pretraining while enabling parameter-efficient context compression. Concretely, the method optimizes alignment between internal activations (attention patterns, hidden states, and layer-wise representations) rather than superficial output matching, which forces the model to internalize context semantics rather than memorize surface-level patterns.
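
A minimal sketch of what such an activation-alignment objective could look like in PyTorch; the function name sync_loss, the use of mean-squared error, and the restriction to hidden states are illustrative assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def sync_loss(student_hiddens, teacher_hiddens, query_len):
    """Layer-wise alignment between the adapted (context-free) model and
    the original (full-context) model.

    Both arguments are tuples of per-layer activations with shape
    [batch, seq, dim]. The teacher's sequence includes the absorbed
    context, so only the trailing `query_len` positions are compared.
    Attention maps could be aligned analogously.
    """
    loss = 0.0
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        loss = loss + F.mse_loss(h_s[:, -query_len:], h_t[:, -query_len:].detach())
    return loss / len(student_hiddens)
```

Because the target activations come from the frozen full-context model, matching them layer by layer is a stricter requirement than matching final logits alone, which is what discourages surface-level memorization of the context tokens.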

The synchronization mechanism operates by computing behavioral divergence across intermediate layers during test-time adaptation and backpropagating that divergence to update parameters in a way that maintains distributional equivalence with the original model. This differs fundamentally from naive fine-tuning: by explicitly constraining the parameter-update trajectory to preserve causal relationships, it prevents the model from learning spurious token-level correlations that fail to generalize beyond the training context window.
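
A corresponding sketch of the test-time adaptation loop, assuming a Hugging Face-style causal LM that exposes hidden_states and reusing the sync_loss sketch above; the function name absorb_context, the AdamW optimizer, the step count, and the decision to update all parameters are assumptions for illustration.

```python
import copy
import torch

def absorb_context(model, context_ids, query_ids, steps=20, lr=1e-4):
    """Absorb `context_ids` into the model's parameters so that, given only
    `query_ids`, its internal activations track those of the frozen original
    model run on the concatenated [context; query] sequence."""
    teacher = copy.deepcopy(model).eval()            # frozen original model
    for p in teacher.parameters():
        p.requires_grad_(False)

    # The teacher sees the full context; the adapted model sees only the query.
    full_ids = torch.cat([context_ids, query_ids], dim=1)
    with torch.no_grad():
        t_hiddens = teacher(full_ids, output_hidden_states=True).hidden_states

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        s_hiddens = model(query_ids, output_hidden_states=True).hidden_states
        # Behavioral divergence across layers, restricted to the query span;
        # `sync_loss` is the alignment sketch defined above.
        loss = sync_loss(s_hiddens, t_hiddens, query_len=query_ids.shape[1])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

In practice one would likely restrict updates to a small set of adapter parameters rather than the full weight matrices, but the structure of the loop, with a frozen full-context teacher constraining the update trajectory of the context-free model, is the same.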

Experimental validation on established long-context benchmarks demonstrates substantial improvements in both memory efficiency and perplexity metrics relative to parameter-as-memory baselines, while maintaining competitive performance against full-context transformers on standard tasks. The method's effectiveness suggests that causal structure preservation during adaptation is essential for robust context absorption in pretrained language models.