The deployment of large language models on CPU-only infrastructure faces a fundamental constraint: memory bandwidth becomes the critical bottleneck during autoregressive decoding. Traditional quantization approaches reduce this pressure through weight compression, yet they perpetuate expensive floating-point arithmetic during inference. FairyFuse reconceptualizes this problem by leveraging ternary weight quantization, where network parameters collapse to the discrete set {-1, 0, +1}, enabling a multiplication-free computational paradigm.
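As a minimal scalar illustration of this multiplication-free arithmetic (a sketch, not FairyFuse's kernel; the function name and int8 weight encoding are assumptions), a dot product against ternary weights reduces to signed accumulation:

```c
#include <stdint.h>

/* Sketch: dot product with ternary weights w[i] in {-1, 0, +1}.
 * Each weight merely selects an operation: add, subtract, or skip.
 * No multiplication is ever performed. */
float ternary_dot(const int8_t *w, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (w[i] == 1)       acc += x[i];
        else if (w[i] == -1) acc -= x[i];
        /* w[i] == 0: no contribution */
    }
    return acc;
}
```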

The core innovation exploits the mathematical structure of ternary weights within widely-linear transformations. Each real-valued GEMV operation decomposes into eight sub-GEMVs corresponding to the ternary weight components. Rather than executing these sequentially with dequantization overhead, FairyFuse fuses the entire computation into a single AVX-512 vectorized loop. Multiplication reduces to conditional addition, subtraction, or identity: primitive CPU instructions that saturate the vector units without ever invoking floating-point multipliers. In roofline terms, the 16× weight compression raises arithmetic intensity, shifting a memory-bound operation toward the compute-bound regime and unlocking performance previously inaccessible on bandwidth-constrained platforms.
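To make the conditional-accumulation primitive concrete, the following C sketch shows a single real-valued ternary GEMV inner loop built on AVX-512 mask registers. It illustrates one sub-GEMV, not the full eight-sub-GEMV widely-linear fusion, and the two-bitmask row layout (`plus` marking the +1 positions, `minus` the -1 positions), function name, and packing are assumptions rather than FairyFuse's actual format:

```c
#include <immintrin.h>  /* compile with -mavx512f */
#include <stdint.h>

/* Sketch: y = W x for W in {-1, 0, +1}^(rows x cols), cols a multiple of 16.
 * Each row of W is stored as per-16-lane bitmasks: plus[g] flags the +1
 * weights and minus[g] the -1 weights of group g. The FP multiply is gone;
 * each mask register steers a conditional vector add or subtract. */
void ternary_gemv_avx512(const uint16_t *plus, const uint16_t *minus,
                         const float *x, float *y, int rows, int cols) {
    const int groups = cols / 16;
    for (int r = 0; r < rows; ++r) {
        __m512 acc = _mm512_setzero_ps();
        for (int g = 0; g < groups; ++g) {
            __m512 xv = _mm512_loadu_ps(x + 16 * g);
            __mmask16 p = plus[r * groups + g];
            __mmask16 m = minus[r * groups + g];
            acc = _mm512_mask_add_ps(acc, p, acc, xv);  /* +1 lanes: add */
            acc = _mm512_mask_sub_ps(acc, m, acc, xv);  /* -1 lanes: sub */
            /* lanes in neither mask are the zero weights: skipped */
        }
        y[r] = _mm512_reduce_add_ps(acc);
    }
}
```

The roofline arithmetic behind the final claim is straightforward: a GEMV performs roughly two operations per weight, so its arithmetic intensity is about 2/b ops per byte, where b is the bytes streamed per weight. Compressing the weights 16× shrinks b by the same factor, raising intensity 16× and moving the kernel rightward on the roofline plot, away from the memory-bandwidth ceiling.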

Empirical validation demonstrates that ternary quantization incurs minimal quality degradation: WikiText-2 perplexity remains near-lossless (5.52 vs. 5.47 for FP16), and downstream task accuracy reaches 66.0%, competitive with contemporary baselines. The performance differential between CPU and GPU implementations is striking: FairyFuse yields a 29.6× kernel speedup on CPUs, while the same optimization provides negligible gains on GPUs, where memory bandwidth is less constraining and floating-point throughput dominates.

This work underscores a critical insight for ML systems design: optimal inference architectures diverge significantly across hardware classes. CPU-centric deployment demands fundamentally different quantization and execution strategies than GPU-accelerated inference, challenging the assumption that universal quantization schemes suffice across deployment targets.