The first training optimizer that detects and eliminates dead zones in feed-forward layers, automatically, during every training step, without touching model quality.
The Problem
Frontier model training costs crossed $500M per run in 2025. By 2027, they are projected to reach $1B to $3B per run. The largest technology companies spent $350 billion on AI compute infrastructure in 2025 alone.
AI data center electricity consumption surged 50% in 2025 and is projected to triple by 2030, approaching the entire electricity consumption of Japan.
Inside every run, from $1M fine-tunes to $500M pretraining jobs, a significant portion of compute is spent on operations that contribute nothing to model intelligence and burn cluster power unnecessarily. A large and growing fraction of feed-forward network (FFN) neurons becomes computationally inactive as models train, yet every training step keeps computing them. This waste is baked into every training pipeline on earth. With every optimizer. At every scale.
The field has known.
Nobody fixed it during training.
Until now.
The Opportunity
Redundant compute is not a fixed cost. It compounds with model size, with cluster scale, with every generation of infrastructure the industry deploys.
The organizations operating at the frontier of AI are not facing an optimization problem. They are facing a structural one. Every training run, regardless of architecture, dataset, or hardware, carries the same inefficiency at its foundation.
At 7B scale, early benchmark data suggests SRED reduces dead-zone compute by [X]%, translating to [$X] in direct energy savings per run. At frontier scale, that number compounds by orders of magnitude.
Full results will be published in Q3 2026.
How It Works
Every neuron. Every layer. Every forward pass. As training progresses, a growing fraction of FFN neurons becomes computationally inactive, reaching a majority of FFN neurons in some layers at scale. A training run on AdamW still computes and updates them at every step.
Source: Voita et al., "Neurons in Large Language Models: Dead, N-gram, Positional" — ACL 2024
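One way to picture an inactive neuron: in a ReLU feed-forward layer, it is a hidden unit whose output is zero for every input in a sample. The sketch below counts such units in plain Python. It is an illustration only, not SRED's detection method, which this page does not describe. The names ffn_hidden and inactive_fraction, the toy weights, and the forced negative biases are all assumptions made for the example.

```python
def ffn_hidden(x, w1, b1):
    """Hidden activations of one FFN layer: relu(W1 x + b1)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)]


def inactive_fraction(batch, w1, b1):
    """Fraction of hidden neurons that output zero for every input in the batch."""
    ever_active = [False] * len(w1)
    for x in batch:
        for j, h in enumerate(ffn_hidden(x, w1, b1)):
            if h > 0.0:
                ever_active[j] = True
    return ever_active.count(False) / len(w1)


# Toy layer (hypothetical sizes): 4 inputs, 8 hidden neurons, all weights nonzero.
d_model, d_ff = 4, 8
w1 = [[0.1 * (j + 1) * (1 if (i + j) % 2 == 0 else -1) for i in range(d_model)]
      for j in range(d_ff)]
# Neurons 0 and 1 get a large negative bias so they can never fire (illustrative).
b1 = [-100.0 if j < 2 else 0.0 for j in range(d_ff)]
# Probe inputs: each basis vector and its negation.
batch = [[s if i == k else 0.0 for i in range(d_model)]
         for k in range(d_model) for s in (1.0, -1.0)]

dead_share = inactive_fraction(batch, w1, b1)
print(f"inactive fraction: {dead_share:.2f}")
```

In a real model the same count would run over many tokens, and the result depends on the data used: a neuron that is silent on one sample may fire on another.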
At each training step, SRED detects inactive FFN neurons and eliminates them from the computation. The rest of the model trains unaffected.
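The claim that the rest of the model is unaffected can be checked on a toy case. In a ReLU feed-forward layer, a neuron that outputs zero contributes nothing to the layer output, so leaving it out gives the same result. The sketch below is a hedged illustration of that property, not SRED's implementation. The function ffn, its keep parameter, and all weights are hypothetical.

```python
def ffn(x, w1, b1, w2, b2, keep=None):
    """y = W2 relu(W1 x + b1) + b2, computed only over hidden neurons in `keep`."""
    idx = list(range(len(w1))) if keep is None else list(keep)
    h = {j: max(0.0, sum(w * xi for w, xi in zip(w1[j], x)) + b1[j]) for j in idx}
    return [sum(w2[o][j] * h[j] for j in idx) + b2[o] for o in range(len(w2))]


# Toy layer (hypothetical sizes and weights).
d_model, d_ff = 4, 8
w1 = [[0.1 * (j + 1) * (1 if (i + j) % 2 == 0 else -1) for i in range(d_model)]
      for j in range(d_ff)]
b1 = [-100.0 if j < 3 else 0.0 for j in range(d_ff)]  # neurons 0-2 forced inactive
w2 = [[0.05 * (o + 1) * (j + 1) for j in range(d_ff)] for o in range(d_model)]
b2 = [0.5] * d_model
batch = [[s if i == k else 0.0 for i in range(d_model)]
         for k in range(d_model) for s in (1.0, -1.0)]

# Neurons that fire for at least one input in this batch.
active = [j for j in range(d_ff)
          if any(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j] > 0.0 for x in batch)]

# Largest output difference between the full layer and the reduced layer.
max_diff = max(abs(a - b)
               for x in batch
               for a, b in zip(ffn(x, w1, b1, w2, b2),
                               ffn(x, w1, b1, w2, b2, keep=active)))
print(f"kept {len(active)} of {d_ff} neurons, max output difference {max_diff}")
```

The equality is exact only because ReLU produces true zeros. With smooth activations such as GELU or SiLU, outputs are rarely exactly zero, so any skipping scheme would need a threshold and the match would be approximate.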
Inactive neuron count grows throughout training as the model specializes, and compounds further with scale. SRED's energy savings grow with both.
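To see why savings would scale with both the inactive share and the model size, a back-of-envelope estimate helps. The sketch below counts matrix-multiply FLOPs per token for one two-matrix FFN layer and scales them by the share of neurons still computed. The layer sizes and the 50% inactive share are illustrative assumptions, not SRED measurements, and the function name is hypothetical. The figure is an idealized upper bound: it ignores detection overhead, attention and other layers, and how efficiently hardware runs the reduced matrices.

```python
def ffn_training_flops_per_token(d_model, d_ff, inactive_fraction=0.0):
    """Approximate matmul FLOPs per token for one two-matrix FFN layer.

    Forward: 2 matmuls x 2 FLOPs per multiply-add = 4 * d_model * d_ff.
    Backward is taken as roughly 2x forward, giving 12 * d_model * d_ff in total.
    """
    if not 0.0 <= inactive_fraction <= 1.0:
        raise ValueError("inactive_fraction must be in [0, 1]")
    active_ff = d_ff * (1.0 - inactive_fraction)
    return 12 * d_model * active_ff


# Illustrative layer sizes and a hypothetical 50% inactive share.
baseline = ffn_training_flops_per_token(4096, 16384)
reduced = ffn_training_flops_per_token(4096, 16384, inactive_fraction=0.5)
saved_share = 1.0 - reduced / baseline
print(f"baseline {baseline:.3e} FLOPs, reduced {reduced:.3e} FLOPs, saved {saved_share:.0%}")
```

Under these assumptions the saved share of FFN compute equals the inactive share, and the absolute saving grows with d_model and d_ff.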
Optimizer level. No pipeline changes. No quality loss.
Works across every major transformer architecture.
Integration
SRED is a drop-in replacement for AdamW. Your training loop, your model, your pipeline — untouched.
If it runs on AdamW today, it runs on SRED tomorrow.
Early Access
For ML teams running serious training workloads. Limited early access open now.
We review every request personally. Compute grant applications welcome.