Threvo introduces

SRED

The first training optimizer that detects and eliminates dead zones in feed-forward layers, automatically, during every training step, without touching model quality.

The Problem

The cost of training AI
is exploding.
The waste inside every run
is unsustainable.

$500M per frontier run · 2025 (Source: Epoch AI, 2025)
$350B AI compute spend · 2025 (Source: IDC, 2025)
electricity demand · 2030 (Source: IEA, 2026)

Frontier model training crossed $500M per run in 2025. By 2027, it is projected to reach $1B to $3B per run. The largest technology companies spent $350 billion on AI compute infrastructure in 2025 alone.

AI data center electricity consumption surged 50% in 2025 and is projected to triple by 2030, approaching the entire electricity consumption of Japan.

Inside every run, from $1M fine-tunes to $500M pretraining jobs, a significant portion of compute is spent on operations that contribute nothing to model quality and burn cluster power unnecessarily. A large and growing fraction of FFN neurons become computationally inactive as models train, yet training with any standard optimizer keeps computing them at every step. The waste is baked into every training pipeline on earth. With every optimizer. At every scale.

The field has known.
Nobody fixed it during training.

Until now.

The Opportunity

The bigger the model,
the bigger the waste.
The bigger the waste,
the bigger the saving.

7B run $50K – $500K
70B run $1.2M – $6M
Frontier 2026 $200M – $500M
Frontier 2027 $1B – $3B
Sources: Epoch AI, 2025 · Galileo AI, 2025

Redundant compute is not a fixed cost. It compounds with model size, with cluster scale, with every generation of infrastructure the industry deploys.

The organizations operating at the frontier of AI are not facing an optimization problem. They are facing a structural one. Every training run, regardless of architecture, dataset, or hardware, carries the same inefficiency at its foundation.

At 7B scale, early benchmark data suggests SRED reduces dead-zone compute by [X]%, translating to [$X] in direct energy savings per run. At frontier scale, where a single run costs orders of magnitude more, the savings scale with it.

Full results will be published in Q3 2026.

How It Works

The optimizer built for how models actually train.

AdamW

Standard optimizers compute everything.

Every neuron. Every layer. Every forward pass. As training progresses, a growing fraction of FFN neurons become computationally inactive, reaching the majority of neurons in some layers at scale. Training with AdamW computes them every step regardless.

Source: Voita et al., "Neurons in Large Language Models: Dead, N-gram, Positional" — ACL 2024
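
To make "computationally inactive" concrete, here is a minimal sketch in plain Python. It assumes a ReLU feed-forward layer and counts a neuron as inactive when it outputs zero on every input in a sample, in the spirit of the dead-neuron definition used by Voita et al. The helper names (`hidden_activations`, `inactive_fraction`) and the toy weights are illustrative assumptions and are not part of SRED.

```python
import random


def hidden_activations(w1, b1, x):
    """First FFN projection followed by ReLU: relu(W1 x + b1)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]


def inactive_fraction(w1, b1, inputs):
    """Fraction of hidden neurons that output zero on every input in the sample."""
    ever_active = [False] * len(w1)
    for x in inputs:
        for j, a in enumerate(hidden_activations(w1, b1, x)):
            if a > 0.0:
                ever_active[j] = True
    return 1.0 - sum(ever_active) / len(w1)


# Toy layer (assumption): 4 inputs, 8 hidden neurons, weights and inputs in [-1, 1].
# Neurons 0-2 get a large negative bias, so their pre-activation is always below
# zero and they never fire. Neurons 3-7 get a large positive bias and always fire.
rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
b1 = [-10.0] * 3 + [5.0] * 5
inputs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(100)]

print(inactive_fraction(w1, b1, inputs))  # 3 of 8 neurons never fire
```

On a real model the same count would be taken over the activations of each FFN layer across a large, diverse data sample rather than over a contrived toy layer.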

SRED

Detects computationally inactive FFN neurons. Skips them. Every step.

At each training step, SRED detects the FFN neurons that are inactive and skips their computation. The rest of the model trains unaffected.

The inactive neuron count grows throughout training as the model specializes, and grows further with scale. SRED's energy savings grow with both.
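
Why skipping can leave the rest of the model unaffected: in a ReLU feed-forward layer, a neuron whose activation is zero adds nothing to the layer's output, so leaving it out of the computation gives the same result. The sketch below illustrates that general property in plain Python. It is not SRED's implementation, which is not described here; the `ffn` function, its `keep` argument, and the toy weights are hypothetical.

```python
def ffn(w1, b1, w2, x, keep=None):
    """y = W2 * relu(W1 x + b1), optionally computing only the hidden neurons in `keep`."""
    idx = list(range(len(w1))) if keep is None else list(keep)
    # Hidden activations, computed only for the selected neurons.
    h = {
        j: max(0.0, sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
        for j in idx
    }
    # Output projection over the same selected neurons.
    return [sum(row[j] * h[j] for j in idx) for row in w2]


# Toy layer (assumption): 2 inputs, 3 hidden neurons, 2 outputs.
# Neuron 2 has a large negative bias, so it outputs zero for this input.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, -10.0]
w2 = [[0.5, -0.5, 2.0], [1.0, 1.0, 3.0]]
x = [0.5, 0.25]

full = ffn(w1, b1, w2, x)                 # computes all three neurons
skipped = ffn(w1, b1, w2, x, keep=[0, 1])  # skips the inactive neuron

print(full, skipped)
```

The same holds for the backward pass with ReLU: a neuron that outputs zero on a batch receives zero gradient on its incoming weights and contributes zero gradient to its outgoing weights for that batch. Smooth activations such as GELU or SiLU rarely produce exact zeros, so this exact equivalence is specific to the ReLU case assumed here.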

Optimizer level. No pipeline changes. No quality loss.

LLaMA Mistral GPT Falcon T5 Qwen

Works across every major transformer architecture.

Integration

Drop in. Nothing else changes.

SRED is a drop-in replacement for AdamW. Your training loop, your model, your pipeline — untouched.

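train.py

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from transformers import LlamaForCausalLM, get_cosine_schedule_with_warmup
# SREDOptimizer import omitted: its package path is not shown here.

# Setup - unchanged
model = FSDP(LlamaForCausalLM(config), sharding_strategy=ShardingStrategy.FULL_SHARD)

- optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
+ optimizer = SREDOptimizer(model.parameters(), lr=3e-4, weight_decay=0.1, model=model)  # only change

# Cosine schedule with 1% warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * total_steps),
    num_training_steps=total_steps,
)

# Training loop - unchanged
for step, batch in enumerate(train_loader):
    with torch.autocast("cuda", dtype=torch.bfloat16):
        # Scale the loss so accumulated gradients average over micro-batches
        loss = model(**batch).loss / grad_accum_steps
    loss.backward()
    if (step + 1) % grad_accum_steps == 0:
        # FSDP shards gradients across ranks, so clip with the FSDP method
        # rather than torch.nn.utils.clip_grad_norm_
        model.clip_grad_norm_(1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
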
Training loop unchanged · No model modifications · No hyperparameter changes · FSDP compatible

If it runs on AdamW today, it runs on SRED tomorrow.

Early Access

Start eliminating wasted compute from day one.

For ML teams running serious training workloads. Limited early access open now.

We review every request personally. Compute grant applications welcome.