Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell

Leading inference providers running open-source models on NVIDIA's Blackwell platform report token-cost reductions large enough to change AI economics across industries. Sully.ai cut inference costs by 90% (a 10x reduction) and improved response times by 65%, returning over 30 million minutes to physicians. DeepInfra lowered the cost per million tokens for a mixture-of-experts (MoE) model from $0.20 on Hopper to $0.05 on Blackwell, a 4x improvement. Fireworks/Sentient achieved 25–50% better cost efficiency while absorbing a viral launch with a 1.8 million-user waitlist and 5.6 million queries in a week. Together/Decagon cut cost per voice query 6x while keeping latencies under 400 ms.

NVIDIA positions these results as a platform-level advantage, citing the GB200 NVL72 and the upcoming Rubin system, which it says will deliver another 10x step. If that holds, it could meaningfully affect infrastructure spending, model deployment economics and vendor selection in AI-dependent verticals.
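The figures above mix two conventions: percentage cuts ("90%") and "Nx" multiples. The following minimal Python sketch converts between the two so the case studies can be compared on one scale. The function names are illustrative, not taken from any provider's tooling, and the only inputs are the numbers quoted above.

```python
def reduction_factor(old_cost: float, new_cost: float) -> float:
    """Return how many times cheaper new_cost is than old_cost (4.0 means 4x)."""
    if old_cost <= 0 or new_cost <= 0:
        raise ValueError("costs must be positive")
    return old_cost / new_cost


def percent_cut_to_factor(percent_cut: float) -> float:
    """Convert a percentage cost cut (e.g. 90) into an 'Nx cheaper' multiple."""
    if not 0 <= percent_cut < 100:
        raise ValueError("percent_cut must be in [0, 100)")
    return 1.0 / (1.0 - percent_cut / 100.0)


def factor_to_percent_cut(factor: float) -> float:
    """Convert an 'Nx cheaper' multiple (e.g. 6) into a percentage cost cut."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return (1.0 - 1.0 / factor) * 100.0


# DeepInfra: $0.20 -> $0.05 per million tokens
print(f"{reduction_factor(0.20, 0.05):.1f}x")   # 4.0x
# Sully.ai: a 90% cut expressed as a multiple
print(f"{percent_cut_to_factor(90):.1f}x")      # 10.0x
# Together/Decagon: a 6x cut expressed as a percentage
print(f"{factor_to_percent_cut(6):.1f}%")       # 83.3%
```

Read this way, the 6x Together/Decagon result corresponds to roughly an 83% cut, which sits between DeepInfra's 75% (4x) and Sully.ai's 90% (10x).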

Analysis

Market structure: NVIDIA (NVDA) is the primary beneficiary. Extreme hardware-software codesign (Blackwell/Rubin) creates multi-quarter pricing power in datacenter GPUs, forcing competitors (AMD, INTC) to defend share or concede on ASPs. Open-source inference stacks (Baseten, DeepInfra, Together) expand addressable demand by lowering token cost 4x–10x, which should increase token volumes and GPU demand even as per-token revenues fall.

Risk assessment: Tail risks include export controls on advanced-node GPUs, regulatory limits on open models (EU AI Act) and healthcare liability from model errors; each could halve near-term revenue growth in the worst case. Immediate risk (days) centers on NVDA guidance and earnings; short-term risk (3–6 months) on partner adoption and TSMC capacity; long-term risk (12–36 months) on the Rubin ramp and broader capex cycles.

Trade implications: Favor a long bias toward NVDA and TSMC, paired with defensive hedges against a guidance miss; use 6–12 month call spreads to express upside while capping cost. Short or underweight AMD and INTC relative to NVDA, since Blackwell's lead and software stack create a widening moat. Rotate into datacenter software and inference platforms (public cloud beneficiaries) and reduce exposure to pure-play, high-cost inference SaaS where tokenomics worsen.

Contrarian angles: Consensus understates the engineering and frictional costs of migrating from closed models to open-source ones (fine-tuning, safety, latency tuning), so tokenomics improvements may arrive more slowly than the case studies imply. The market may be overpricing immediate 10x wins; a 20–30% pullback in NVDA on a single-quarter miss is plausible. Watch for second-order effects: cheaper tokens could compress SaaS pricing and force earlier-than-expected consolidation among inference providers.
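The market-structure claim that token volumes can rise "even as per-token revenues fall" rests on simple break-even arithmetic: if the price per token falls by a factor of k, volume has to grow by at least k for revenue to hold. Here is a minimal Python sketch of that relationship. It assumes, purely for illustration, that prices fall in line with the DeepInfra cost figures ($0.20 to $0.05 per million tokens). The volume multiples are hypothetical and the function names are not from any source.

```python
def breakeven_volume_multiple(old_price: float, new_price: float) -> float:
    """Volume growth needed to keep revenue flat after a price change."""
    if old_price <= 0 or new_price <= 0:
        raise ValueError("prices must be positive")
    return old_price / new_price


def revenue_change(old_price: float, new_price: float, volume_multiple: float) -> float:
    """Fractional revenue change when price moves old -> new and volume scales."""
    if old_price <= 0 or new_price <= 0 or volume_multiple < 0:
        raise ValueError("prices must be positive and volume non-negative")
    return (new_price * volume_multiple) / old_price - 1.0


# Assumed prices per million tokens, mirroring the DeepInfra cost figures.
OLD_PRICE, NEW_PRICE = 0.20, 0.05

print(f"{breakeven_volume_multiple(OLD_PRICE, NEW_PRICE):.1f}x")  # 4.0x
# Hypothetical scenarios: volume doubles vs. volume grows sixfold.
print(f"{revenue_change(OLD_PRICE, NEW_PRICE, 2.0):+.0%}")        # -50%
print(f"{revenue_change(OLD_PRICE, NEW_PRICE, 6.0):+.0%}")        # +50%
```

Under these assumptions a 4x price cut needs more than 4x volume growth before provider revenue expands, which is why adoption pace matters as much as the headline cost reduction.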
