Google’s Gemma 4 open AI models use “speculative decoding” to get up to 3x faster

Google launched Multi-Token Prediction (MTP) drafters for Gemma 4, an experimental feature designed to speed up local AI generation through speculative decoding. The update targets on-device performance: the smaller 74M-parameter drafter models use a shared KV cache and sparse decoding to reduce compute overhead. Google also moved Gemma 4 to an Apache 2.0 license, giving developers more permissive terms.
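For readers unfamiliar with the technique, the draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a toy greedy variant, not Google's MTP implementation: `toy_target`, `toy_drafter`, and every function name below are hypothetical stand-ins, and the "models" are simple functions over integer tokens. A real system would verify all drafted tokens in one batched forward pass of the large model rather than one call per token.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[List[int]], int]


def toy_target(context: List[int]) -> int:
    """Hypothetical large model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_drafter(context: List[int]) -> int:
    """Hypothetical small drafter: agrees with the target except after token 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def greedy(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One round: the drafter proposes k tokens, the target checks them."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    emitted: List[int] = []
    for tok in draft:
        expected = target(context + emitted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
    # Every draft token was accepted, so the target adds one more token.
    emitted.append(target(context + emitted))
    return emitted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens and report how many verification rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds
```

The output matches plain greedy decoding with the target model token for token; the only thing that changes is how many verification rounds are needed, which is where any speedup would come from.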

Analysis

This is less about “better local AI” as a product headline and more about Google extending the economic life of non-TPU inference. If MTP works as advertised, it raises effective tokens-per-second on commodity GPUs and mobile/edge silicon, which partially flattens the moat of hyperscaler-optimized serving stacks and shifts some value from centralized cloud inference to the model vendor and the hardware layer. The immediate beneficiaries are likely Google’s distribution and developer ecosystem, while the most exposed parties are vendors selling pure inference throughput as a premium cloud service or proprietary orchestration layer.

The second-order effect is on adoption elasticity: a meaningful latency improvement lowers the threshold for embedded assistants, offline copilots, and privacy-sensitive workloads that were previously too slow or too expensive. That should expand the addressable market for local AI over the next 6-18 months, but it also risks compressing pricing for hosted inference if customers can get “good enough” performance on edge devices. In other words, Google may be trading margin concentration in cloud inference for broader model lock-in across more endpoints.

The contrarian issue is that speedups from speculative decoding are often front-loaded in demos and less durable in real-world heterogeneous prompts, where verification overhead and acceptance rates matter. If the drafter’s quality is good only on narrow token distributions, the practical uplift could disappoint and the market may overestimate how quickly local AI becomes a default deployment mode. The main reversal catalyst would be evidence that developers prefer fully managed APIs once they price in battery, memory, and engineering complexity, which would leave the local-AI thesis as a niche rather than a platform shift.
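The sensitivity to acceptance rates can be made concrete with the standard back-of-envelope model from the speculative decoding literature. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that the drafter proposes `k` tokens per round, and that one drafter step costs `cost_ratio` times one target-model step. All three parameters and both function names are illustrative assumptions, not figures published for Gemma 4.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification round.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target model always contributes one
    token (a correction on rejection, or a bonus token on full acceptance).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Speedup over plain decoding under the assumptions above.

    One round costs k drafter steps plus one target step, measured in
    units of a single target step.
    """
    round_cost = k * cost_ratio + 1.0
    return expected_tokens_per_round(alpha, k) / round_cost


if __name__ == "__main__":
    # Illustrative sweep only; the cost ratio of 0.05 is a made-up value.
    for alpha in (0.3, 0.5, 0.7, 0.9):
        print(f"alpha={alpha:.1f}  speedup={estimated_speedup(alpha, 4, 0.05):.2f}x")
```

Under this model the gain falls off quickly as `alpha` drops, and with a low enough acceptance rate the drafter's overhead makes generation slower than plain decoding, which is the mechanism behind the caution about heterogeneous real-world prompts.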
