Edge AI, meaning inference run at CDN points of presence instead of centralized data centers, has matured from an interesting idea to production-viable over 2024-2026. Cloudflare Workers AI, Vercel Edge Functions, AWS Lambda@Edge, and Fastly Compute all ship meaningful AI capability now. The fit is narrow but genuine. This post covers where edge deployment wins and where centralized still beats it.

When edge wins

Edge fits small models, global low-latency needs, and classification and routing work. Centralized fits large models, complex reasoning, and agent systems with shared state.

What edge deployment offers

Latency. A user in Mumbai calling a model in us-east-1 pays roughly 150-300ms of network latency per round trip. An edge deployment in Mumbai serves them from about 5ms away. For latency-sensitive UX, where the model itself responds quickly, the network hop is most of what the user waits for.
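To make the latency argument concrete, here is a minimal arithmetic sketch. The 200ms and 5ms network figures are taken from the ranges above; the 40ms inference time is an assumption chosen for illustration, not a measurement from any provider.

```python
# Illustrative latency arithmetic. Every number here is an assumption,
# not a benchmark of any real platform.

def total_latency_ms(network_rtt_ms: float, inference_ms: float) -> float:
    """End-to-end latency for one request: network round trip plus model time."""
    return network_rtt_ms + inference_ms

# Assumed: a small model takes 40 ms to respond in both setups.
INFERENCE_MS = 40.0
CENTRAL_RTT_MS = 200.0  # within the 150-300 ms range quoted above
EDGE_RTT_MS = 5.0

centralized = total_latency_ms(CENTRAL_RTT_MS, INFERENCE_MS)
edge = total_latency_ms(EDGE_RTT_MS, INFERENCE_MS)
network_share = CENTRAL_RTT_MS / centralized

print(f"centralized: {centralized:.0f} ms, edge: {edge:.0f} ms")
print(f"network share of centralized latency: {network_share:.0%}")
```

With a fast model the network dominates the total, which is why edge helps. If inference took two seconds instead, saving 195ms of network time would barely register.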

Cost. For small models, edge serving is often cheaper than routing to centralized GPU clusters. Edge providers amortize across many customers; the marginal cost per request is low.

Compliance. Data residency requirements can be met by serving requests in the user's jurisdiction. EU users are served by EU edges, so their data never leaves the region.
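One way to picture the residency rule is a lookup that refuses to guess. This is a sketch only: the country-to-region table, the region names, and `pick_region` are hypothetical and do not correspond to any provider's configuration.

```python
# Hypothetical mapping from ISO country code to a processing region.
# The region names are placeholders, not identifiers from any provider.
REGION_FOR_COUNTRY = {
    "DE": "eu",
    "FR": "eu",
    "IN": "in",
    "US": "us",
}


class ResidencyError(Exception):
    """Raised when no in-jurisdiction region is known for a request."""


def pick_region(country_code: str) -> str:
    """Return the region allowed to process a request from this country.

    Fails closed: an unknown country raises instead of falling back to a
    default region that might sit outside the user's jurisdiction.
    """
    try:
        return REGION_FOR_COUNTRY[country_code.upper()]
    except KeyError:
        raise ResidencyError(f"no residency rule for {country_code!r}") from None


print(pick_region("de"))
```

Failing closed is the design choice worth noting: a silent default region is exactly how data ends up leaving the jurisdiction it was supposed to stay in.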

Platform comparison

Cloudflare Workers AI. Small open-source models (Llama 3.1 8B, Mistral 7B-class) served from 300+ points of presence. Simple API, good pricing, established developer ecosystem.

Vercel Edge Functions. Integrated with Next.js deployments. AI SDK for streaming. Best fit for teams already on Vercel. Good support for combining edge functions with centralized AI APIs.

AWS Lambda@Edge. Tighter AWS integration (IAM, CloudWatch logging). Steeper learning curve. Better for enterprise deployments with AWS-heavy infrastructure.

Fastly Compute. WebAssembly-based, Rust-first. High performance, less AI-specific tooling. Worth considering for teams comfortable with Rust.

Workloads that fit edge

Classification. Is this query about X, Y, or Z? A small model can handle that. Routing happens at the edge; complex queries get forwarded to a centralized model.

Simple extraction. Named entity recognition, structured field extraction, moderation flags. Small models handle these well; edge serving cuts latency for every user globally.

Personalization. User-specific context (preferences, history) is stored at the edge, and a simple personalization model runs there. This avoids round-tripping user data to central servers.

Short responses. Autocomplete, query suggestion, brief translations. Fits edge model capabilities and size budgets.

Workloads that need centralized

Large models. 70B+ parameter models generally don't fit on edge hardware. Frontier reasoning, long-context tasks, and multi-turn conversations with complex state all belong in centralized deployments.

Agent systems. Agents call tools, retrieve data, and coordinate. Shared state argues for a centralized architecture. The edge can be the entry point, but complexity quickly pulls computation back to the center.

Consistent experience across regions. If users from different regions must see identical behavior, edge deployments (which can diverge) aren't the right fit.

Hybrid architectures

Edge as router. The request hits the edge; a small model decides whether to answer locally (fast) or forward to centralized (slower but more capable). The user sees the edge response when possible, the centralized one otherwise.
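The router pattern can be sketched as a function that takes a classifier and two handlers. Everything named here (`route`, `RoutedResponse`, the label set, the 0.8 confidence threshold, and the toy classifier) is an assumption for illustration; a real deployment would call an actual small model at the edge.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass
class RoutedResponse:
    text: str
    served_by: str  # "edge" or "central"


def route(
    query: str,
    classify: Callable[[str], Tuple[str, float]],
    answer_at_edge: Callable[[str], str],
    forward_to_central: Callable[[str], str],
    edge_labels: FrozenSet[str] = frozenset({"faq", "autocomplete"}),
    min_confidence: float = 0.8,
) -> RoutedResponse:
    """Answer locally only when the classifier is confident the query is simple."""
    label, confidence = classify(query)
    if label in edge_labels and confidence >= min_confidence:
        return RoutedResponse(answer_at_edge(query), "edge")
    # Unknown label or low confidence: pay the round trip for the larger model.
    return RoutedResponse(forward_to_central(query), "central")


def toy_classifier(query: str) -> Tuple[str, float]:
    """Stand-in for a small edge model: short queries count as FAQ lookups."""
    if len(query.split()) <= 6:
        return "faq", 0.9
    return "complex", 0.6


def edge_stub(query: str) -> str:
    return f"[edge] {query}"


def central_stub(query: str) -> str:
    return f"[central] {query}"


short = route("What are your hours?", toy_classifier, edge_stub, central_stub)
long_query = "Compare these three contracts and summarize the liability differences"
long = route(long_query, toy_classifier, edge_stub, central_stub)
print(short.served_by, long.served_by)
```

Defaulting to the centralized path on low confidence trades some latency for safety: a wrong "simple" verdict costs answer quality, while a wrong "complex" verdict only costs a round trip.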

Edge for preprocessing. PII redaction, content moderation, and request classification happen at the edge. The clean request is forwarded to centralized. User data is minimized in central logs.
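As a rough illustration of the redaction step, the sketch below masks emails and phone-like numbers with regular expressions before a request is forwarded. The patterns and the `redact` helper are assumptions made for this example; regex alone misses names, addresses, and many phone formats, so a production system would likely pair it with an entity-recognition model.

```python
import re

# Deliberately simple patterns for illustration, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


raw = "Contact jane@example.com or +1 415-555-0100 today"
print(redact(raw))
```

Only the redacted string would be forwarded, so the original values never reach central logs.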

Edge for caching. Semantic or exact-match cache at edge; cache miss forwards to centralized. Hot queries served from edge; cold queries pay full roundtrip.
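The exact-match variant of this cache is small enough to sketch. The `EdgeCache` class, the normalization rule (lowercase, collapsed whitespace), and the 300-second TTL are assumptions for illustration; a real edge platform would back this with its own key-value store rather than an in-process dict, and a semantic cache would compare embeddings instead of hashes.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class EdgeCache:
    """Exact-match cache keyed on a normalized query, with a fixed TTL."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share one entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def put(self, query: str, value: str) -> None:
        self._store[self._key(query)] = (self._clock() + self._ttl, value)


def answer(
    query: str, cache: EdgeCache, forward_to_central: Callable[[str], str]
) -> Tuple[str, str]:
    """Serve from the edge cache when possible, otherwise pay the full round trip."""
    cached = cache.get(query)
    if cached is not None:
        return cached, "edge-cache"
    fresh = forward_to_central(query)
    cache.put(query, fresh)
    return fresh, "central"
```

A first call to `answer` goes to the centralized model and populates the cache; a repeat of the same query, even with different casing or spacing, is then served at the edge until the entry expires.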

Challenges

Model distribution. Edge platforms have growing but limited model catalogs. If your preferred model isn't available, you can't use that platform. Check availability before committing.

Cost surprises. Per-request pricing at edge can add up if you're making many calls. Model carefully; compare to centralized pricing at your expected volume.
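A quick way to model this is a break-even calculation between per-request edge pricing and a centralized setup with a fixed monthly cost. All prices below are made-up placeholders, not quotes from any provider; substitute your own numbers.

```python
from typing import Optional


def monthly_cost(requests: int, per_request_usd: float, fixed_usd: float = 0.0) -> float:
    """Total monthly spend for a given request volume."""
    return fixed_usd + requests * per_request_usd


def break_even_requests(
    edge_per_request_usd: float,
    central_fixed_usd: float,
    central_per_request_usd: float,
) -> Optional[float]:
    """Monthly volume above which centralized becomes cheaper than edge.

    Returns None when edge is cheaper per request, since with no fixed cost
    on the edge side it then wins at every volume.
    """
    delta = edge_per_request_usd - central_per_request_usd
    if delta <= 0:
        return None
    return central_fixed_usd / delta


# Placeholder prices, assumed purely for illustration.
EDGE_PER_REQUEST = 0.0002
CENTRAL_FIXED = 1500.0
CENTRAL_PER_REQUEST = 0.00005

threshold = break_even_requests(EDGE_PER_REQUEST, CENTRAL_FIXED, CENTRAL_PER_REQUEST)
for volume in (1_000_000, 50_000_000):
    edge_cost = monthly_cost(volume, EDGE_PER_REQUEST)
    central_cost = monthly_cost(volume, CENTRAL_PER_REQUEST, CENTRAL_FIXED)
    print(f"{volume:>11,} req/month: edge ${edge_cost:,.0f}, central ${central_cost:,.0f}")
print(f"break-even at roughly {threshold:,.0f} requests/month")
```

Under these assumed prices edge is cheaper at low volume and centralized is cheaper past the break-even point, which is the comparison to run against your expected traffic before committing.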

Debugging across regions. A bug that only appears in one edge region is harder to debug than a centralized bug. Good logging and tracing are essential. See the observability post.
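A minimal sketch of what region-aware logging might look like: every record carries the region and a request ID, so a single-region failure shows up when the logs are grouped. The field names, the region codes, and both helpers are hypothetical, not any platform's logging API.

```python
import json
import time
from collections import Counter
from typing import Iterable


def log_record(region: str, request_id: str, event: str, **fields: object) -> str:
    """Serialize one structured log line tagged with the serving region."""
    record = {
        "ts": time.time(),
        "region": region,
        "request_id": request_id,
        "event": event,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def errors_by_region(lines: Iterable[str]) -> Counter:
    """Count error events per region from structured log lines."""
    counts: Counter = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("event") == "error":
            counts[record["region"]] += 1
    return counts


# "bom" and "fra" are placeholder region codes for this example.
lines = [
    log_record("bom", "req-1", "ok", latency_ms=42),
    log_record("bom", "req-2", "error", detail="model timeout"),
    log_record("fra", "req-3", "ok", latency_ms=38),
    log_record("bom", "req-4", "error", detail="model timeout"),
]
print(errors_by_region(lines).most_common())
```

Grouping by region is the first cut that separates "the model is broken" from "one point of presence is broken".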

When edge is the right choice

Global user base with latency sensitivity. Millisecond-level UX matters.

Small-model workloads (classification, extraction, routing) at high volume. Cost savings add up.

Compliance or data residency requirements. Edge-based deployment simplifies regulatory stories.

Not when: centralized managed APIs meet your needs, you're pre-scale, or your workloads require frontier models.

