On-device LLMs ship today on flagship iPhones (Apple Intelligence, MLX), Android (Gemini Nano, AICore), and any device that can run llama.cpp. The privacy story is compelling; the performance on modern hardware is genuinely usable for specific workloads. This post covers the platforms, the workloads that fit, and the engineering patterns for shipping on-device AI in production apps.
Why on-device
Privacy. For requests handled locally, user data never leaves the device. In medical, personal, financial, and legal apps, where users care deeply about privacy, this is a meaningful differentiator.
Offline. Apps work without network. Airplane, underground, spotty rural connection — AI features keep working.
Latency. No network round-trip. Short-output tasks the on-device model can handle, such as keyboard predictions, autocomplete, and classification, can respond in under 100ms, which feels instant.
Cost. You're not paying inference costs — the user's device does the work. At scale, this shifts your cost structure dramatically.
iOS stack
MLX. Apple's open-source ML framework for Apple Silicon. Fast on M-series chips, and usable on recent iPhones through MLX Swift. Growing community, good documentation. Our default for custom iOS AI.
Apple Intelligence. System-level AI on supported devices, exposed through APIs (Writing Tools, Image Playground, Genmoji). Where the user has it turned on, integrate via these APIs rather than running your own model.
Core ML. Apple's inference runtime. Runs models converted to the Core ML format (for example with coremltools). Mature but less popular than MLX for LLMs specifically.
llama.cpp with Metal backend. Cross-platform option that works on iOS. Good when you need a specific model not supported natively.
Android stack
Gemini Nano. Google's on-device model, available on Pixel 8+ and a growing range of devices. Accessed via AICore APIs (Android 14+). Summarization, smart reply, proofreading built in.
MediaPipe LLM Inference. Google's framework for running LLMs on Android. Supports Gemma, Llama, and others. Good for custom models where Gemini Nano isn't sufficient.
llama.cpp via NDK. Runs on Android with native code. Good for cross-platform apps that want the same model on iOS and Android.
Cross-platform
llama.cpp. Works on iOS, Android, desktop, server. One codebase and one model format (GGUF) across all of them. Trade-off: can be somewhat slower than platform-native runtimes.
MLC-LLM. Machine Learning Compilation for LLMs. Browser (WebGPU) and native targets. Good for web apps needing local AI.
ONNX Runtime. Multi-framework support. Mature ecosystem. Good when you have models from varied training frameworks.
WebLLM. Browser-based, WebGPU-accelerated. Entirely client-side AI in the browser. Maximum privacy, with some performance trade-offs.
Workloads that fit on-device
Text prediction and autocomplete. Keyboard, writing apps, note-taking. Sub-100ms response, privacy-critical, bounded task.
Smart reply. Email, messaging. Suggests short responses based on message context. Fits small-model capability.
Classification and routing. User query, email, document — on-device classifier decides whether to handle locally or route to cloud.
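The routing decision can be sketched as a small pure function. This is a minimal illustration, not a production classifier: the task labels, the 4096-token local budget, and the four-characters-per-token estimate are all assumptions made for the example, and a real app would likely replace the heuristics with a small on-device classifier.

```python
from dataclasses import dataclass

# Hypothetical task labels and limits, chosen only for illustration.
LOCAL_TASKS = {"autocomplete", "smart_reply", "classification", "summarization"}
LOCAL_CONTEXT_TOKENS = 4096  # assumed context budget for the on-device model


@dataclass
class Request:
    task: str
    text: str


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def route(request: Request) -> str:
    """Return "local" or "cloud" for a request."""
    if request.task not in LOCAL_TASKS:
        return "cloud"  # e.g. coding or research questions
    if estimate_tokens(request.text) > LOCAL_CONTEXT_TOKENS:
        return "cloud"  # input would not fit the assumed local budget
    return "local"
```

Keeping the decision in one function makes it easy to log routing outcomes and tune the thresholds against real traffic.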
Summarization of user content. Notes, documents, emails — user's own data, summarized locally without uploading.
Transcription and translation. Whisper-class models run locally; user voice data is never transmitted.
Workloads that don't fit
Complex reasoning. Models that fit on phones (3-8B parameters) don't match frontier model capabilities for hard tasks. Complex Q&A, coding, research — still cloud.
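Long context. The key-value cache grows linearly with context length and competes with the model weights for a limited memory budget, so on-device context windows stay short. Long-document tasks still need cloud.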
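To see why context length is a memory problem, it helps to estimate the key-value cache that a transformer keeps per token. The sketch below uses the standard formula; the model shape (32 layers, 8 key/value heads, head dimension 128, 16-bit cache values) is an assumption, roughly what an 8B model with grouped-query attention uses.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for a given context length."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 1024 ** 3

# Assumed model shape; substitute the real values for your model.
for ctx in (2048, 8192, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    print(f"{ctx:>6} tokens -> {size / GIB:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so 2K, 8K, and 32K contexts need 0.25, 1, and 4 GiB on top of the weights.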
Multimodal at scale. Image + text + audio understanding is improving on-device but still lags cloud options significantly.
Engineering patterns
Hybrid architecture. On-device handles 60-80% of requests; cloud handles the hard ones. User sees instant responses for most interactions, acceptable latency for the rest.
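One way to sketch the hybrid pattern is a thin client that tries the local model first and escalates only when the local answer fails or is rejected. Everything here is hypothetical scaffolding: `local_fn`, `cloud_fn`, and `accept` stand in for whatever on-device runtime, cloud API, and quality check an app actually uses.

```python
from typing import Callable, Optional


class HybridClient:
    """Try the on-device model first; escalate to the cloud when needed."""

    def __init__(self,
                 local_fn: Callable[[str], str],
                 cloud_fn: Callable[[str], str],
                 accept: Callable[[str], bool]) -> None:
        self.local_fn = local_fn  # hypothetical on-device call
        self.cloud_fn = cloud_fn  # hypothetical cloud call
        self.accept = accept      # True if a local answer is usable
        self.local_count = 0
        self.cloud_count = 0

    def answer(self, prompt: str) -> str:
        result: Optional[str]
        try:
            result = self.local_fn(prompt)
        except Exception:
            result = None  # model missing, out of memory, and so on
        if result is not None and self.accept(result):
            self.local_count += 1
            return result
        self.cloud_count += 1
        return self.cloud_fn(prompt)

    def local_share(self) -> float:
        """Fraction of requests answered on-device so far."""
        total = self.local_count + self.cloud_count
        return self.local_count / total if total else 0.0
```

Tracking `local_share()` gives a direct measurement of how much traffic the device is absorbing. Escalation sends the prompt off-device, so privacy-sensitive features may want a consent check before the cloud call.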
Model size trade-offs. Smaller models load faster, use less memory, drain less battery. Larger models are more capable. Typical sweet spot: 3-8B parameter models quantized to 4-bit. See quantization post.
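The size side of this trade-off is simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. A quick sketch of that calculation, as a lower bound only (real quantized files run slightly larger because of scale factors and layers kept at higher precision):

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone."""
    return n_params * bits_per_weight / 8


GB = 1e9

for params in (3e9, 8e9):
    for bits in (16, 8, 4):
        size_gb = weight_bytes(params, bits) / GB
        print(f"{params / 1e9:.0f}B @ {bits}-bit -> {size_gb:.1f} GB")
```

At 4-bit, a 3B model needs about 1.5 GB and an 8B model about 4 GB for weights, before counting the KV cache and the rest of the app.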
Model updates. Ship new model weights either through app updates (slow, versioned with the binary) or as on-demand downloads (faster iteration, more network use). The right pattern depends on model stability and update cadence.
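For the on-demand download path, a common safeguard is a small manifest that carries a version number and a checksum, so the app can decide whether to download and can reject corrupted weights. The sketch below assumes a hypothetical manifest with just those two fields; the field names and versioning scheme are illustrative, not a standard.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Manifest:
    """Hypothetical manifest published alongside the model weights."""
    version: int
    sha256: str


def needs_update(installed_version: int, manifest: Manifest) -> bool:
    """Download only when the server advertises a newer version."""
    return manifest.version > installed_version


def verify_weights(data: bytes, manifest: Manifest) -> bool:
    """Check downloaded bytes against the manifest checksum.

    Real weight files are large; hash them in chunks from disk
    instead of loading the whole file into memory as done here.
    """
    return hashlib.sha256(data).hexdigest() == manifest.sha256
```

Swapping in the new weights only after verification succeeds keeps a failed or partial download from breaking the currently installed model.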
Battery impact. On-device inference burns battery faster than network calls. Measure on actual devices, and warn users during extended use.
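A simple way to decide when to warn is to track cumulative inference time per session and flag the moment it crosses a threshold. This is a rough sketch: inference time is only a proxy for energy use, and the threshold is a hypothetical value that should come from profiling on real devices.

```python
class InferenceBudget:
    """Track cumulative inference time and flag when to warn the user."""

    def __init__(self, warn_after_seconds: float) -> None:
        self.warn_after_seconds = warn_after_seconds  # hypothetical threshold
        self.used_seconds = 0.0
        self.warned = False

    def record(self, seconds: float) -> bool:
        """Add one inference run; return True the first time the budget is crossed."""
        self.used_seconds += seconds
        if not self.warned and self.used_seconds >= self.warn_after_seconds:
            self.warned = True
            return True
        return False
```

The caller times each inference run, passes the duration to `record`, and shows a one-time notice when it returns True.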