What Is On-Device AI — and Why It Matters

On-device AI runs artificial intelligence models locally on physical hardware — a smartphone, laptop, wearable, or embedded chip — instead of sending data to remote cloud servers. When a feature is powered by on-device AI, your input never leaves your device; all computation happens on the hardware you hold.

This contrasts with cloud AI, where your query travels over the internet to a data center, is processed on large servers, and comes back as a result. Cloud models can be enormously capable, but they require a network connection, add latency, and send your data off the device to infrastructure someone else controls.

How On-Device AI Works

Fitting capable AI onto a smartphone requires compression techniques that shrink large models without gutting their accuracy.

Quantization converts a model’s numerical weights from high-precision 32-bit floating point to lower-precision integers such as INT8 or INT4, reducing memory use by 4x (INT8) to 8x (INT4), often with little accuracy loss. Techniques like AWQ (Activation-Aware Weight Quantization) help preserve accuracy at 4-bit precision by protecting the small fraction of weights that matter most to the model’s activations.
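As a rough illustration of the float-to-integer mapping, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The weight values are made up, and the function names are hypothetical. Production toolchains typically use per-channel or per-group scales and calibration data, so treat this as a sketch of the idea rather than how any particular framework does it.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 if every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.82, -1.27, 0.05, 0.4, -0.66]  # made-up example values
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q_weights)   # small integers that fit in one byte each
print(max_error)   # rounding error, bounded by half the scale
```

Each weight now occupies one byte instead of four, and the only extra state is the single `scale` value needed to map the integers back to floats.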

Pruning removes redundant weights that contribute little to a model’s output, sometimes cutting 70-90% of parameters while retaining most accuracy.
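One common way to decide which weights are "redundant" is magnitude pruning: the weights closest to zero are assumed to matter least and are set to zero. The sketch below is one simple variant under that assumption, with made-up weights and a hypothetical function name. Real pipelines usually prune gradually and fine-tune afterward to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights; `sparsity` is the fraction removed."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = round(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Made-up example: a few large weights and many near-zero ones.
weights = [0.9, -0.02, 0.01, -0.75, 0.03, 0.005, -0.04, 0.6, 0.015, -0.001]
pruned = magnitude_prune(weights, sparsity=0.7)
print(pruned)
```

With 70% sparsity, only the three largest-magnitude weights survive. Note that the zeros save memory and compute only when the runtime stores and executes the model in a sparse-aware format.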

Knowledge distillation trains a compact student model to mimic a larger teacher, transferring capability without matching parameter count.
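To make "mimic" concrete: a widely used formulation has the student match the teacher’s softened output probabilities, measured with a KL divergence. The sketch below assumes that formulation, with made-up logits and a temperature of 2.0 chosen only for illustration. In practice this term is usually combined with an ordinary loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.5, 0.2]        # made-up logits for one input
close_student = [3.8, 1.6, 0.3]  # nearly agrees with the teacher
far_student = [0.2, 1.5, 4.0]    # ranks the classes in reverse

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is near zero when the student’s distribution matches the teacher’s and grows as they diverge, which is the signal training uses to pull the smaller model toward the larger one’s behavior.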

The resulting models run on dedicated silicon called Neural Processing Units (NPUs) — chips designed specifically for the matrix arithmetic that underlies AI inference. Apple’s Neural Engine in the A18 chip delivers 35 tera-operations per second (TOPS). The Qualcomm Hexagon NPU in the Snapdragon 8 Elite delivers 45 TOPS and runs AI workloads up to 9x more efficiently than the same chip’s CPU.

Why It Matters

Privacy: Data never leaves the device, which makes on-device AI the natural choice for sensitive contexts such as health records, personal messages, and financial data. Apple describes Apple Intelligence as being aware of your personal information without collecting your personal information.

Speed: On-device inference typically takes 10-15 ms, while a cloud round-trip averages 50-400 ms. Real-time applications such as live voice translation, augmented-reality overlays, and instant autocorrect demand on-device speeds.
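A quick way to see why these numbers matter is to compare them with a per-frame time budget. The sketch below assumes a 60 frames-per-second target, which is an illustrative choice rather than a figure from any vendor, and uses the latency ranges quoted above.

```python
def frame_budget_ms(frames_per_second):
    """Time available per frame, in milliseconds."""
    return 1000.0 / frames_per_second


def fits_budget(latency_ms, budget_ms):
    """True if one inference finishes within the per-frame budget."""
    return latency_ms <= budget_ms


budget = frame_budget_ms(60)  # assumed 60 fps target, about 16.7 ms per frame

# Latency ranges quoted in the text, in milliseconds.
on_device_ms = (10, 15)
cloud_ms = (50, 400)

print(round(budget, 1))                                       # 16.7
print(all(fits_budget(ms, budget) for ms in on_device_ms))    # True
print(any(fits_budget(ms, budget) for ms in cloud_ms))        # False
```

Under that assumption, both ends of the on-device range fit inside a single frame, while even the fastest cloud round-trip spans several frames.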

Offline access: No network connection is needed. On-device AI features work on a plane, in a basement, or anywhere connectivity is poor.

Cost: After a model is downloaded, each inference costs nothing beyond battery power — far cheaper than metered cloud API calls at scale.
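The "at scale" part is simple arithmetic. The sketch below totals the metered cost of a batch of cloud calls. Every input, including the price per million tokens, is a hypothetical placeholder and not a quoted rate from any provider.

```python
def cloud_cost_usd(requests, tokens_per_request, usd_per_million_tokens):
    """Total metered cost for a batch of cloud API calls."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1_000_000 * usd_per_million_tokens


# All three inputs are hypothetical placeholders, not quoted prices.
monthly = cloud_cost_usd(
    requests=1_000_000,
    tokens_per_request=500,
    usd_per_million_tokens=1.0,
)
print(monthly)  # 500.0
```

The cloud bill grows linearly with request volume, whereas the on-device marginal cost stays at battery power regardless of how many inferences run.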

What’s Already in Your Pocket

Several consumer devices now ship with meaningful on-device AI at no extra charge:

Apple Intelligence (iPhone 15 Pro and 15 Pro Max, iPhone 16 and later, iPad with M1 or later chip, Mac with M1 or later chip): writing tools, priority notifications, Live Translation, and AI photo editing, all free with iOS 26 and macOS Tahoe 26. Heavier tasks offload to Apple’s Private Cloud Compute, its own end-to-end encrypted server infrastructure.
Google Gemini Nano (Pixel 9, Pixel 10): Google’s smallest Gemini model runs entirely on-device, powering Scam Detection, Call Notes, and real-time voice translation. The Pixel 10’s Tensor G5 chip runs Gemini Nano 2.6x faster than the previous generation at half the energy.
Samsung Galaxy AI (Galaxy S25 series, Snapdragon 8 Elite): on-device inference reaches 92 tokens per second for a 350M-parameter model on the Hexagon NPU. Core features are free.

What Developers Can Adopt

The lowest-barrier entry point for developers today is Liquid AI’s LFM2.5-230M, a 230-million-parameter small language model with open weights on Hugging Face (free to download). Despite its tiny size, LFM2.5-230M outperforms models up to 4x larger on data extraction and tool-calling tasks. It runs at 213 tokens per second on a Galaxy S25 Ultra CPU and at 42 tokens per second on a Raspberry Pi 5 — no GPU required. It integrates with llama.cpp, MLX, ONNX Runtime, LM Studio, and vLLM.
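To translate tokens per second into something a user would feel, the sketch below estimates how long a reply takes at the two decode rates quoted above. The 200-token reply length is an assumption chosen only for illustration, and the estimate ignores the time spent processing the prompt.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate `num_tokens` at a steady decode rate."""
    return num_tokens / tokens_per_second


reply_tokens = 200  # assumed reply length, chosen only for illustration

# Decode rates quoted in the text for LFM2.5-230M.
rates = {"Galaxy S25 Ultra CPU": 213, "Raspberry Pi 5": 42}

for device, rate in rates.items():
    print(f"{device}: {generation_seconds(reply_tokens, rate):.2f} s")
```

At those rates, a reply of that length would take a little under one second on the phone and just under five seconds on the Raspberry Pi 5.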

Other notable small language models in the accessible range include Meta’s Llama 3.2-1B, Google’s Gemma 3, and Microsoft’s Phi-4.

In the News

Liquid AI’s release of LFM2.5-230M — covered in today’s news brief — illustrates the direction of travel: useful AI that runs on a Raspberry Pi, outperforming rivals four times its size.

FAQ

Can on-device AI match cloud AI quality?
For everyday tasks — summarization, translation, autocorrect, smart replies — on-device models perform well. For complex reasoning, coding, or long-form generation, cloud-scale models still hold a clear advantage.

Which devices support on-device AI today?
The iPhone 15 Pro and 15 Pro Max, the iPhone 16 series, Google Pixel 9 and 10, and Samsung Galaxy S25 all include dedicated NPU hardware. Most consumer features are free and built into the device software.

What is a small language model (SLM)?
A small language model is an AI language model designed to run on limited hardware. There is no fixed parameter threshold; in practice, models under roughly 7-10 billion parameters that fit on a single consumer device are called SLMs.
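A back-of-the-envelope estimate shows why parameter count decides what fits on a device: weight storage is roughly parameters times bits per weight. The sketch below covers weights only and ignores activations, caches, and runtime overhead, so real memory use will be higher.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


# 230M matches LFM2.5-230M; 7B is the lower end of the rough SLM ceiling.
for params in (230e6, 7e9):
    for bits in (32, 8, 4):
        print(f"{params / 1e9:g}B params at {bits}-bit: "
              f"{weight_memory_gb(params, bits):g} GB")
```

By this estimate a 7-billion-parameter model needs about 28 GB at 32-bit precision but about 3.5 GB at 4-bit, which is the difference between not fitting on a phone at all and fitting alongside other apps.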

Is on-device AI free to use?
Consumer features such as Apple Intelligence, Gemini Nano on Pixel, and Samsung Galaxy AI are free on supported devices. Open-weight models like LFM2.5-230M are free to download and run locally. Cloud AI APIs, by contrast, typically charge per token used.