Anemll engineers ran a 400‑billion‑parameter AI model on an iPhone 17 Pro, streaming weights directly from the device's internal flash storage at 0.6 tokens per second. The team bypassed the phone's 12 GB RAM limit by reading model data on demand. The proof of concept shows that massive language models can operate on consumer smartphones without a cloud connection. Performance remains too slow for real‑time chat, yet the technique opens a path toward privacy‑first AI apps that keep sensitive data on device.
What's new: Flash‑MoE, a mixture‑of‑experts architecture, activates only a tiny subset of the model's 400 billion parameters for each token. That selective approach cuts the active memory demand. Flash storage feeds weights to the GPU in slices, turning storage bandwidth into the main bottleneck. At 0.6 tokens per second, each token takes roughly 1.7 seconds to generate, a pace that rules out fluid conversation but suits offline drafting or summarization tasks.
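The selective activation described above can be illustrated with a minimal pure‑Python sketch of mixture‑of‑experts routing. Everything here (the function name, the gating scheme, the sizes) is a generic top‑k MoE illustration, not Anemll's actual implementation:

```python
import math
import random

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through only its top-k experts (illustrative)."""
    # Router: score every expert, but keep only the k highest scores.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in gate_w]
    top = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    # Softmax over the selected experts only.
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    mix = [e / total for e in exps]
    # Only these k experts' weight matrices ever need to be in memory;
    # the rest of the model can stay on flash storage.
    out = [0.0] * len(x)
    for gate, i in zip(mix, top):
        for r, row in enumerate(experts[i]):
            out[r] += gate * sum(xi * wij for xi, wij in zip(x, row))
    return out

# Toy demo: 4 experts, embedding dimension 8; only 2 experts run per token.
random.seed(0)
d, n = 8, 4
x = [random.gauss(0, 1) for _ in range(d)]
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
experts = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n)]
y = moe_forward(x, gate_w, experts, k=2)
print(len(y))  # 8
```

With k=2 of 4 experts active, only half the expert weights are touched per token; at 400B‑parameter scale the same principle keeps the resident working set far below total model size.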
How it works: A token is the smallest unit of text the model processes. Conventional inference loads the entire model into RAM, which would require over 200 GB here. Anemll's method reads only the active parameters from flash storage, so the phone holds just a fraction of the total weights in memory at any moment.
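The on‑demand read pattern can be sketched with Python's standard `mmap` module: the weights file is memory‑mapped, and only the byte range for the currently needed expert is pulled off storage. The file layout, sizes, and `load_expert` helper are hypothetical, chosen purely for illustration:

```python
import mmap
import os
import struct
import tempfile

# Build a toy "weights file": 4 experts, 3 float32 values each,
# laid out contiguously (layout is an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
n_experts, dim = 4, 3
with open(path, "wb") as f:
    for e in range(n_experts):
        f.write(struct.pack(f"{dim}f", *([float(e)] * dim)))

def load_expert(mm, expert_id, dim):
    """Read one expert's slice on demand; everything else stays on storage."""
    size = dim * 4  # float32 is 4 bytes
    off = expert_id * size
    return list(struct.unpack(f"{dim}f", mm[off:off + size]))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(load_expert(mm, 2, dim))  # [2.0, 2.0, 2.0]
    mm.close()
```

Memory‑mapping lets the OS page in only the slices that are actually touched, which is why active memory stays small while storage bandwidth becomes the limiting factor.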
Trade‑offs in battery and hardware wear: Sustained inference drains the battery significantly. Flash read traffic climbs sharply under continuous use; reads wear NAND far less than writes do, but heavy sustained workloads could still affect device longevity over time. The phone stays cool during short sessions, yet extended generation heats the chassis. Developers will need to balance power draw against user expectations for always‑on AI features.
Market momentum: Edge AI chip shipments reached 44.2% of all AI chip volume in 2025, according to Mordor Intelligence. Meanwhile, Pew Research found that 85% of U.S. consumers expect stronger privacy protections, reinforcing demand for on‑device inference.
What it means for developers: Prototyping privacy‑first apps becomes practical. Medical note‑taking, legal document review, and personal journaling can now run entirely offline. Real‑time chat remains out of reach until storage speeds double or model architectures shrink further. The technique may also reduce cloud costs for enterprises willing to trade latency for data sovereignty.
Will storage‑streamed models enable everyday privacy‑first AI on smartphones, or will cloud‑based inference continue to dominate for speed‑sensitive tasks?

















