oMLX: The Local LLM Server That Actually Works on Apple Silicon

If you've tried running LLMs locally on a Mac, you've probably hit the same wall: Ollama works but caches are slow across requests. llama.cpp is fast but you wire up the rest yourself. vLLM doesn't really support Apple Silicon. LM Studio is pretty but a closed box.

oMLX is an open-source attempt at fixing this for Apple Silicon specifically. The project is by jundot (Hugging Face), and its central claim is that tiered KV cache — hot in-memory, cold SSD — is what makes local LLMs usable for real coding work, not just toy demos.

The KV cache problem

When an LLM generates a long response, it stores attention keys and values (the "KV cache") in GPU memory. For long contexts (50K+ tokens), the cache can be larger than the model itself. On a 64GB Mac, that means a single 32B model conversation can exhaust your unified memory.

The standard fix is to recompute the cache per request, which is slow. oMLX's approach: keep recently-used cache blocks in RAM, and spill older blocks to SSD. When the model needs an evicted block, it's reloaded from disk instead of recomputed. For a long-running Claude Code session, this means a 60% reduction in time-to-first-token after the cache warms.

According to the README, the project supports text LLMs, vision-language models, OCR models, embeddings, and rerankers — all on the same MLX-based stack. A built-in chat UI runs at http://localhost:8000/admin/chat. The macOS app also installs a ~/.omlx/bin/omlx CLI shim so terminal apps and Apple Shortcuts can control the server.

Why this matters now

Three trends are converging:

MLX has matured. Apple's MLX framework hit a stable API in 2024 and is now the default path for new research code targeting Apple Silicon. The oMLX project bets that MLX — not llama.cpp's Metal port — is the right long-term substrate.
Model sizes for "good enough" assistants are 32-70B. A 32B 4-bit quant fits in 24GB. A 70B 4-bit needs 40GB. These are the practical sizes for coding work in 2026, and they live or die on context length — exactly the case oMLX is built for.
Local coding agents want sustained sessions. When you run Claude Code or Goose against a local model, you're not doing one-shot prompts. You want a 2-hour coding session where the model keeps the conversation, files, and tool outputs in working memory.

An article in Medium called tiered KV cache "the missing piece in Apple Silicon LLM inference" — and oMLX is currently the only project shipping this as a turnkey server.

How it compares

Project	Apple Silicon	KV cache tiering	Open source	Coding agent ready
oMLX	Yes (MLX-native)	Yes (RAM + SSD)	Yes (Apache 2.0)	Yes
Ollama	Yes (Metal)	No (recompute)	Yes (MIT)	Limited
LM Studio	Yes	Partial (RAM only)	No (closed source)	No
llama.cpp server	Yes	No (recompute)	Yes (MIT)	Manual setup
vLLM	No	Yes (PagedAttention)	Yes (Apache 2.0)	Yes (NVIDIA only)

The honest read: oMLX is the most Apple Silicon-native of these projects and the only one with tiered caching. The trade-off is ecosystem maturity — Ollama has the model library and community, oMLX is a single developer shipping a focused product.

Where it falls short

Single-developer risk. jundot is the primary maintainer. If they step away, the project is in trouble. This is true of many early-stage OSS projects but worth flagging.
Model coverage is MLX-only. Models without MLX ports (some new releases) won't work until a port lands.
The chat UI is functional, not polished. If you want a beautiful local chat client, use LM Studio or Jan. oMLX's chat is for testing, not for daily chatting.

When to use it

You have a 32GB+ M-series Mac and you want to run a 32B or 70B model for sustained coding sessions.
You're already using Goose, Aider, or Claude Code against a local LLM and the cold-start tax is killing your flow.
You care about the data not leaving your machine and the convenience of an OpenAI-compatible endpoint beats hand-rolling.

If none of those apply, Ollama is still the right default. If one of them does, oMLX is worth the install.

---

Explore 40+ AI tools on TokenJoy.ai

Real reviews, pricing, and comparisons — updated weekly.

Browse AI Tools →