If you've tried running LLMs locally on a Mac, you've probably hit the same wall: Ollama works, but cache reuse across requests is slow. llama.cpp is fast, but you wire up the rest yourself. vLLM doesn't really support Apple Silicon. LM Studio is polished, but it's a closed box.

oMLX is an open-source attempt at fixing this for Apple Silicon specifically. The project is by jundot (Hugging Face), and its central claim is that a tiered KV cache (hot in memory, cold on SSD) is what makes local LLMs usable for real coding work, not just toy demos.

The KV cache problem

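When an LLM processes a prompt and generates a response, it stores the attention keys and values for every token in the context (the "KV cache") in GPU memory, which on Apple Silicon is the unified memory shared with the CPU. The cache grows linearly with context length, so for long contexts (50K+ tokens) it can be larger than the model itself. On a 64GB Mac, that means a single long conversation with a 32B model can exhaust your unified memory.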
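To get a feel for the numbers, here is a back-of-the-envelope sketch of KV cache size. The layer count, head counts, and head dimension below are illustrative assumptions for a 32B-class model, not figures from oMLX or from any specific checkpoint. The two cases differ only in how many key/value heads the model keeps.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold the keys and values for n_tokens of context."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


GIB = 1024 ** 3

# Hypothetical shapes (assumptions for illustration only), 16-bit cache values.
# Grouped-query attention: only a few KV heads are stored.
gqa = kv_cache_bytes(50_000, n_layers=64, n_kv_heads=8, head_dim=128)
# Full multi-head attention: one KV head per query head.
mha = kv_cache_bytes(50_000, n_layers=64, n_kv_heads=64, head_dim=128)

print(f"GQA, 50K tokens: {gqa / GIB:.1f} GiB")
print(f"MHA, 50K tokens: {mha / GIB:.1f} GiB")
```

The takeaway is that the cache scales linearly with context length and with the number of KV heads, so whether it outgrows the weights depends heavily on the architecture and on how long the session runs.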

The usual fallback is to discard the cache and recompute it on each request, which is slow. oMLX's approach: keep recently used cache blocks in RAM, and spill older blocks to SSD. When the model needs an evicted block, it's reloaded from disk instead of recomputed. For a long-running Claude Code session, this means a 60% reduction in time-to-first-token once the cache is warm.
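The hot/cold idea can be sketched in a few lines. This is a toy illustration, not oMLX's implementation: the class name, the pickle-on-disk format, and the least-recently-used eviction policy are all assumptions made for the example.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TieredBlockCache:
    """Toy two-tier cache: an LRU-ordered hot tier in RAM, a cold tier on disk."""

    def __init__(self, hot_capacity, cold_dir):
        self.hot_capacity = hot_capacity
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)
        self.hot = OrderedDict()  # block_id -> block, least recently used first

    def _cold_path(self, block_id):
        return self.cold_dir / f"{block_id}.pkl"

    def put(self, block_id, block):
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)  # mark as most recently used
        # Spill the least recently used blocks to disk once RAM is full.
        while len(self.hot) > self.hot_capacity:
            old_id, old_block = self.hot.popitem(last=False)
            self._cold_path(old_id).write_bytes(pickle.dumps(old_block))

    def get(self, block_id):
        if block_id in self.hot:  # hot hit: served from RAM
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        path = self._cold_path(block_id)
        if path.exists():  # cold hit: reload from disk, then promote
            block = pickle.loads(path.read_bytes())
            path.unlink()
            self.put(block_id, block)
            return block
        return None  # miss: the caller has to recompute the block


with tempfile.TemporaryDirectory() as tmp:
    cache = TieredBlockCache(hot_capacity=2, cold_dir=tmp)
    for i in range(3):
        cache.put(i, [float(i)] * 4)  # block 0 gets spilled to disk
    print(cache.get(0))  # reloaded from the cold tier instead of recomputed
```

A real server would store tensors rather than Python lists and would need to care about block granularity and disk throughput, but the control flow (hit in RAM, hit on disk, or recompute) is the part that matters here.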

According to the README, the project supports text LLMs, vision-language models, OCR models, embeddings, and rerankers — all on the same MLX-based stack. A built-in chat UI runs at http://localhost:8000/admin/chat. The macOS app also installs a ~/.omlx/bin/omlx CLI shim so terminal apps and Apple Shortcuts can control the server.

Why this matters now

Three trends are converging:

  1. MLX has matured. Apple's MLX framework hit a stable API in 2024 and is now the default path for new research code targeting Apple Silicon. The oMLX project bets that MLX — not llama.cpp's Metal port — is the right long-term substrate.
  2. Model sizes for "good enough" assistants are 32-70B. A 32B 4-bit quant fits in 24GB. A 70B 4-bit needs 40GB. These are the practical sizes for coding work in 2026, and they live or die on context length — exactly the case oMLX is built for.
  3. Local coding agents want sustained sessions. When you run Claude Code or Goose against a local model, you're not doing one-shot prompts. You want a 2-hour coding session where the model keeps the conversation, files, and tool outputs in working memory.
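The memory figures in point 2 can be sanity-checked with simple arithmetic. The sketch below counts raw weight storage only. The 4.5 bits-per-weight case is an assumption standing in for the scales and metadata that quantized formats store alongside the weights, and neither figure includes the KV cache or runtime overhead.

```python
def weight_bytes(n_params, bits_per_weight):
    """Raw storage for the model weights, ignoring runtime overhead."""
    return n_params * bits_per_weight / 8


GB = 10 ** 9

for name, n_params in [("32B", 32_000_000_000), ("70B", 70_000_000_000)]:
    for bits in (4, 4.5):  # 4.5 is an assumed effective rate, not a measured one
        print(f"{name} at {bits} bits/weight: {weight_bytes(n_params, bits) / GB:.1f} GB")
```

Whatever is left after the weights is the budget for context, which is why the cache strategy decides how long a session can run.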

An article on Medium called tiered KV cache "the missing piece in Apple Silicon LLM inference," and oMLX is currently the only project shipping this as a turnkey server.

How it compares

| Project | Apple Silicon | KV cache tiering | Open source | Coding agent ready |
|---|---|---|---|---|
| oMLX | Yes (MLX-native) | Yes (RAM + SSD) | Yes (Apache 2.0) | Yes |
| Ollama | Yes (Metal) | No (recompute) | Yes (MIT) | Limited |
| LM Studio | Yes | Partial (RAM only) | No (closed source) | No |
| llama.cpp server | Yes | No (recompute) | Yes (MIT) | Manual setup |
| vLLM | No | Yes (PagedAttention) | Yes (Apache 2.0) | Yes (NVIDIA only) |

The honest read: oMLX is the most Apple Silicon-native of these projects and the only one that offers tiered caching on Apple Silicon. The trade-off is ecosystem maturity: Ollama has the model library and the community, while oMLX is a single developer shipping a focused product.

Where it falls short

When to use it

If none of those apply, Ollama is still the right default. If one of them does, oMLX is worth the install.
