On March 30, 2026, Ollama announced it was switching to MLX as its inference engine on Apple Silicon (currently in preview). While this news sparked widespread discussion, most of the focus remained on Ollama itself. What deserves more attention is the target of this switch: Apple’s MLX framework. Understanding what MLX is, what it gets right on Apple Silicon, the state of its ecosystem, and its current limitations is of greater long-term value than the Ollama update itself.
MLX is a machine learning framework designed by Apple specifically for its own silicon. Its core design revolves around the unified memory architecture of Apple Silicon: the CPU and GPU share the same physical memory, so tensors can be passed between them without copies. This eliminates the CPU-to-GPU transfer overhead of traditional frameworks. For LLM inference, where memory bandwidth is the primary bottleneck, this architectural advantage translates directly into performance gains.
The framework has 24.9k stars on GitHub, with its companion inference tool, mlx-lm, holding 4.3k stars. The mlx-community organization on HuggingFace has accumulated 4,316 pre-converted models, ensuring that community-quantized MLX versions of almost all major open-source models are available within days of their release.
In January 2026, Apple published a research paper (Apple ML Research) demonstrating the performance of dedicated Neural Accelerators integrated into each GPU core of the M5 chip when running MLX. For Qwen3-14B-4bit, the time-to-first-token was 4.06x faster than the M4, and token generation was 1.19x faster. This improvement stems from hardware circuits specifically designed for MLX compute patterns, rather than simply higher GPU frequencies or increased bandwidth.
This sends a clear signal: Apple optimized its hardware for MLX during the chip design phase. Since the Neural Accelerator is specifically targeted at MLX compute graphs, the performance advantage of MLX on Apple Silicon will continue to widen with each chip iteration. In contrast, the Metal backend of llama.cpp essentially translates CUDA compute patterns into Metal shaders, incurring an abstraction loss and failing to automatically benefit from this specialized hardware.
At WWDC 2025, Apple established MLX as the preferred framework for LLM inference on Apple Silicon through three dedicated sessions. It also released the Foundation Models framework, a Swift API that lets app developers call the on-device model (approximately 3B parameters) directly. The result is a two-track approach: MLX targets researchers and advanced developers, while Foundation Models serves app developers. Ollama’s March 2026 announcement of its switch to MLX further validates this roadmap.
Multiple independent benchmarks support the advantage of MLX on Apple Silicon during the steady-state token generation phase.
On a Mac mini M4 Pro (64GB) running Qwen3-Coder-30B-A3B (an MoE architecture), MLX achieved approximately 130 tok/s, compared to 43 tok/s for Ollama with the llama.cpp backend—a 3x difference (Reddit r/LocalLLM). On an M4 Max (128GB) running Qwen3.5-35B-A3B, MLX reached 130 tok/s versus Ollama’s 43.5 tok/s (antekapetanovic.com). The MLX advantage is most pronounced on MoE models, reaching up to 3x, while on Dense models, it is around 1.4-1.6x.
An often-overlooked detail: on the same M4 Max, raw llama.cpp (Metal backend) reaches 89.4 tok/s, while Ollama manages only 43.5 tok/s (GitHub #14861); Ollama’s Go wrapper layer costs roughly half the performance. This means the community’s claim that MLX is 3x faster than Ollama is largely an artifact of Ollama’s own architectural overhead, not MLX outclassing llama.cpp outright. MLX’s actual advantage over raw llama.cpp is roughly 1.4-1.8x.
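The decomposition is simple arithmetic over the figures quoted above:

```python
# Decomposing the "MLX is 3x faster than Ollama" claim using the
# M4 Max numbers quoted above (steady-state decode, tok/s).
mlx = 130.0          # MLX
ollama = 43.5        # Ollama (llama.cpp backend behind a Go wrapper)
llama_cpp = 89.4     # raw llama.cpp, Metal backend

ratio_vs_ollama = mlx / ollama         # ~3.0x: the headline number
ratio_vs_llamacpp = mlx / llama_cpp    # ~1.45x: MLX's real edge
wrapper_loss = 1 - ollama / llama_cpp  # ~51%: lost inside Ollama's wrapper

print(f"{ratio_vs_ollama:.2f}x vs Ollama, "
      f"{ratio_vs_llamacpp:.2f}x vs raw llama.cpp, "
      f"{wrapper_loss:.0%} wrapper overhead")
```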
MLX actually lags behind llama.cpp in time-to-first-token (TTFT). Real-world data from a developer on an M1 Max (Reddit r/LocalLLaMA) shows that with a prompt of about 650 tokens, the effective tok/s (combining prefill and decode) for MLX was only 13 tok/s, while GGUF reached 20 tok/s. This is because MLX spent 94% of its time on prefill.
This directly impacts the user experience. If you use a model as a coding agent (long outputs, generating many tokens per request), the prefill overhead is amortized over thousands of decode tokens, allowing the MLX advantage to shine. However, in chat scenarios (short prompts + short replies), where most of the time per request is spent on prefill, the overall speed of MLX may be slower than GGUF.
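The trade-off is easy to make concrete. The sketch below uses invented prefill/decode rates (150/60 tok/s for MLX, 400/40 for GGUF), chosen only to be directionally consistent with the benchmarks above, not measured numbers:

```python
def effective_tps(prompt_toks, output_toks, prefill_tps, decode_tps):
    """Output tokens per second of wall time for one request,
    counting both the prefill wait and the decode phase."""
    total_time = prompt_toks / prefill_tps + output_toks / decode_tps
    return output_toks / total_time

# Hypothetical rates: MLX decodes faster but prefills slower than GGUF.
mlx = dict(prefill_tps=150, decode_tps=60)
gguf = dict(prefill_tps=400, decode_tps=40)

# Chat: short reply, prefill dominates, GGUF comes out ahead.
print(effective_tps(650, 100, **mlx), effective_tps(650, 100, **gguf))
# Coding agent: a long output amortizes the prefill, MLX pulls ahead.
print(effective_tps(650, 4000, **mlx), effective_tps(650, 4000, **gguf))
```

The crossover point depends entirely on the prompt/output ratio, which is why the same two backends can rank differently for chat and agent workloads.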
Memory efficiency is a more consistent advantage for MLX. Qwen3-Coder-30B-A3B occupies 34.7 GB on MLX compared to 40 GB on GGUF, a 13% saving (Reddit r/LocalLLM). Qwen3-235B-A22B uses 124 GB on MLX versus 133 GB on GGUF, a 7% saving (MacStories). The root cause is MLX’s use of unified memory for zero-copy tensor operations, which avoids keeping duplicate copies of data for the CPU and GPU. The difference matters more for large models, where weights dominate the memory footprint.
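A two-line check of the quoted savings:

```python
# Verifying the quoted memory savings (resident size in GB, MLX vs GGUF).
for name, mlx_gb, gguf_gb in [
    ("Qwen3-Coder-30B-A3B", 34.7, 40.0),
    ("Qwen3-235B-A22B", 124.0, 133.0),
]:
    saving = 1 - mlx_gb / gguf_gb
    print(f"{name}: {saving:.0%} less memory under MLX")
```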
An arXiv paper (2511.05502) testing on an M2 Ultra showed that MLX has the most stable decode phase latency. However, community testing on an M3 Ultra (GitHub mlx-lm #763) indicated that at 30K+ context, MLX token generation is about 50% slower than llama.cpp with Flash Attention enabled. The friendliness to long context varies significantly between different models.
The MLX inference ecosystem grew rapidly in early 2026, with several inference servers emerging to address different scenarios.
Apple’s official mlx-lm (GitHub) is the most mature MLX inference tool, supporting both inference and LoRA/QLoRA fine-tuning, with a built-in OpenAI-compatible server mode. Rapid-MLX (GitHub, released 2026-03-23) is positioned as a drop-in replacement for Ollama, benchmarking 2-4.2x faster than Ollama (llama.cpp backend) on an M3 Ultra. vLLM-MLX (GitHub) introduces continuous batching, increasing throughput by 3.4x with 5 concurrent requests. oMLX (GitHub) is specifically optimized for coding agent scenarios, using SSDs for KV cache persistence to compress TTFT for repeated prefixes from 30-90 seconds down to 1-3 seconds. LM Studio’s mlx-engine (GitHub, MIT licensed, though the main app is closed-source) added continuous batching for MLX in version 0.4.2 and supports automatic switching between MLX and GGUF backends.
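The prefix-persistence idea behind oMLX’s SSD-backed KV cache can be illustrated with a toy sketch. Everything below is a hypothetical simplification (a hash key over the token prefix, pickle files standing in for serialized attention tensors), not oMLX’s actual implementation:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Toy model of an SSD-persisted KV cache: key the cached prefill state
# by a hash of the prompt prefix, so a repeated system prompt or
# codebase context can skip prefill on the next request.
CACHE_DIR = Path(tempfile.mkdtemp())

def cache_key(prefix_tokens: list) -> str:
    # Stable key for a token prefix; a real cache would also key on
    # model ID and quantization so states are never mixed.
    return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

def load_kv(prefix_tokens):
    path = CACHE_DIR / cache_key(prefix_tokens)
    return pickle.loads(path.read_bytes()) if path.exists() else None

def save_kv(prefix_tokens, kv_state):
    (CACHE_DIR / cache_key(prefix_tokens)).write_bytes(pickle.dumps(kv_state))

prefix = [1, 2, 3]                  # stand-in for a tokenized system prompt
assert load_kv(prefix) is None      # first request: full prefill needed
save_kv(prefix, {"layers": "placeholder for KV tensors"})
assert load_kv(prefix) is not None  # repeat request: prefill skipped
```

The real saving comes from the fact that loading tensors from SSD is far cheaper than recomputing attention over tens of thousands of prompt tokens, which is exactly the repeated-prefix pattern coding agents produce.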
Ollama’s role in this ecosystem is that of a general LLM tool. Starting with v0.19.0, Ollama automatically routes requests based on model format: GGUF files use llama.cpp (Metal backend), while safetensors files use MLX (Source Code). Its strengths are ease of use and a vast model library, though at the cost of performance loss from the Go wrapper layer. As of v0.19.0, the MLX runner supports six model architectures (Gemma 3, GLM-4 MoE Lite, the full Llama series, Qwen 3, Qwen 3.5, and Qwen 3.5 MoE), which is still early compared to the hundreds of architectures supported by llama.cpp.
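The routing rule is straightforward; here is an illustrative Python sketch (Ollama’s actual implementation is in Go, and the function name here is mine):

```python
from pathlib import Path

def pick_runner(model_path: str) -> str:
    """Choose an inference backend from the on-disk weight format:
    GGUF -> llama.cpp (Metal), safetensors -> MLX."""
    p = Path(model_path)
    names = [p.name] if p.is_file() else [f.name for f in p.iterdir()]
    if any(n.endswith(".gguf") for n in names):
        return "llama.cpp (Metal)"
    if any(n.endswith(".safetensors") for n in names):
        return "mlx"
    raise ValueError(f"no supported weight format under {model_path}")
```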
Key capabilities still missing from the MLX ecosystem include multi-LoRA serving (the only exploration, MOLA (GitHub), is still in alpha), deterministic inference (MLX’s floating-point nondeterminism is a framework-level issue, adityakarnam.com), and cross-platform consistency.
The problems currently facing MLX are partly inherent to the framework and partly due to the early stage of the ecosystem.
At the framework level, prefill performance is the biggest drawback. MLX employs a full prefill strategy (unlike llama.cpp, which has a sliding-window cache), leading to significant overhead in short prompt scenarios. MLX also suffers from floating-point nondeterminism (where changes in batch size can produce significant numerical differences), posing a risk for scenarios requiring deterministic output.
In terms of compatibility, Metal 4 (introduced with macOS 26) has tightened type checking. For example, Ollama users on M5 chips have encountered bfloat/half type mismatches that prevent models from running (#14432), and intermittent GPU-path crashes have been reported on an M4 Pro under macOS 26.2 (#14611). These issues partly stem from the rapid evolution of the MLX-C API (with breaking changes between 0.5.0 and 0.6.0), leaving downstream tools chasing a moving target.
At the ecosystem level, the fragmentation of over eight independent inference servers is typical of an early stage. Each server addresses different bottlenecks (continuous batching, KV cache persistence, hot-swapping LoRAs), and the market will eventually converge on a few dominant solutions. However, the risk of framework lock-in is currently high.
MLX remains primarily focused on Apple Silicon. Although a Linux CUDA backend has been added (pip install mlx[cuda]), it is intended more to let researchers develop MLX models in non-Mac environments than to compete with the native CUDA ecosystem. On NVIDIA GPUs, vLLM on an A100 can achieve several times the throughput of MLX on an M5.
It is highly likely that MLX will become the standard inference framework on Apple Silicon. Apple optimized its hardware for MLX at the chip design stage (M5 Neural Accelerators), provided official backing with three sessions at WWDC 2025, and Ollama’s strategic shift serves as further validation. While the Metal backend of llama.cpp translates CUDA compute patterns into Metal shaders, MLX is designed from the ground up based on the hardware characteristics of Apple Silicon. This gap will only continue to grow with future chip iterations.
However, MLX still has some way to go before it can be relied on without reservation. The framework API is changing rapidly, Metal 4 compatibility issues affect users of the latest hardware, and prefill performance lags llama.cpp in short-conversation scenarios.
For users running local models on Mac, a pragmatic decision framework is as follows: if your primary use case involves long outputs (coding agents, long-form generation), the decode advantage of MLX can be fully realized, and you should consider trying mlx-lm or Rapid-MLX as an inference backend. If your use case involves short conversations or requires broad model compatibility, llama.cpp (either through Ollama or standalone) remains the more stable choice. If you are building a product, you should begin evaluating a migration path to MLX in the medium term, but llama.cpp remains a safer foundation for the short term.
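The framework above, restated as a small lookup (the category names and recommendation strings are mine, mirroring the prose):

```python
def recommend_backend(use_case: str) -> str:
    """Map a local-LLM use case to a backend recommendation,
    following the decision framework described in the text."""
    table = {
        "coding-agent": "MLX (mlx-lm or Rapid-MLX): long outputs amortize prefill",
        "long-form-generation": "MLX (mlx-lm or Rapid-MLX): decode speed dominates",
        "chat": "llama.cpp (Ollama or standalone): prefill-heavy, broad model support",
        "product-short-term": "llama.cpp: safer foundation while MLX's API stabilizes",
        "product-medium-term": "start evaluating a migration path to MLX",
    }
    return table.get(use_case, "benchmark both on your own workload")
```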
Regardless of the path you choose, keep an eye on mlx-lm—it is the officially maintained inference tool from Apple and serves as the most stable benchmark for the entire MLX ecosystem.