On March 30, 2026, Ollama announced it was switching to MLX as its inference engine on Apple Silicon (currently in preview). While this news sparked widespread discussion, most of the focus remained on Ollama itself. What deserves more attention is the target of this switch: Apple’s MLX framework. Understanding what MLX is, what it gets right on Apple Silicon, the state of its ecosystem, and its current limitations is of greater long-term value than the Ollama update itself.
MLX is a machine learning framework designed by Apple specifically for its own silicon. Its core design revolves around the unified memory architecture of Apple Silicon: the CPU and GPU share the same physical memory, so tensors can be passed between them without copies. This eliminates the CPU-to-GPU transfer overhead of traditional frameworks. For LLM inference, where memory bandwidth is the primary bottleneck, this architectural advantage translates directly into performance gains.
The framework has 24.9k stars on GitHub, with its companion inference tool, mlx-lm, holding 4.3k stars. The mlx-community organization on HuggingFace has accumulated 4,316 pre-converted models, ensuring that community-quantized MLX versions of almost all major open-source models are available within days of their release.
In January 2026, Apple published a research paper (Apple ML Research) demonstrating the performance of dedicated Neural Accelerators integrated into each GPU core of the M5 chip when running MLX. For Qwen3-14B-4bit, the time-to-first-token was 4.06x faster than the M4, and token generation was 1.19x faster. This improvement stems from hardware circuits specifically designed for MLX compute patterns, rather than simply higher GPU frequencies or increased bandwidth.
This sends a clear signal: Apple optimized its hardware for MLX during the chip design phase. Since the Neural Accelerator is specifically targeted at MLX compute graphs, the performance advantage of MLX on Apple Silicon will continue to widen with each chip iteration. In contrast, the Metal backend of llama.cpp essentially translates CUDA compute patterns into Metal shaders, incurring an abstraction loss and failing to automatically benefit from this specialized hardware.
At WWDC 2025, Apple established MLX as the preferred framework for LLM inference on Apple Silicon through three dedicated sessions. It also released the Foundation Models framework, a Swift API that lets app developers call the on-device model (approximately 3B parameters) directly. The result is a two-track approach: MLX targets researchers and advanced developers, while Foundation Models serves app developers. Ollama’s March 2026 announcement of its switch to MLX further validates this roadmap.
Multiple independent benchmarks support the advantage of MLX on Apple Silicon during the steady-state token generation phase.
On a Mac mini M4 Pro (64GB) running Qwen3-Coder-30B-A3B (an MoE architecture), MLX achieved approximately 130 tok/s, compared to 43 tok/s for Ollama with the llama.cpp backend—a 3x difference (Reddit r/LocalLLM). On an M4 Max (128GB) running Qwen3.5-35B-A3B, MLX reached 130 tok/s versus Ollama’s 43.5 tok/s (antekapetanovic.com). The MLX advantage is most pronounced on MoE models, reaching up to 3x, while on Dense models, it is around 1.4-1.6x.
An often-overlooked detail: on the same M4 Max, raw llama.cpp (Metal backend) reaches 89.4 tok/s, while Ollama manages only 43.5 tok/s (GitHub #14861); Ollama’s Go wrapper layer costs roughly half the performance. This means the community’s claim that MLX is 3x faster than Ollama is largely an artifact of Ollama’s own architectural overhead, not MLX outclassing llama.cpp outright. MLX’s actual advantage over raw llama.cpp is roughly 1.4-1.8x.
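The decomposition is simple arithmetic over the figures quoted above:

```python
# Decomposing the "MLX is 3x faster than Ollama" claim using the
# M4 Max numbers quoted above (steady-state decode, tok/s).
mlx = 130.0          # MLX
ollama = 43.5        # Ollama (llama.cpp backend behind a Go wrapper)
llama_cpp = 89.4     # raw llama.cpp, Metal backend

ratio_vs_ollama = mlx / ollama         # ~3.0x: the headline number
ratio_vs_llamacpp = mlx / llama_cpp    # ~1.45x: MLX's real edge
wrapper_loss = 1 - ollama / llama_cpp  # ~51%: lost inside Ollama's wrapper

print(f"{ratio_vs_ollama:.2f}x vs Ollama, "
      f"{ratio_vs_llamacpp:.2f}x vs raw llama.cpp, "
      f"{wrapper_loss:.0%} wrapper overhead")
```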
MLX actually lags behind llama.cpp in time-to-first-token (TTFT). Real-world data from a developer on an M1 Max (Reddit r/LocalLLaMA) shows that with a prompt of about 650 tokens, the effective tok/s (combining prefill and decode) for MLX was only 13 tok/s, while GGUF reached 20 tok/s. This is because MLX spent 94% of its time on prefill.
This directly impacts the user experience. If you use a model as a coding agent (long outputs, generating many tokens per request), the prefill overhead is amortized over thousands of decode tokens, allowing the MLX advantage to shine. However, in chat scenarios (short prompts + short replies), where most of the time per request is spent on prefill, the overall speed of MLX may be slower than GGUF.
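The trade-off is easy to make concrete. The sketch below uses invented prefill/decode rates (150/60 tok/s for MLX, 400/40 for GGUF), chosen only to be directionally consistent with the benchmarks above, not measured numbers:

```python
def effective_tps(prompt_toks, output_toks, prefill_tps, decode_tps):
    """Output tokens per second of wall time for one request,
    counting both the prefill wait and the decode phase."""
    total_time = prompt_toks / prefill_tps + output_toks / decode_tps
    return output_toks / total_time

# Hypothetical rates: MLX decodes faster but prefills slower than GGUF.
mlx = dict(prefill_tps=150, decode_tps=60)
gguf = dict(prefill_tps=400, decode_tps=40)

# Chat: short reply, prefill dominates, GGUF comes out ahead.
print(effective_tps(650, 100, **mlx), effective_tps(650, 100, **gguf))
# Coding agent: a long output amortizes the prefill, MLX pulls ahead.
print(effective_tps(650, 4000, **mlx), effective_tps(650, 4000, **gguf))
```

The crossover point depends entirely on the prompt/output ratio, which is why the same two backends can rank differently for chat and agent workloads.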
Memory efficiency is a more consistent advantage for MLX. Qwen3-Coder-30B-A3B occupies 34.7 GB on MLX compared to 40 GB on GGUF, a 13% saving (Reddit r/LocalLLM). Qwen3-235B-A22B uses 124 GB on MLX versus 133 GB on GGUF, a 7% saving (MacStories). The root cause is MLX’s use of unified memory for zero-copy tensor operations, which avoids keeping duplicate copies of data for the CPU and GPU. The difference matters more for large models, where weights dominate the memory footprint.
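A two-line check of the quoted savings:

```python
# Verifying the quoted memory savings (resident size in GB, MLX vs GGUF).
for name, mlx_gb, gguf_gb in [
    ("Qwen3-Coder-30B-A3B", 34.7, 40.0),
    ("Qwen3-235B-A22B", 124.0, 133.0),
]:
    saving = 1 - mlx_gb / gguf_gb
    print(f"{name}: {saving:.0%} less memory under MLX")
```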
An arXiv paper (2511.05502) testing on an M2 Ultra showed that MLX has the most stable decode phase latency. However, community testing on an M3 Ultra (GitHub mlx-lm #763) indicated that at 30K+ context, MLX token generation is about 50% slower than llama.cpp with Flash Attention enabled. The friendliness to long context varies significantly between different models.
The MLX inference ecosystem grew rapidly in early 2026, with several inference servers emerging to address different scenarios.
Apple’s official mlx-lm (GitHub) is the most mature MLX inference tool, supporting both inference and LoRA/QLoRA fine-tuning, with a built-in OpenAI-compatible server mode. Rapid-MLX (GitHub, released 2026-03-23) is positioned as a drop-in replacement for Ollama, benchmarking 2-4.2x faster than Ollama (llama.cpp backend) on an M3 Ultra. vLLM-MLX (GitHub) introduces continuous batching, increasing throughput by 3.4x with 5 concurrent requests. oMLX (GitHub) is specifically optimized for coding agent scenarios, using SSDs for KV cache persistence to compress TTFT for repeated prefixes from 30-90 seconds down to 1-3 seconds. LM Studio’s mlx-engine (GitHub, MIT licensed, though the main app is closed-source) added continuous batching for MLX in version 0.4.2 and supports automatic switching between MLX and GGUF backends.
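The prefix-persistence idea behind oMLX’s SSD-backed KV cache can be illustrated with a toy sketch. Everything below is a hypothetical simplification (a hash key over the token prefix, pickle files standing in for serialized attention tensors), not oMLX’s actual implementation:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Toy model of an SSD-persisted KV cache: key the cached prefill state
# by a hash of the prompt prefix, so a repeated system prompt or
# codebase context can skip prefill on the next request.
CACHE_DIR = Path(tempfile.mkdtemp())

def cache_key(prefix_tokens: list) -> str:
    # Stable key for a token prefix; a real cache would also key on
    # model ID and quantization so states are never mixed.
    return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

def load_kv(prefix_tokens):
    path = CACHE_DIR / cache_key(prefix_tokens)
    return pickle.loads(path.read_bytes()) if path.exists() else None

def save_kv(prefix_tokens, kv_state):
    (CACHE_DIR / cache_key(prefix_tokens)).write_bytes(pickle.dumps(kv_state))

prefix = [1, 2, 3]                  # stand-in for a tokenized system prompt
assert load_kv(prefix) is None      # first request: full prefill needed
save_kv(prefix, {"layers": "placeholder for KV tensors"})
assert load_kv(prefix) is not None  # repeat request: prefill skipped
```

The real saving comes from the fact that loading tensors from SSD is far cheaper than recomputing attention over tens of thousands of prompt tokens, which is exactly the repeated-prefix pattern coding agents produce.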
Ollama’s role in this ecosystem is that of a general LLM tool. Starting with v0.19.0, Ollama automatically routes requests based on model format: GGUF files use llama.cpp (Metal backend), while safetensors files use MLX (Source Code). Its strengths are ease of use and a vast model library, though at the cost of performance loss from the Go wrapper layer. As of v0.19.0, the MLX runner supports six model architectures (Gemma 3, GLM-4 MoE Lite, the full Llama series, Qwen 3, Qwen 3.5, and Qwen 3.5 MoE), which is still early compared to the hundreds of architectures supported by llama.cpp.
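The routing rule is straightforward; here is an illustrative Python sketch (Ollama’s actual implementation is in Go, and the function name here is mine):

```python
from pathlib import Path

def pick_runner(model_path: str) -> str:
    """Choose an inference backend from the on-disk weight format:
    GGUF -> llama.cpp (Metal), safetensors -> MLX."""
    p = Path(model_path)
    names = [p.name] if p.is_file() else [f.name for f in p.iterdir()]
    if any(n.endswith(".gguf") for n in names):
        return "llama.cpp (Metal)"
    if any(n.endswith(".safetensors") for n in names):
        return "mlx"
    raise ValueError(f"no supported weight format under {model_path}")
```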
Key capabilities still missing from the MLX ecosystem include multi-LoRA serving (the only exploration, MOLA (GitHub), is still in alpha), deterministic inference (MLX’s floating-point nondeterminism is a framework-level issue, adityakarnam.com), and cross-platform consistency.
The problems currently facing MLX are partly inherent to the framework and partly due to the early stage of the ecosystem.
At the framework level, prefill performance is the biggest drawback. MLX employs a full prefill strategy (unlike llama.cpp, which has a sliding-window cache), leading to significant overhead in short prompt scenarios. MLX also suffers from floating-point nondeterminism (where changes in batch size can produce significant numerical differences), posing a risk for scenarios requiring deterministic output.
In terms of compatibility, Metal 4 (introduced with macOS 26) has tightened type checking. For example, Ollama users on M5 chips have encountered bfloat/half type mismatches that prevent models from running (#14432), and intermittent GPU-path crashes have been reported on an M4 Pro under macOS 26.2 (#14611). These issues partly stem from the rapid evolution of the MLX-C API (with breaking changes between 0.5.0 and 0.6.0), leaving downstream tools chasing a moving target.
At the ecosystem level, the fragmentation of over eight independent inference servers is typical of an early stage. Each server addresses different bottlenecks (continuous batching, KV cache persistence, hot-swapping LoRAs), and the market will eventually converge on a few dominant solutions. However, the risk of framework lock-in is currently high.
MLX remains primarily focused on Apple Silicon. Although a Linux CUDA backend has been added (pip install mlx[cuda]), it is intended more to let researchers develop MLX models in non-Mac environments than to compete with the native CUDA ecosystem. On NVIDIA GPUs, vLLM on an A100 can achieve several times the throughput of MLX on an M5.
It is highly likely that MLX will become the standard inference framework on Apple Silicon. Apple optimized its hardware for MLX at the chip design stage (M5 Neural Accelerators), provided official backing with three sessions at WWDC 2025, and Ollama’s strategic shift serves as further validation. While the Metal backend of llama.cpp translates CUDA compute patterns into Metal shaders, MLX is designed from the ground up based on the hardware characteristics of Apple Silicon. This gap will only continue to grow with future chip iterations.
However, MLX still has some way to go before it can be relied on without reservation. The framework API is changing rapidly, Metal 4 compatibility issues affect users of the latest hardware, and prefill performance lags llama.cpp in short-conversation scenarios.
For users running local models on Mac, a pragmatic decision framework is as follows: if your primary use case involves long outputs (coding agents, long-form generation), the decode advantage of MLX can be fully realized, and you should consider trying mlx-lm or Rapid-MLX as an inference backend. If your use case involves short conversations or requires broad model compatibility, llama.cpp (either through Ollama or standalone) remains the more stable choice. If you are building a product, you should begin evaluating a migration path to MLX in the medium term, but llama.cpp remains a safer foundation for the short term.
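The framework above, restated as a small lookup (the category names and recommendation strings are mine, mirroring the prose):

```python
def recommend_backend(use_case: str) -> str:
    """Map a local-LLM use case to a backend recommendation,
    following the decision framework described in the text."""
    table = {
        "coding-agent": "MLX (mlx-lm or Rapid-MLX): long outputs amortize prefill",
        "long-form-generation": "MLX (mlx-lm or Rapid-MLX): decode speed dominates",
        "chat": "llama.cpp (Ollama or standalone): prefill-heavy, broad model support",
        "product-short-term": "llama.cpp: safer foundation while MLX's API stabilizes",
        "product-medium-term": "start evaluating a migration path to MLX",
    }
    return table.get(use_case, "benchmark both on your own workload")
```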
Regardless of the path you choose, keep an eye on mlx-lm—it is the officially maintained inference tool from Apple and serves as the most stable benchmark for the entire MLX ecosystem.