8 min read

Run Claude Code With a Local Model on Your Mac

One command to turn your MacBook into a free, private AI coding agent using MLX and open-source models

LLM AI Claude Code MLX Apple Silicon Local LLM
Run Claude Code With a Local Model on Your Mac

Claude Code Is Amazing. But Should It See Your Secrets?

If you’ve used Claude Code, you know it’s one of the best agentic coding tools available. It reads your codebase, edits files, runs tests, and iterates until the job is done. But every token goes through Anthropic’s API — which means your code, your environment variables, and anything the agent touches leaves your machine.

Now imagine telling it: “deploy this to production.” It will run kubectl apply, gh pr merge, terraform apply. It will read your ~/.kube/config, your AWS credentials, your .env files. All of that context goes to an external API. Are you comfortable with that?

What if you could keep the harness — the tool use, the multi-step planning, the error recovery — but swap the brain for a free, local model where nothing leaves your machine?

That’s exactly what I built: claude-code-local-mlx.

One Command, Zero Config

uv tool run --from git+https://github.com/GuillaumeBlanchet/claude-code-local-mlx.git claude-local

That’s it. On first run it:

  1. Clones a lightweight proxy server (claude-code-mlx-proxy)
  2. Downloads the model from HuggingFace (cached after the first time)
  3. Starts the proxy on a local port
  4. Launches Claude Code pointed at your local model

On subsequent runs, if the model is already loaded, it skips straight to Claude Code. No config files, no Docker, no environment variable dance.

Want it permanently installed?

uv tool install git+https://github.com/GuillaumeBlanchet/claude-code-local-mlx.git
claude-local

Why MLX and Not Ollama?

Fair question. Ollama now has native Claude Code support — you can do ollama launch claude --model qwen3.5 and it works. If you want the simplest possible setup, Ollama is a fine choice.

But this tool uses mlx-lm directly, and there are measurable reasons:

Ollamamlx-lm (this tool)
SetupOne commandOne command
Model formatGGUF (llama.cpp)Native MLX safetensors
Inference speed~112 tok/s~130 tok/s
Memory overheadHigher (HTTP + GGUF conversion)Lower (zero-copy, in-process)
Model selectionOllama registryAny mlx-community/ model on HuggingFace

Benchmark: Qwen3.5-35B on M4 Max. Source: antekapetanovic.com

The ~15-20% speed difference comes from eliminating the HTTP/JSON layer and using native MLX models that load directly into Apple Silicon’s unified memory with zero copy. For a coding agent making hundreds of calls per session, this compounds.

Ollama also uses GGUF models even with its MLX backend — it converts GGUF weights to MLX tensors at runtime. This tool uses native mlx-community/ models from HuggingFace with MLX-specific quantization formats (MXFP4, DWQ) that are optimized for Apple’s Metal GPU.

What Model Should I Use?

The default is mlx-community/Qwen3-Coder-Next-8bit, an 80B Mixture-of-Experts model with only 3B parameters active per token — fast and smart. But it eats 79 GB of RAM at runtime (not the 45 GB the file size suggests), so you need a 96+ GB machine.

Here’s what I recommend for each RAM tier, with real-world runtime memory and speed measured on M4 Max:

RAMModelRuntime RAMtok/sComparable to
8 GBQwen2.5-Coder-3B 4-bit~3-4 GB~200GPT-3.5 Turbo
16 GBQwen2.5-Coder-7B 4-bit~7-8 GB~90GPT-4o-mini
24 GBGemma-4-26B-A4B 4-bit~17-20 GB~110GPT-4o (code gen)
32 GBQwen2.5-Coder-32B 4-bit~25-30 GB~14-30GPT-4o (editing)
48 GBQwen3-Coder-Next 4-bit~40-46 GB~40-60Claude Sonnet 4
64 GBQwen2.5-Coder-32B 8-bit~33 GB~10-14GPT-4o (near-lossless)
96 GBQwen3-Coder-Next 8-bit~79 GB~35-50Claude Sonnet 4
128 GBDevstral-2-123B 4-bit~72-90 GB~5-8GPT-4.1

To use a different model:

claude-local -m mlx-community/Qwen2.5-Coder-32B-Instruct-4bit

A Warning About RAM Estimates

Don’t trust “model file size” as your RAM estimate. The on-disk weights are just part of the picture. At runtime, the KV cache (which grows with context length), tokenizer, and framework overhead can add 50-80% on top. I learned this the hard way when Qwen3-Coder-Next-8bit (45 GB on disk) consumed 79 GB of actual RAM on my M4 Max.

The Harness Boost: Why Benchmarks Undersell Your Local Model

Here’s the thing most people miss: coding benchmarks (Aider, SWE-bench, HumanEval) test models with basic scaffolding — a simple loop of “read prompt, generate code, check.” Claude Code is far more sophisticated: it has tool use, iterative file editing, shell access, multi-step planning, and error recovery.

Research shows the harness matters more than the model at the frontier:

StudyBasic scaffoldOptimized scaffoldGain
Claude Opus 4.5 on CORE-Bench42%78% (with Claude Code)+36 pts
LangChain coding agent52.8%66.5%+14 pts
SWE-bench Pro~1 pt from model swaps~22 pts from scaffold swapsScaffold >> Model

Source: Particula

So a Qwen2.5-Coder-32B that benchmarks like GPT-4o in isolation? With Claude Code’s agentic harness, it can perform closer to Claude Sonnet 4 in practice. The model provides the intelligence; the harness multiplies it.

How It Works Under the Hood

Claude Code  ──  Anthropic Messages API  ──>  mlx-proxy  ──  MLX inference  ──>  Local Model
  (CLI)      <──     /v1/messages        ──  (port 18808) <──    tokens     ──  (Apple GPU)

Claude Code requires the Anthropic Messages API (/v1/messages). Local models don’t speak this protocol. The proxy (claude-code-mlx-proxy) bridges this gap: it accepts Anthropic-format requests, translates them into MLX model calls, and streams responses back in the expected format.

The tool handles the full lifecycle:

  • Clones the proxy repo to ~/.local/share/claude-local/proxy/
  • Upgrades dependencies (the upstream repo sometimes pins old versions of mlx-lm that don’t support newer model architectures)
  • Patches the proxy to accept Claude Code’s "adaptive" thinking mode
  • Starts the proxy, waits for the health endpoint, then execs into Claude Code with the right environment variables
  • Reuses an existing proxy if it’s already healthy (no model reload on subsequent launches)

Practical Tips

Kill the proxy when you’re done:

claude-local --kill

Force restart (e.g., to switch models):

claude-local --restart -m mlx-community/Gemma-4-26b-a4b-it-4bit

Proxy-only mode (if you want to use Claude Code separately):

claude-local --server
# Then in another terminal:
ANTHROPIC_BASE_URL=http://127.0.0.1:18808 claude

Check logs if something breaks:

cat ~/.local/state/claude-mlx-proxy.log

The Killer Use Case: DevOps Without Leaking Secrets

This is the use case that convinced me to build this tool.

When Claude Code runs a kubectl get pods, a gh pr create, or a terraform plan, the full output — including cluster names, namespaces, resource configurations, and error messages — gets sent to the API as context. When it reads your .env to understand a deployment, your database URLs, API keys, and cloud credentials are in the conversation. When it shells into your staging environment, the session output crosses the wire.

With a local model, none of this leaves your machine. The model runs on your GPU, the proxy runs on localhost, and your secrets stay in your RAM. You can let the agent kubectl apply -f deployment.yaml without worrying about what context it’s shipping to a third-party server.

And here’s the thing: you don’t need GPT-5-class intelligence for DevOps tasks. Running gh pr merge, kubectl rollout restart, or docker compose up doesn’t require deep reasoning. A 7B model can follow instructions, read command output, and retry on errors. Even the smallest models on the table above can handle:

  • gh workflows — creating PRs, merging, reviewing checks
  • kubectl operations — deploying, scaling, checking pod status, reading logs
  • docker compose — starting services, checking health, rebuilding
  • terraform / tofu — planning, applying, reading state
  • CI/CD debugging — reading build logs, fixing YAML, retrying pipelines

For these tasks, you don’t need frontier-model intelligence. An 80B Qwen3-Coder-Next running locally on a 96 GB Mac delivers Claude Sonnet 4-class performance at ~40 tok/s — and on a tighter budget, even a 32B model on a 32 GB MacBook Pro handles DevOps workflows well. Either way, infinitely more secure than sending your cloud credentials through an external API.

The Security Argument in One Sentence

If your AI agent can kubectl exec into your production cluster, you probably don’t want its brain hosted by a third party.

Conclusion

Running Claude Code with a local model isn’t just about saving money — it’s about security. Your secrets, your cloud credentials, your infrastructure state — none of it needs to leave your machine for an AI agent to be useful.

The model quality gap is closing fast. A well-harnessed Qwen3-Coder-Next on a MacBook Pro performs in the same league as commercial models that cost hundreds of dollars per month. And with Apple Silicon’s unified memory giving MLX a structural advantage, your Mac is quietly becoming one of the best AI development machines you can buy.

Try it:

uv tool run --from git+https://github.com/GuillaumeBlanchet/claude-code-local-mlx.git claude-local

The repo is at github.com/GuillaumeBlanchet/claude-code-local-mlx. PRs welcome.