
The Ultimate LLM Providers Shopping Guide

How to run your agentic workflow cheaply—and completely ruin top AI companies' business models


What Is the Most Cost-Effective Way to Run a Large Language Model (LLM)?

When leveraging Large Language Models, you face a key decision: where should you run them? Options have multiplied, each with its own trade-offs. Should you:

  1. Purchase dedicated hardware to run an open-source LLM locally?
  2. Use a cloud service like OpenAI or Google Gemini, or pay for inference on a platform that hosts open‑source LLMs such as Groq or DeepInfra?
  3. Rent raw GPU power from a cloud provider like Modal to host an open‑source model yourself?

I’ve explored all three paths, and here are my findings.

Running Open‑Source LLMs Directly on Your Machine — or Why Apple Stock Is Going to Explode

If you want to run an LLM locally, you first need to pick a worthwhile open‑source model.

You want the smartest open‑source model you can run with the lowest hardware requirements.

To do that, consult model leaderboards.

I prefer Artificial Analysis because it computes an “intelligence index,” a composite of multiple renowned benchmarks, for each model.

As of now, the new gpt-oss-120B appears to be the cheapest to run on cloud providers while maintaining a strong score (61), placing it second among open‑weights models on the intelligence index. If it’s cheaper to run in the cloud, it’s likely less computationally demanding—and the same should hold on your hardware too.

Now, gpt-oss-120B needs high‑end hardware like an NVIDIA A100 or H100 with 80 GB of VRAM (memory dedicated to the GPU).

The interesting part is that Apple’s architecture blends VRAM and RAM, sidestepping this reliance on a dedicated GPU.

Apple’s Secret Sauce: Unified Memory Architecture (UMA)

On traditional PCs, the CPU has its own memory (RAM), and the GPU has its own separate, faster memory (VRAM). When running an LLM, data must constantly be copied back and forth between these two separate memory pools, creating a significant bottleneck that slows everything down.

Apple’s M‑series chips, such as the M4 Max, use a different approach.

Shared Memory Pool: With UMA, the CPU, GPU, and Neural Engine (NPU) share a single, large pool of high‑speed memory. Your local RAM is this unified pool.

No Copying Needed: Because all components are using the same memory, there’s no need to duplicate or transfer huge amounts of data. The GPU can directly access the model data the CPU is using with minimal delay. This dramatically increases efficiency and speed.

That’s why some people have had pretty decent results (40 tokens/sec) running gpt-oss-120B on an M4 Max: https://www.reddit.com/r/LocalLLM/comments/1mix4yp/getting_40_tokenssec_with_latest_openai_120b/

For comparison, ChatGPT online delivers around 100 tokens/sec. That said, a 128 GB RAM MacBook Pro M4 Max currently sells for $6,399—not cheap—but for a company running LLMs locally and paying only for electricity, it can make sense. Apple may be behind in the LLM revolution, but it owns the hardware to make it scalable and affordable. I’m waiting for Wall Street to realize it (this is not financial advice).
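To put that $6,399 price tag in perspective, here is a back‑of‑envelope break‑even sketch in Python. It assumes the $0.26 per million tokens cloud price for gpt‑oss‑120B from the table below, and it ignores electricity, depreciation, and the cloud’s higher tokens/sec:

```python
# Back-of-envelope: how many tokens before the Mac pays for itself
# versus cloud inference? (Assumed figures; ignores electricity,
# depreciation, and throughput differences.)
mac_price_usd = 6_399
cloud_price_usd_per_mtok = 0.26  # gpt-oss-120B blended price, per 1M tokens

break_even_mtok = mac_price_usd / cloud_price_usd_per_mtok
print(f"Break-even: ~{break_even_mtok:,.0f}M tokens (~{break_even_mtok / 1e3:.0f}B tokens)")
# -> Break-even: ~24,612M tokens (~25B tokens)
```

In other words, the hardware only pays off at very high sustained volume, which is why this option mostly makes sense for companies rather than individuals.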

I Tested GPT-OSS-20B on My MacBook Air M2 (24 GB)

Of course, my MacBook Air couldn’t run the largest GPT‑OSS model, but I did run the 20B version and got ~20 tokens/sec, which was okay.

If you want to try it, here’s the simplest way (a code sketch for calling the model programmatically follows these steps):

  • Download and install LM Studio
  • Open the app and download the GPT‑OSS‑20B model
  • Click the “Start chat” button
  • You will see the chat interface
  • Ask a question. I asked, “How to clean jewelry the right way?”
  • Watch the tokens‑per‑second stats at the bottom of the window
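Beyond the chat UI, LM Studio can also expose the loaded model through a local OpenAI‑compatible server (by default on port 1234), so you can call it from code. A minimal sketch, assuming the server is enabled and the model identifier matches what LM Studio displays:

```python
# Query LM Studio's local OpenAI-compatible server.
# Assumes the server is running (default: http://localhost:1234/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is unused locally

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # illustrative identifier; copy yours from LM Studio
    messages=[{"role": "user", "content": "How to clean jewelry the right way?"}],
)
print(response.choices[0].message.content)
```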

⚠️ Quick Warning

Just a quick note: LM Studio and open‑webui are great for experimenting with a model, but they don’t replace a chat platform like ChatGPT, Claude, or Gemini. Out of the box, they lack the agentic capabilities (web search, web scraping, document parsing, etc.) that those proprietary platforms layer on top to optimize model behavior. You can configure some of these capabilities in open‑webui, but you’ll likely not reach the same level of quality and performance as the online platforms.

Running Open‑Source LLMs on Inference Providers

Artificial Analysis can help you find the best inference providers. Basically, you want to sort models by intelligence index and compute the price per intelligence‑index point to find the best deal, then check which providers offer those models. I computed this and summarized it in the table below:

| Model | Intelligence Index (Artificial Analysis) | Blended Price (USD per 1M tokens) | Price per Intelligence Point (USD per 1M tokens) |
|---|---|---|---|
| GPT-5 (high) | 69 | $3.44 | $0.0499 |
| GPT-5 (medium) | 68 | $3.44 | $0.0506 |
| Grok 4 | 68 | $6.00 | $0.0882 |
| o3-pro | 68 | $35.00 | $0.5147 |
| o3 | 67 | $3.50 | $0.0522 |
| o4-mini (high) | 65 | $1.93 | $0.0297 |
| Gemini 2.5 Pro | 65 | $3.44 | $0.0529 |
| GPT-5 mini | 64 | $0.69 | $0.0108 |
| Qwen3 235B 2507 (Reasoning) | 64 | $2.63 | $0.0411 |
| GPT-5 (low) | 63 | $3.44 | $0.0546 |
| gpt-oss-120B (high) | 61 | $0.26 | $0.0043 |
| Claude 4.1 Opus Thinking | 61 | $30.00 | $0.4918 |
| Claude 4 Sonnet Thinking | 59 | $6.00 | $0.1017 |
| DeepSeek R1 0528 | 59 | $0.96 | $0.0163 |
| Gemini 2.5 Flash (Reasoning) | 58 | $0.85 | $0.0147 |
| Grok 3 mini Reasoning (high) | 58 | $0.35 | $0.0060 |
| GLM-4.5 | 56 | $0.96 | $0.0171 |
| o3-mini (high) | 55 | $1.93 | $0.0351 |
| Claude 4 Opus Thinking | 55 | $30.00 | $0.5455 |
| GPT-5 nano | 54 | $0.14 | $0.0026 |
| Qwen3 30B 2507 (Reasoning) | 53 | $0.75 | $0.0142 |
| MiniMax M1 80k | 53 | $0.82 | $0.0155 |
| o3-mini | 53 | $1.93 | $0.0364 |
| Llama Nemotron Super 49B v1.5 (Reasoning) | 52 | $0.00 | $0.0000 |
| MiniMax M1 40k | 51 | $0.82 | $0.0161 |
| Qwen3 235B 2507 (Non-reasoning) | 51 | $1.23 | $0.0241 |
| EXAONE 4.0 32B (Reasoning) | 51 | $0.70 | $0.0137 |
| GLM-4.5-Air | 49 | $0.42 | $0.0086 |
| gpt-oss-20B (high) | 49 | $0.09 | $0.0018 |
| Claude 4.1 Opus | 49 | $30.00 | $0.6122 |
| Kimi K2 | 49 | $1.07 | $0.0218 |
| QwQ-32B | 48 | $0.49 | $0.0102 |
| Gemini 2.5 Flash | 47 | $0.85 | $0.0181 |
| GPT-4.1 | 47 | $3.50 | $0.0745 |
| Claude 4 Opus | 47 | $30.00 | $0.6383 |
| Llama Nemotron Ultra Reasoning | 46 | $0.90 | $0.0196 |
| Qwen3 30B 2507 (Non-reasoning) | 46 | $0.35 | $0.0076 |
| Claude 4 Sonnet | 46 | $6.00 | $0.1304 |
| Grok 3 Reasoning Beta | 46 | $0.00 | $0.0000 |
| Qwen3 Coder 480B | 45 | $3.00 | $0.0667 |
| Gemini 2.5 Flash-Lite (Reasoning) | 44 | $0.17 | $0.0039 |
| GPT-5 (minimal) | 44 | $3.44 | $0.0782 |
| Solar Pro 2 (Reasoning) | 43 | $0.50 | $0.0116 |
| GPT-4.1 mini | 42 | $0.70 | $0.0167 |
| Llama 4 Maverick | 42 | $0.39 | $0.0093 |
| DeepSeek R1 0528 Qwen3 8B | 42 | $0.07 | $0.0017 |
| Llama 3.3 Nemotron Super 49B Reasoning | 40 | $0.00 | $0.0000 |
| EXAONE 4.0 32B | 40 | $0.70 | $0.0175 |
| Grok 3 | 40 | $6.00 | $0.1500 |
| Mistral Medium 3 | 39 | $0.80 | $0.0205 |
| Magistral Medium | 38 | $2.75 | $0.0724 |
| Reka Flash 3 | 36 | $0.35 | $0.0097 |
| Magistral Small | 36 | $0.75 | $0.0208 |
| Nova Premier | 35 | $5.00 | $0.1429 |
| Gemini 2.5 Flash-Lite | 35 | $0.17 | $0.0049 |
| Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | 34 | $0.00 | $0.0000 |
| Qwen3 Coder 30B | 33 | $0.90 | $0.0273 |
| Solar Pro 2 | 33 | $0.50 | $0.0152 |
| Llama 4 Scout | 33 | $0.20 | $0.0061 |
| Mistral Small 3.2 | 32 | $0.15 | $0.0047 |
| Command A | 32 | $4.38 | $0.1369 |
| Devstral Medium | 31 | $0.80 | $0.0258 |
| Llama 3.3 70B | 31 | $0.58 | $0.0187 |
| GPT-4.1 nano | 30 | $0.17 | $0.0057 |
| Llama 3.1 405B | 29 | $3.25 | $0.1121 |
| MiniMax-Text-01 | 29 | $0.42 | $0.0145 |
| GPT-4o (ChatGPT) | 29 | $7.50 | $0.2586 |
| Llama 3.3 Nemotron Super 49B v1 | 28 | $0.00 | $0.0000 |
| Phi-4 | 28 | $0.22 | $0.0079 |
| Llama Nemotron Super 49B v1.5 | 27 | $0.00 | $0.0000 |
| Llama 3.1 Nemotron 70B | 26 | $0.17 | $0.0065 |
| Gemma 3 27B | 25 | $0.00 | $0.0000 |
| GPT-4o mini | 24 | $0.26 | $0.0108 |
| Gemma 3 12B | 24 | $0.06 | $0.0025 |
| R1 1776 | 22 | $3.50 | $0.1591 |
| Llama 3.2 90B (Vision) | 22 | $0.54 | $0.0245 |
| Devstral Small | 21 | $0.15 | $0.0071 |
| Gemma 3n E4B | 18 | $0.03 | $0.0017 |
| DeepHermes 3 - Mistral 24B | 18 | $0.00 | $0.0000 |
| Jamba 1.7 Large | 18 | $3.50 | $0.1944 |
| Granite 3.3 8B | 18 | $0.09 | $0.0050 |
| Codestral (Jan '25) | 16 | $0.45 | $0.0281 |
| Phi-4 Multimodal | 15 | $0.00 | $0.0000 |
| Gemma 3 4B | 14 | $0.03 | $0.0021 |
| Llama 3.2 11B (Vision) | 13 | $0.10 | $0.0077 |
| Gemma 3n E2B | 10 | $0.00 | $0.0000 |
| Ministral 8B | 10 | $0.10 | $0.0100 |
| Aya Expanse 32B | 8 | $0.75 | $0.0938 |
| Ministral 3B | 8 | $0.04 | $0.0050 |
| Jamba 1.7 Mini | 6 | $0.25 | $0.0417 |
| DeepHermes 3 - Llama-3.1 8B | 4 | $0.00 | $0.0000 |
| Aya Expanse 8B | 4 | $0.75 | $0.1875 |
| Gemma 3 1B | 1 | $0.00 | $0.0000 |

Source: Artificial Analysis
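The “price per point” column is simply the blended price divided by the intelligence index. A minimal sketch of that computation, using a few rows copied from the table above:

```python
# Price per intelligence-index point = blended USD per 1M tokens / index.
# Sample rows copied from the table above.
models = [
    ("GPT-5 mini", 64, 0.69),
    ("gpt-oss-120B (high)", 61, 0.26),
    ("Gemini 2.5 Flash (Reasoning)", 58, 0.85),
    ("Claude 4 Sonnet Thinking", 59, 6.00),
]

# Sort by best value (lowest cost per index point) and print.
for name, index, blended in sorted(models, key=lambda m: m[2] / m[1]):
    print(f"{name}: ${blended / index:.4f} per index point")
```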

You can immediately see the good options:

  • GPT‑5 mini: ~$0.011 per intelligence point ($0.69 per million tokens blended; proprietary, OpenAI)
  • gpt‑oss‑120B: ~$0.004 per intelligence point ($0.26 per million tokens blended; open weights, multiple providers)

If you choose a proprietary model like GPT‑5‑mini, you can go to the provider’s website (platform.openai.com) and buy credits.

For gpt‑oss‑120B, you can use OpenRouter (openrouter.ai) to see which providers offer it and their pricing.

If you don’t want to switch providers when they change prices, you can use OpenRouter directly to handle that for you. Most models follow the OpenAI JSON format, so you can use the OpenAI SDK and Agents SDK across providers—even for non‑OpenAI models.
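For example, here is a minimal sketch of routing a request through OpenRouter with the OpenAI SDK; the base URL is OpenRouter’s OpenAI‑compatible endpoint, and the model slug is illustrative:

```python
# Call an open-weights model through OpenRouter using the OpenAI SDK.
# OpenRouter exposes an OpenAI-compatible API at https://openrouter.ai/api/v1.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # your OpenRouter key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # OpenRouter routes this to one of its providers
    messages=[{"role": "user", "content": "Explain unified memory in one sentence."}],
)
print(response.choices[0].message.content)
```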

OpenRouter takes a 5.5% markup on the credits you buy to pay for this broker service.

Other routing platforms exist as well and offer additional features.

Running Open‑Source Models on Rented GPU Infrastructure

Well, this is the most disappointing option. I tried it with Modal. I deployed the smaller gpt‑oss‑20B model on an H200 GPU and, a few minutes later, I burned through my $5 starter credits. In comparison, I ran multiple heavy experiments over a full day with GPT‑5‑mini on the OpenAI platform and it cost me only 9 cents.

With Modal, by contrast, I made 12 small prompts to the model for $5.40 🤑.

This is because Modal reserves your infrastructure until you explicitly stop it. But if you stop it right after a prompt, you’ll need to wait for the infrastructure to provision again before the next prompt. This is fast on Modal—thanks to significant R&D mitigating Docker and Kubernetes filesystem boot‑time overhead—but in my view it’s still too slow for production.
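If you do go this route, the main cost lever is how long an idle container stays warm. A minimal sketch, assuming the Modal Python SDK (parameter names may vary across SDK versions):

```python
# Sketch of a Modal GPU function with a short idle window, so a warm
# container is released quickly after each prompt. Assumes the Modal SDK;
# container_idle_timeout is the idle window in seconds.
import modal

app = modal.App("gpt-oss-20b-demo")

@app.function(gpu="H100", container_idle_timeout=60)
def generate(prompt: str) -> str:
    # Placeholder: load the model once per container, then run inference here.
    return f"(model output for: {prompt})"

@app.local_entrypoint()
def main():
    print(generate.remote("How to clean jewelry the right way?"))
```

The trade-off is exactly the one described above: a short idle window saves money, but every cold start makes the next prompt wait for provisioning.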

Conclusion

If you optimize purely for cost today, OpenAI wins. Among proprietary models, GPT‑5 mini is currently the best deal at roughly $0.01 per intelligence‑index point ($0.69 per million tokens blended). Among open‑weights models, the best value (gpt‑oss‑120B at around $0.004 per point, or $0.26 per million tokens blended) is easily accessed through the OpenAI‑compatible ecosystem via gateways like OpenRouter, so you can standardize on the OpenAI SDK and still route to the cheapest providers.

Companies developing agentic workflows can lower their costs by running gpt‑oss‑120B on a well‑equipped Mac Mini, which could also act as an inference dev server on the local network.

Note: prices and availability change frequently—always recheck before committing to a stack.