Two years ago, running a local LLM meant burning a weekend, four cables, and some dignity to get something that was marginally useful for code completion. In 2026 the story is different. A quantized Llama 3.3 70B on a 48GB M4 Max is genuinely a working assistant. Qwen 2.5 Coder 32B beats GPT-4 class models on a lot of coding tasks. You still should not try to replace frontier cloud models for heavy reasoning, but the floor has moved up a lot.
This guide is what we actually run on our own machines. Pricing is what you paid for the hardware divided by the hours you use it. Everything else is free, and depending on your workload, that matters.
What changed, and why local is suddenly worth the trouble
Three things happened at once. Apple Silicon hit M4 and M4 Max in late 2024 with up to 128GB of unified memory, which is basically VRAM as far as LLMs care. NVIDIA shipped the RTX 5090 with 32GB of VRAM for consumer prices. And the open-weights ecosystem kept compounding. Models in the 30-70B parameter range with modern training recipes are shockingly good. You can get something that feels like GPT-4 from 2023 running on a laptop.
The second change is tooling. Ollama 0.6 is boring in the best way. You install it, you run ollama run qwen2.5-coder:32b, and you have a chat. LM Studio has become the default GUI for people who do not live in terminals. Both expose OpenAI-compatible APIs, so pointing any existing tool at localhost is a one-line config change.
The hardware math you need before you start
Rule of thumb: a quantized model needs roughly 0.6 to 0.8 GB of memory per billion parameters at Q4 quantization, plus a couple of gigabytes of headroom for context. That means:
- 24GB machine (base M4 MacBook Pro, RTX 4090): comfortable with models up to about 27B parameters. Qwen 2.5 Coder 32B fits if you are careful with context.
- 48GB machine (M4 Max, dual 4090): Llama 3.3 70B Q4 runs with room to spare. You get 16k tokens of context and generation speed in the teens of tokens per second on Apple Silicon.
- 64GB+ PC with RTX 5090: you can fit 70B at Q5 for better quality, or start running Mixtral-class models at full speed.
- 128GB Mac Studio: you can run 120B models, though token generation drops to around 5-6 tokens per second, which is usable but not snappy.
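The rule of thumb above is easy to turn into a quick sanity check before downloading anything. A minimal sketch, using the 0.6-0.8 GB per billion parameters range and the couple of gigabytes of context headroom from this guide (these are estimates, not exact figures):

```python
def estimated_memory_gb(params_billions: float,
                        gb_per_billion: float = 0.7,
                        headroom_gb: float = 2.0) -> float:
    """Rough Q4 footprint: ~0.6-0.8 GB per billion params, plus context headroom."""
    return params_billions * gb_per_billion + headroom_gb

for size in (27, 32, 70):
    print(f"{size}B -> ~{estimated_memory_gb(size):.0f} GB")
```

At the low end of the range (0.6 GB per billion), a 70B model lands around 44 GB, which is why it just fits on a 48GB machine while a 32B model is comfortable on 24GB only if you keep the context short.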
Raw compute matters less than memory bandwidth once the model fits. The M4 Max hits around 546 GB/s of memory bandwidth, which is why it generates tokens faster per dollar than most PCs for large models. The 5090 hits 1,792 GB/s, which is why it wins outright on any model that fits inside its 32GB of VRAM.
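The reason bandwidth dominates is that generating one token requires streaming essentially all of the weights through memory once, so bandwidth divided by model size gives a hard ceiling on tokens per second. A back-of-the-envelope sketch (real throughput comes in below this ceiling; the model sizes are approximate Q4 file sizes, assumed for illustration):

```python
def rough_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed: every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

# M4 Max (546 GB/s) with a ~40 GB 70B Q4 model: ceiling in the low teens
print(rough_tokens_per_second(546, 40))
# RTX 5090 (1792 GB/s) with a ~19 GB 32B Q4 model: ceiling near 100 tok/s
print(rough_tokens_per_second(1792, 19))
```

This is why the teens-of-tokens-per-second figures quoted above for 70B on Apple Silicon are about as good as the hardware allows, regardless of how fast the GPU cores are.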
The short list: which models actually earn their disk space
We have tried most of what shipped in 2025 and 2026. Most of it is not worth keeping. Here is what we still run every day.
Qwen 2.5 Coder 32B
The best local coding model by a real margin. On HumanEval and our own internal tests (TypeScript, Python, SQL), it beats every other open model under 70B. Feels usable for autocomplete, function generation, and mid-complexity refactors. It is the reason local is plausible for offline dev work.
Llama 3.3 70B Instruct
The best general-purpose local model. If you want one model that does decent code, writing, summarization, and reasoning, this is it. Q4 quantization runs on 48GB. It is not as good as Claude 4.7 at anything, but it is fine for 80% of daily tasks.
Gemma 3 27B
Google's Gemma 3 hit in February 2026 and is our pick for the 24GB machines. Cleaner writing than Llama at this size, good at summarization, and reasonable at code. Long context (128k) that actually works.
Mistral Small 3 24B
Fastest in the 20-30B range. If you want something that streams snappily for a chat UI, this is what we use. Not as strong at code as Qwen, but better at instruction-following on short tasks.
Actually getting it running
On a Mac or Linux box, install Ollama and pull a model. This is genuinely the whole setup:
# install Ollama (Mac)
brew install ollama
# start the background service
ollama serve &
# pull a model (this downloads ~19GB)
ollama pull qwen2.5-coder:32b
# chat with it
ollama run qwen2.5-coder:32b
# or point any OpenAI-compatible client at:
# http://localhost:11434/v1

On Windows with an NVIDIA card, install the Ollama Windows build or use LM Studio for a GUI. LM Studio is easier if you want to swap models visually and tweak quantization levels. It also makes it trivial to load GGUF files from Hugging Face that are not in Ollama's default catalog.
Point Continue (the VS Code extension) or Zed at http://localhost:11434/v1 with any API key string, and you have a fully local coding assistant. Cursor can also be pointed at a local endpoint if you use their custom OpenAI provider setting, though Cursor's best features still expect cloud models.
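Because Ollama exposes the OpenAI chat-completions wire format, you do not even need a client library to talk to it. A stdlib-only sketch (the model name and prompt are placeholders; assumes Ollama is serving on its default port):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       model: str = "qwen2.5-coder:32b",
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a POST to a local OpenAI-compatible /chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # any non-empty key string works
    )

def ask_local(prompt: str) -> str:
    """Send the request and pull the assistant's reply out of the response."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_local("Write a one-line docstring for a binary search function."))
```

Swapping in the official `openai` Python package is a two-line change: point `base_url` at localhost and pass any string as the API key.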
Where local beats cloud, honestly
Local wins in three specific cases. The first is privacy. If you are dealing with client NDAs, healthcare data, unreleased product code, or anything else where "it went to OpenAI's servers and maybe got logged" is a problem, local is not a nice-to-have, it is the only option.
The second is offline. Long flights, flaky wifi, hotel internet that maxes out at 2 Mbps. A local model means you still work. We wrote more words on a Zurich to Newark flight with Llama 3.3 running offline than in most full days at the office.
The third is bulk. If you need to process 100,000 documents with simple summarization or classification, running a local model at 15 tokens per second for a week is cheaper than paying for cloud inference by the token. We ran a document classification job last month: 420,000 records, Qwen 14B, ~18 hours on a single 5090, $0 in API costs. The cloud equivalent would have been around $340.
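The bulk-job arithmetic is worth running before you commit a machine for days. A rough sketch (the tokens-per-record, throughput, and cloud price figures below are illustrative assumptions, not measurements from the run described above):

```python
def batch_hours(records: int, tokens_per_record: float, tokens_per_second: float) -> float:
    """Wall-clock hours to push a batch job through a local model."""
    return records * tokens_per_record / tokens_per_second / 3600

def cloud_cost_usd(records: int, tokens_per_record: float,
                   usd_per_million_tokens: float) -> float:
    """What the same token volume would cost on a metered API."""
    return records * tokens_per_record / 1_000_000 * usd_per_million_tokens

# Hypothetical numbers: 420k records, ~250 tokens each,
# batched local throughput of 1600 tok/s, cloud price of $3 per million tokens
print(batch_hours(420_000, 250, 1600))     # roughly 18 hours
print(cloud_cost_usd(420_000, 250, 3.0))   # roughly $315
```

Batched inference matters here: a single interactive stream at 15 tok/s is slow, but a server pushing many requests in parallel gets far higher aggregate throughput, which is what makes week-scale jobs collapse into a day.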
Local is not about beating the cloud on quality. It is about owning the pipe.
Where cloud still wins, and it is not close
If your work is one hard problem at a time and you want the best possible answer, cloud wins. Claude 4.7 at the top of its game is still measurably better than anything you can run locally for complex reasoning, long-context synthesis, and agentic tool use. The gap on code is narrower than people think, but it is real on anything else.
Cloud also wins on speed for a single user. You get 80-120 tokens per second from a frontier API. Your 48GB M4 Max gives you 15-20. For an interactive chat where you are reading as it generates, that is a different feel.
Our actual setup
Two of us are on M4 Max MacBook Pros with 48GB. One is on a Windows box with a 5090 and 64GB of RAM. The models we keep loaded on disk:
- qwen2.5-coder:32b as the default local coding model
- llama3.3:70b-instruct-q4_K_M for general-purpose tasks
- gemma3:27b for writing and summarization
- nomic-embed-text for local RAG embeddings
We use Continue.dev inside VS Code, pointed at both Ollama and Anthropic. The extension lets us route "quick" tasks to local and "hard" tasks to Claude with a hotkey. About 60% of our volume now goes local, and that number has been creeping up.
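For reference, the local-plus-cloud routing is just two model entries in Continue's configuration. A hedged sketch in the JSON config format older Continue releases use (key names and provider identifiers may differ in newer versions, and the Claude model id and API key are placeholders):

```json
{
  "models": [
    {
      "title": "Qwen local (quick tasks)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b"
    },
    {
      "title": "Claude (hard tasks)",
      "provider": "anthropic",
      "model": "<claude-model-id>",
      "apiKey": "<YOUR_ANTHROPIC_KEY>"
    }
  ]
}
```

With both entries present, switching models per request is a dropdown or hotkey inside the extension rather than a config change.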
Quantization: what the numbers actually mean
Model names on Ollama come with suffixes like q4_K_M, q5_K_M, q8_0, and fp16. These are quantization levels. Lower numbers mean smaller files and faster inference, at the cost of some quality. q4_K_M is the default for most local setups because it is the sweet spot: roughly 4 bits per weight, meaningful compression, and the quality loss versus fp16 is small enough that most users cannot tell on most tasks.
We ran a small internal eval comparing Qwen 2.5 Coder 32B at different quantization levels on a set of 40 coding tasks. q4_K_M got 34 right. q5_K_M got 35. q8_0 got 36. fp16 got 36. The pattern is consistent: going from q4 to q5 matters a little, and everything above q5 is basically the same. Unless you have memory to burn, q4 or q5 is the right default.
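The suffix maps to an approximate bits-per-weight figure, which is all you need to predict a download size. A sketch (the bits-per-weight values are approximate community figures for llama.cpp GGUF quants, assumed here rather than taken from any spec):

```python
# Approximate effective bits per weight for common GGUF quantization levels
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q5_K_M": 5.69, "q8_0": 8.5, "fp16": 16.0}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Approximate on-disk size: parameter count times bits per weight."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for q in BITS_PER_WEIGHT:
    print(f"32B at {q}: ~{gguf_size_gb(32, q):.1f} GB")
```

For a 32B model, q4_K_M works out to roughly 19 GB, which matches the download size of the Qwen pull in the setup snippet above; fp16 for the same model would be around 64 GB.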
RAG on your own machine, for real
One of the most underrated local workflows is a private RAG system over your own documents. We keep a local Qdrant instance running on each of our dev machines, populated with notes, meeting transcripts, and design docs. Embeddings are from nomic-embed-text via Ollama. Query through a small wrapper that retrieves top-k chunks and hands them to Qwen or Llama for synthesis.
The total setup is about 60 lines of Python. The whole thing runs without touching the internet. Searching two years of our own notes by meaning instead of keywords is the kind of quietly useful thing that justifies the whole local stack on its own, without ever needing to compete with cloud models on benchmarks.
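The shape of that wrapper looks roughly like this. A minimal sketch, assuming Qdrant and Ollama on their default ports with nomic-embed-text and a chat model pulled; the collection name, chunk sizes, and prompt are our placeholders, and error handling is omitted:

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"
QDRANT = "http://localhost:6333"

def _request(url: str, payload: dict, method: str = "POST") -> dict:
    """POST/PUT JSON and decode the JSON response."""
    req = urllib.request.Request(url, data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"},
                                 method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive character-window chunking; a token-aware splitter would be better."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> list[float]:
    out = _request(f"{OLLAMA}/api/embeddings",
                   {"model": "nomic-embed-text", "prompt": text})
    return out["embedding"]

def index_docs(docs: list[str]) -> None:
    # nomic-embed-text produces 768-dimensional vectors
    _request(f"{QDRANT}/collections/notes",
             {"vectors": {"size": 768, "distance": "Cosine"}}, method="PUT")
    pieces = [c for d in docs for c in chunk(d)]
    points = [{"id": i, "vector": embed(c), "payload": {"text": c}}
              for i, c in enumerate(pieces)]
    _request(f"{QDRANT}/collections/notes/points", {"points": points}, method="PUT")

def ask(question: str, k: int = 5,
        model: str = "llama3.3:70b-instruct-q4_K_M") -> str:
    hits = _request(f"{QDRANT}/collections/notes/points/search",
                    {"vector": embed(question), "limit": k, "with_payload": True})
    context = "\n\n".join(h["payload"]["text"] for h in hits["result"])
    out = _request(f"{OLLAMA}/api/chat", {
        "model": model, "stream": False,
        "messages": [{"role": "user",
                      "content": f"Answer from these notes:\n{context}\n\nQ: {question}"}],
    })
    return out["message"]["content"]
```

The design choice that matters is keeping both stores on localhost: every step, embedding, retrieval, and synthesis, stays on the machine.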
Bottom line
If you have a 48GB or bigger machine, install Ollama this week and pull Qwen 2.5 Coder. You will be surprised how often it is all you need. Keep Claude or GPT-5.1 in your back pocket for the hard problems, and do not try to replace the frontier with local on the work that matters most. The right mental model for 2026 is not "local or cloud." It is "local first for everyday, cloud when you need to think harder."