Ollama Guide 2026: Run Local LLMs — Setup, Models, Performance

Ollama makes running open-weight LLMs locally genuinely easy. Three commands — install, pull a model, run — and you have a local inference server with an OpenAI-compatible API. As of 2026, the tool has 100K+ GitHub stars and ships production-grade model management.

Installation: `curl -fsSL https://ollama.com/install.sh | sh` on Linux/Mac, or download the Windows installer. Then `ollama pull llama3.2` to fetch Meta's 3B parameter model (2.0GB), or `ollama pull llama3.1:70b` for the 70B variant (40GB). `ollama serve` starts the API server on port 11434, compatible with the OpenAI API format — swap `https://api.openai.com` for `http://localhost:11434` in your existing code and most clients work without changes.

Model selection depends on your hardware. 8GB RAM: Llama 3.2 3B, Gemma 2 2B, Phi-3 mini — fast inference, good for classification and simple generation. 16GB RAM: Llama 3.2 8B, Mistral 7B, Gemma 2 9B — quality comparable to GPT-3.5, fast on Apple M-series chips. 32GB+ RAM: Llama 3.1 70B quantized (Q4), Mixtral 8x7B — quality approaching GPT-4o for many tasks. GPU: any NVIDIA RTX 3080 or better dramatically accelerates inference — a 7B model runs 80-120 tokens/second on RTX 4090 vs 15-20 tokens/second on Apple M3 CPU.

When local makes sense: privacy (no data leaves your machine), cost at scale (zero marginal API cost once hardware is amortized), offline capability, regulatory requirements prohibiting cloud AI, and latency for on-device applications. A developer running 10 million tokens/day locally saves $750-1,500/day vs Claude 3.5 Sonnet API rates.

When local doesn't make sense: you need frontier quality (Llama 3.1 70B is close to GPT-4o on coding but behind on complex reasoning), you're prototyping (API is faster to start), or your team doesn't have hardware. A MacBook Pro M3 with 16GB is excellent for 7B models but runs the 70B model slowly — 3-5 tokens/second is too slow for interactive use.

Ollama's model library covers 100+ models including code-specific models (DeepSeek Coder V2, Qwen2.5-Coder) that outperform general models on programming tasks at the same parameter count.

Frequently Asked Questions

How does Ollama compare to running llama.cpp directly?

Ollama wraps llama.cpp with model management, an OpenAI-compatible API server, and a model registry. Direct llama.cpp gives more control over quantization and inference parameters, but requires manual management of model files and server setup. For most developers, Ollama's DX advantage is worth the minor abstraction overhead.

Can Ollama run on Windows?

Yes, since version 0.1.17. Ollama on Windows supports NVIDIA and AMD GPUs via DirectML, plus CPU inference. Performance on Windows with an NVIDIA GPU is comparable to Linux. Apple Silicon is still the best consumer hardware for local inference due to unified memory architecture — an M3 Max with 128GB RAM can run 70B models at usable speeds.

How much does local inference cost compared to cloud APIs?

Hardware cost amortized over 3 years: a Mac Studio M4 Ultra ($3,999) runs 70B models at reasonable speed. At 5M tokens/day vs Claude 3.5 Sonnet ($3/million input, $15/million output average $9 blended): local saves ~$45/day, payback in under 3 months. For lower volume, cloud APIs win on total cost — the hardware is over-provisioned.

Ollama: Run Production LLMs on Your Own Hardware

Frequently Asked Questions

How does Ollama compare to running llama.cpp directly?

Can Ollama run on Windows?

How much does local inference cost compared to cloud APIs?

Start Using GitIntel Free

Frequently Asked Questions

How does Ollama compare to running llama.cpp directly?

Can Ollama run on Windows?

How much does local inference cost compared to cloud APIs?

Start Using GitIntel Free

Related Tools