Ollama makes running open-weight LLMs locally genuinely easy. Three commands — install, pull a model, run — and you have a local inference server with an OpenAI-compatible API. As of 2026, the tool has 100K+ GitHub stars and ships production-grade model management.
Installation: `curl -fsSL https://ollama.com/install.sh | sh` on Linux/Mac, or download the Windows installer. Then `ollama pull llama3.2` to fetch Meta's 3B parameter model (2.0GB), or `ollama pull llama3.1:70b` for the 70B variant (40GB). `ollama serve` starts the API server on port 11434, compatible with the OpenAI API format — swap `https://api.openai.com` for `http://localhost:11434` in your existing code and most clients work without changes.
Model selection depends on your hardware. 8GB RAM: Llama 3.2 3B, Gemma 2 2B, Phi-3 mini — fast inference, good for classification and simple generation. 16GB RAM: Llama 3.2 8B, Mistral 7B, Gemma 2 9B — quality comparable to GPT-3.5, fast on Apple M-series chips. 32GB+ RAM: Llama 3.1 70B quantized (Q4), Mixtral 8x7B — quality approaching GPT-4o for many tasks. GPU: any NVIDIA RTX 3080 or better dramatically accelerates inference — a 7B model runs 80-120 tokens/second on RTX 4090 vs 15-20 tokens/second on Apple M3 CPU.
When local makes sense: privacy (no data leaves your machine), cost at scale (zero marginal API cost once hardware is amortized), offline capability, regulatory requirements prohibiting cloud AI, and latency for on-device applications. A developer running 10 million tokens/day locally saves $750-1,500/day vs Claude 3.5 Sonnet API rates.
When local doesn't make sense: you need frontier quality (Llama 3.1 70B is close to GPT-4o on coding but behind on complex reasoning), you're prototyping (API is faster to start), or your team doesn't have hardware. A MacBook Pro M3 with 16GB is excellent for 7B models but runs the 70B model slowly — 3-5 tokens/second is too slow for interactive use.
Ollama's model library covers 100+ models including code-specific models (DeepSeek Coder V2, Qwen2.5-Coder) that outperform general models on programming tasks at the same parameter count.