Skip to main content

Local Models with Ollama

Run open-source LLMs on your own hardware via Ollama. Zero per-request cost, nothing leaves your machine, and the orchestrator talks to it over an OpenAI-compatible API.

Pros: complete data confidentiality, no API cost, no internet needed after the first download. Cons: requires capable hardware, slower than cloud APIs, you're on your own for model updates, and the machine is a single point of failure.

Setup

1. Install Ollama

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

Windows: download from ollama.ai/download.

2. Pull a model

ollama pull llama3        # 8B, general purpose, 8 GB RAM
ollama pull mistral # 7B, fast, 8 GB RAM
ollama pull mixtral # 8x7B, higher quality, 32 GB RAM
ollama list # see what you have

3. Run the server

ollama serve              # listens on http://localhost:11434

4. Configure in the orchestrator

Provider: ollama
API URL: http://localhost:11434/v1
Model: llama3
Authentication: None

Hardware requirements

Model sizeRAMGPU (recommended)
3B (Phi-3)4 GBOptional — runs on CPU
7B8 GB6 GB VRAM
13B16 GB8 GB VRAM
34B32 GB16 GB VRAM
70B64 GB24 GB+ VRAM

A dedicated NVIDIA GPU with CUDA makes a significant difference to response times. Ollama uses the GPU automatically when available — check with ollama run llama3 --verbose (look for "using GPU").

Remote access

To let the orchestrator reach Ollama running on a different machine:

OLLAMA_HOST=0.0.0.0 ollama serve

Then point the orchestrator at the machine's LAN IP instead of localhost: API URL: http://192.168.1.100:11434/v1.

Custom model configuration

If you want tuned parameters or a baked-in system prompt, create a Modelfile:

cat << 'EOF' > Modelfile
FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant for the 6022 protocol."
EOF

ollama create my-agent-model -f Modelfile

Then use Model: my-agent-model in the orchestrator.

Supported models

Llama 3, Mistral, Mixtral, Qwen, Phi-3, CodeLlama, and anything else in the Ollama library. RAM needs scale with parameter count — see the table above.

Troubleshooting

  • Connection refusedollama serve isn't running, or you're trying to reach it from another machine without OLLAMA_HOST=0.0.0.0. Test with curl http://localhost:11434/api/tags.
  • Out of memory when loading the model — use a smaller variant (7B instead of 70B), close other applications, or add RAM / GPU.
  • Slow responses — add GPU acceleration, shrink the context window in your prompts, or switch to a quantised model (q4_0, q5_1).
  • Model not found — check ollama list to confirm the model is pulled locally, and verify the spelling in the orchestrator matches.