08 · Inference Options — Providers, Models, and Runtime Switching

Source: the nemoclaw-user-configure-inference skill. This doc is about which model answers your agent — how NemoClaw presents providers at onboarding, how runtime switching actually works, and the gotchas for local model servers.

The one thing to internalize first

Inside the sandbox, the agent always talks to inference.local. It never sees real URLs or real credentials. Everything below is about what that name routes to on the host side.

This means "switching models" has two flavors, and they are not interchangeable:

Kind of switch	What changes	How you do it	Lifecycle
Same provider, different model (Nemotron → GPT-OSS on NVIDIA)	Just the OpenShell gateway route	`openshell inference set --provider <p> --model <m>`	Hot — no restart
Different provider family (NVIDIA → Anthropic)	Gateway route and the baked `openclaw.json` API style inside the sandbox	`openshell inference set ... --no-verify` + `NEMOCLAW_MODEL_OVERRIDE` + `NEMOCLAW_INFERENCE_API_OVERRIDE` + `nemoclaw onboard --resume --recreate-sandbox`	Recreate — sandbox must bounce

Why the recreate? OpenClaw itself needs to know whether to speak openai-completions or anthropic-messages. That lives in its baked config; the entrypoint patches openclaw.json at container startup from the override env vars, so you have to bounce the sandbox to pick up the new values. NEMOCLAW_INFERENCE_API_OVERRIDE accepts:

openai-completions — for NVIDIA, OpenAI, Gemini, and OpenAI-compatible endpoints
anthropic-messages — for Anthropic and Anthropic-compatible endpoints
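As a quick reference, the mapping above can be sketched as a small shell helper. The function name is hypothetical and not part of NemoClaw; the provider names are the ones used with openshell inference set elsewhere in this doc, and local providers are left out because their provider names are not given here.

```shell
# Hypothetical helper (not part of NemoClaw): map a provider name to the
# value NEMOCLAW_INFERENCE_API_OVERRIDE should carry for it.
api_style_for_provider() {
  case "$1" in
    anthropic-prod|compatible-anthropic-endpoint)
      echo "anthropic-messages" ;;
    nvidia-prod|openai-api|gemini-api|compatible-endpoint)
      echo "openai-completions" ;;
    *)
      # Anything else is outside what this doc describes.
      echo "unknown provider: $1" >&2
      return 1 ;;
  esac
}

# Usage: api_style_for_provider anthropic-prod
```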

The provider matrix

Provider	Status	Endpoint type	Env var	Curated models
NVIDIA Endpoints (default)	Tested	OpenAI-compat	`NVIDIA_API_KEY`	Nemotron 3 Super 120B, Kimi K2.5, GLM-5, MiniMax M2.5, GPT-OSS 120B — all on `integrate.api.nvidia.com`
OpenAI	Tested	Native OpenAI	`OPENAI_API_KEY`	`gpt-5.4`, `gpt-5.4-mini`, `gpt-5.4-nano`, `gpt-5.4-pro-2026-03-05`
Anthropic	Tested	Native Anthropic	`ANTHROPIC_API_KEY`	`claude-sonnet-4-6`, `claude-haiku-4-5`, `claude-opus-4-6`
Google Gemini	Tested	OpenAI-compat	`GEMINI_API_KEY`	`gemini-3.1-pro-preview`, `gemini-3.1-flash-lite-preview`, `gemini-3-flash-preview`, `gemini-2.5-{pro,flash,flash-lite}`
OpenAI-compatible	Tested	Custom OpenAI-compat	`COMPATIBLE_API_KEY`	You provide the model. Works with OpenRouter, LocalAI, llama.cpp, vLLM, and similar servers via this path.
Anthropic-compatible	Tested	Custom Anthropic	`COMPATIBLE_ANTHROPIC_API_KEY`	You provide the model. For Claude proxies and compatible gateways.
Local Ollama	Caveated	Local Ollama API	n/a	Detected from local install; NemoClaw can install via Homebrew on macOS
Local NVIDIA NIM	Experimental	Local OpenAI-compat	`NEMOCLAW_EXPERIMENTAL=1`	Filtered by GPU VRAM; NemoClaw pulls and manages the container
Local vLLM	Experimental	Local OpenAI-compat	`NEMOCLAW_EXPERIMENTAL=1`	Server must already be running on `localhost:8000`

Two things the wizard does that are easy to miss:

Validation. Before creating the sandbox, NemoClaw sends a real test completion to the selected endpoint and model. If it fails, the wizard loops back to provider selection. Your first hint that a credential or base URL is wrong is right here, not at runtime.
Tool-calling probe. For OpenAI-compat and Gemini, NemoClaw only prefers the newer /v1/responses path if the probe proves the endpoint emits tool calls in a shape OpenClaw understands. Otherwise it silently falls back to /v1/chat/completions. You don't have to care — just know that's what's happening.

What the baseline network policy already allows

From openclaw-sandbox.yaml (see doc 04), the baseline nvidia policy block already allows POST /v1/chat/completions, POST /v1/completions, POST /v1/embeddings, GET /v1/models, GET /v1/models/** on integrate.api.nvidia.com and inference-api.nvidia.com — pinned to /usr/local/bin/openclaw and /usr/local/bin/claude. That's why NVIDIA is the default: the out-of-the-box policy already works without touching anything. Other cloud providers (OpenAI, Anthropic, Gemini) need their hosts added via presets or baseline edits before egress to them will succeed.
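To make the allowed set concrete, here is a minimal sketch that checks a method and path against the rules quoted above. It only mirrors the list in this paragraph; it is not how OpenShell evaluates policy, the function name is hypothetical, and treating /v1/models/** as "any non-empty sub-path" is an assumption about the glob semantics.

```shell
# Hypothetical check (not the real policy engine): returns 0 if
# "METHOD PATH" matches the baseline nvidia rules listed above.
nvidia_baseline_allows() {
  case "$1 $2" in
    "POST /v1/chat/completions"|"POST /v1/completions"|"POST /v1/embeddings")
      return 0 ;;
    "GET /v1/models")
      return 0 ;;
    "GET /v1/models/"?*)
      # Assumed meaning of /v1/models/**: any non-empty sub-path.
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Usage: nvidia_baseline_allows GET /v1/models && echo allowed
```

A request such as DELETE /v1/models falls through to the last branch and is reported as not allowed.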

Runtime switching — same provider, different model

Hot swap via OpenShell. Examples:

# NVIDIA — pick a different Nemotron / third-party hosted model
openshell inference set --provider nvidia-prod --model nvidia/nemotron-3-super-120b-a12b

# OpenAI
openshell inference set --provider openai-api --model gpt-5.4

# Anthropic
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6

# Gemini
openshell inference set --provider gemini-api --model gemini-2.5-flash

# Your custom compatible endpoint
openshell inference set --provider compatible-endpoint --model <model-name>
openshell inference set --provider compatible-anthropic-endpoint --model <model-name>

Verify:

nemoclaw <name> status              # active provider, model, endpoint
nemoclaw <name> status --json       # machine-readable

Runtime switching — cross-provider family

Two-phase operation. Say you're moving from NVIDIA → Anthropic:

# 1. Set the OpenShell gateway route on the host (skip validation since
#    the sandbox isn't yet configured to speak Anthropic-messages)
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6 --no-verify

# 2. Set override env vars + recreate the sandbox
export NEMOCLAW_MODEL_OVERRIDE="anthropic/claude-sonnet-4-6"
export NEMOCLAW_INFERENCE_API_OVERRIDE="anthropic-messages"
nemoclaw onboard --resume --recreate-sandbox

The sandbox entrypoint picks up the overrides at startup and patches /sandbox/.openclaw/openclaw.json in place. No image rebuild — the same image boots into a different API style.

To revert: unset both env vars and run nemoclaw onboard --resume --recreate-sandbox again.
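The difference between the hot path and the recreate path can be sketched as a dry-run planner that prints the commands from this doc instead of running them. The function is hypothetical; the override values are passed in by the caller because this doc only shows them for the Anthropic case.

```shell
# Hypothetical dry-run helper: prints commands, runs nothing.
# Args: current provider, target provider, model, then (cross-provider only)
# the NEMOCLAW_MODEL_OVERRIDE value and the API style.
plan_switch() {
  if [ "$1" = "$2" ]; then
    # Same provider: only the gateway route changes, so one hot command.
    echo "openshell inference set --provider $2 --model $3"
  else
    # Different provider: set the route, then the overrides, then recreate.
    echo "openshell inference set --provider $2 --model $3 --no-verify"
    echo "export NEMOCLAW_MODEL_OVERRIDE=\"$4\""
    echo "export NEMOCLAW_INFERENCE_API_OVERRIDE=\"$5\""
    echo "nemoclaw onboard --resume --recreate-sandbox"
  fi
}

# Example: print the plan for the NVIDIA to Anthropic move shown above.
plan_switch nvidia-prod anthropic-prod claude-sonnet-4-6 \
  anthropic/claude-sonnet-4-6 anthropic-messages
```

Printing rather than executing keeps the sketch safe to run on a machine without either CLI installed.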

Local inference — the four paths

Path 1: Ollama (the default local option)

The onboard wizard auto-detects Ollama if it's installed or running. On macOS it'll even offer to install it via Homebrew. If no models are installed, it suggests starter models; otherwise it lists your existing ones. It then pulls whichever you pick, warms it, and validates it.

nemoclaw onboard
# → select "Local Ollama", pick a model

The Linux + Docker Ollama gotcha — read this: when NemoClaw runs under Docker on Linux, the sandbox reaches Ollama via http://host.openshell.internal:11434, not localhost. If your Ollama binds 127.0.0.1 (Ollama's default), host-side detection during onboard passes but sandbox-side validation fails because containers can't reach it. Fix:

OLLAMA_HOST=0.0.0.0:11434 ollama serve
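If you want to catch this before onboarding, a small check on the OLLAMA_HOST value can flag a loopback-only bind. This is a sketch with a hypothetical function name; it handles IPv4 addresses and hostnames only, and it treats an empty value as loopback because 127.0.0.1 is Ollama's default.

```shell
# Hypothetical helper: returns 0 if the given OLLAMA_HOST value binds to
# loopback only (which the sandbox cannot reach), 1 otherwise.
is_loopback_bind() {
  _host=${1-}
  _host=${_host#*://}   # drop an optional scheme such as http://
  _host=${_host%:*}     # drop an optional :port suffix
  case "$_host" in
    ""|localhost|127.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if is_loopback_bind "${OLLAMA_HOST-}"; then
#     echo "Ollama is loopback-only; restart it with OLLAMA_HOST=0.0.0.0:11434"
#   fi
```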

Non-interactive:

NEMOCLAW_PROVIDER=ollama \
NEMOCLAW_MODEL=qwen2.5:14b \
nemoclaw onboard --non-interactive

NEMOCLAW_MODEL is optional — if omitted, NemoClaw picks a default based on available memory.

Path 2: Any OpenAI-compatible server

Works with vLLM, TensorRT-LLM, llama.cpp, LocalAI, OpenRouter, anything that implements /v1/chat/completions. Start your server, then onboard:

# Example with vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

nemoclaw onboard
# → select "Other OpenAI-compatible endpoint"
# → base URL: http://localhost:8000/v1
# → API key: any non-empty string (e.g. "dummy") if no auth

Non-interactive:

NEMOCLAW_PROVIDER=custom \
NEMOCLAW_ENDPOINT_URL=http://localhost:8000/v1 \
NEMOCLAW_MODEL=meta-llama/Llama-3.1-8B-Instruct \
COMPATIBLE_API_KEY=dummy \
nemoclaw onboard --non-interactive
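In a script it can help to confirm the variables are set before calling onboard. The following is a minimal sketch: the function name is hypothetical, and the list simply mirrors the four variables in the example above rather than any official list of required settings.

```shell
# Hypothetical pre-flight check: prints the name of each variable from the
# example above that is unset or empty. Prints nothing when all are set.
missing_onboard_vars() {
  for _name in NEMOCLAW_PROVIDER NEMOCLAW_ENDPOINT_URL NEMOCLAW_MODEL COMPATIBLE_API_KEY; do
    # Indirect lookup: read the value of the variable whose name is in _name.
    eval "_value=\${$_name-}"
    [ -n "$_value" ] || echo "$_name"
  done
}

# Usage:
#   _missing=$(missing_onboard_vars)
#   [ -z "$_missing" ] || { echo "unset: $_missing" >&2; exit 1; }
```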

Path 3: Any Anthropic-compatible server

Same idea but for servers implementing /v1/messages:

nemoclaw onboard
# → select "Other Anthropic-compatible endpoint"

Non-interactive:

NEMOCLAW_PROVIDER=anthropicCompatible \
NEMOCLAW_ENDPOINT_URL=http://localhost:8080 \
NEMOCLAW_MODEL=my-model \
COMPATIBLE_ANTHROPIC_API_KEY=dummy \
nemoclaw onboard --non-interactive

Path 4: vLLM or NIM auto-detection (experimental)

Gated behind NEMOCLAW_EXPERIMENTAL=1.

vLLM — if already running on localhost:8000, NemoClaw queries /v1/models and uses whatever is loaded:

NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard
# → "Local vLLM [experimental]"
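To see which model a running server would report, you can query /v1/models yourself. The sketch below pulls the first model id out of the standard OpenAI-compatible list response; the function name is hypothetical, the text matching is deliberately crude (a real JSON parser is more robust), and it says nothing about how NemoClaw itself parses the response.

```shell
# Hypothetical helper: read a /v1/models JSON response on stdin and print
# the first "id" value. Crude pattern match, not a JSON parser.
first_model_id() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage against a local server:
#   curl -s http://localhost:8000/v1/models | first_model_id
```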

NIM — on NIM-capable GPUs, NemoClaw will pull, start, and manage the NIM container itself, filtering models by your GPU's VRAM:

NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard
# → "Local NVIDIA NIM [experimental]"

Shared caveat: NemoClaw forces /v1/chat/completions for both vLLM and NIM — neither runs the --tool-call-parser on /v1/responses, so tool calls would come back as raw text. This is a known workaround, not a bug.

Timeouts for local inference

The default timeout is 180 seconds for Ollama, vLLM, and NIM (cloud APIs use a shorter one) because big prompts on hardware like DGX Spark need the headroom. The value is baked into the sandbox at build time, so changing it requires a re-onboard:

export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300
nemoclaw onboard
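Because a bad value only shows up after a full re-onboard, a script may want to sanity-check it first. This sketch assumes the variable expects a positive whole number of seconds, which matches the 180 and 300 values above but is not confirmed by the source; the function name is hypothetical.

```shell
# Hypothetical validator: returns 0 only for a positive whole number.
# Assumption: NEMOCLAW_LOCAL_INFERENCE_TIMEOUT takes integer seconds.
valid_timeout() {
  case "${1-}" in
    ""|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    *) [ "$1" -gt 0 ] ;;       # digits only: must be greater than zero
  esac
}

# Usage:
#   valid_timeout "$NEMOCLAW_LOCAL_INFERENCE_TIMEOUT" || echo "bad timeout" >&2
```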

Where this lands in the rest of the doc set

The nvidia network policy block that makes the default inference route work out of the box is in 04-policies-and-guardrails.md.
The blueprint inference profiles (default, ncp, nim-local, vllm) that NemoClaw uses as pre-built templates are described in 05-nemoclaw.md under "What the blueprint actually says".
The general openshell inference set + rerun-onboard rule is in 03-command-map.md. This doc is the detailed version of that row.

The mental model stays the same as doc 02: same-provider model switches stay at the OpenShell layer and are hot; cross-provider switches cross into OpenClaw's baked config and need a sandbox recreate. Doc 04's static-vs-hot-reloadable distinction still holds — network and inference layers are dynamic, everything else (including the OpenClaw-side API style preference) is static at the sandbox image level.
