08 · Inference Options — Providers, Models, and Runtime Switching
Source: the
nemoclaw-user-configure-inferenceskill. This doc is about which model answers your agent — how NemoClaw presents providers at onboarding, how runtime switching actually works, and the gotchas for local model servers.
The one thing to internalize first
Inside the sandbox, the agent always talks to
inference.local. It never sees real URLs or real credentials. Everything below is about what that host-side label routes to.
This means "switching models" has two flavors, and they are not interchangeable:
| Kind of switch | What changes | How you do it | Lifecycle |
|---|---|---|---|
| Same provider, different model (Nemotron → GPT-OSS on NVIDIA) | Just the OpenShell gateway route | openshell inference set --provider <p> --model <m> |
Hot — no restart |
| Different provider family (NVIDIA → Anthropic) | Gateway route and the baked openclaw.json API style inside the sandbox |
openshell inference set ... --no-verify + NEMOCLAW_MODEL_OVERRIDE + NEMOCLAW_INFERENCE_API_OVERRIDE + nemoclaw onboard --resume --recreate-sandbox |
Recreate — sandbox must bounce |
Why the recreate? OpenClaw itself needs to know whether to speak openai-completions or anthropic-messages. That lives in its baked config; the entrypoint patches openclaw.json at container startup from the override env vars, so you have to bounce the sandbox to pick up the new values. NEMOCLAW_INFERENCE_API_OVERRIDE accepts:
openai-completions— for NVIDIA, OpenAI, Gemini, and OpenAI-compatible endpointsanthropic-messages— for Anthropic and Anthropic-compatible endpoints
The provider matrix
| Provider | Status | Endpoint type | Env var | Curated models |
|---|---|---|---|---|
| NVIDIA Endpoints (default) | Tested | OpenAI-compat | NVIDIA_API_KEY |
Nemotron 3 Super 120B, Kimi K2.5, GLM-5, MiniMax M2.5, GPT-OSS 120B — all on integrate.api.nvidia.com |
| OpenAI | Tested | Native OpenAI | OPENAI_API_KEY |
gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-5.4-pro-2026-03-05 |
| Anthropic | Tested | Native Anthropic | ANTHROPIC_API_KEY |
claude-sonnet-4-6, claude-haiku-4-5, claude-opus-4-6 |
| Google Gemini | Tested | OpenAI-compat | GEMINI_API_KEY |
gemini-3.1-pro-preview, gemini-3.1-flash-lite-preview, gemini-3-flash-preview, gemini-2.5-{pro,flash,flash-lite} |
| OpenAI-compatible | Tested | Custom OpenAI-compat | COMPATIBLE_API_KEY |
You provide the model. Works with OpenRouter, LocalAI, llama.cpp, vLLM via this path, etc. |
| Anthropic-compatible | Tested | Custom Anthropic | COMPATIBLE_ANTHROPIC_API_KEY |
You provide the model. For Claude proxies and compatible gateways. |
| Local Ollama | Caveated | Local Ollama API | n/a | Detected from local install; NemoClaw can install via Homebrew on macOS |
| Local NVIDIA NIM | Experimental | Local OpenAI-compat | NEMOCLAW_EXPERIMENTAL=1 |
Filtered by GPU VRAM; NemoClaw pulls and manages the container |
| Local vLLM | Experimental | Local OpenAI-compat | NEMOCLAW_EXPERIMENTAL=1 |
Server must already be running on localhost:8000 |
Two things the wizard does that are easy to miss:
- Validation. Before creating the sandbox, NemoClaw sends a real test completion to the selected endpoint and model. If it fails, the wizard loops back to provider selection. Your first hint that a credential or base URL is wrong is right here, not at runtime.
- Tool-calling probe. For OpenAI-compat and Gemini, NemoClaw only prefers the newer
/v1/responsespath if the probe proves the endpoint emits tool calls in a shape OpenClaw understands. Otherwise it silently falls back to/v1/chat/completions. You don't have to care — just know that's what's happening.
What the baseline network policy already allows
From openclaw-sandbox.yaml (see doc 04), the baseline nvidia policy block already allows POST /v1/chat/completions, POST /v1/completions, POST /v1/embeddings, GET /v1/models, GET /v1/models/** on integrate.api.nvidia.com and inference-api.nvidia.com — pinned to /usr/local/bin/openclaw and /usr/local/bin/claude. That's why NVIDIA is the default: the out-of-the-box policy already works without touching anything. Other cloud providers (OpenAI, Anthropic, Gemini) need their hosts added via presets or baseline edits before egress to them will succeed.
Runtime switching — same provider, different model
Hot swap via OpenShell. Examples:
# NVIDIA — pick a different Nemotron / third-party hosted model
openshell inference set --provider nvidia-prod --model nvidia/nemotron-3-super-120b-a12b
# OpenAI
openshell inference set --provider openai-api --model gpt-5.4
# Anthropic
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6
# Gemini
openshell inference set --provider gemini-api --model gemini-2.5-flash
# Your custom compatible endpoint
openshell inference set --provider compatible-endpoint --model <model-name>
openshell inference set --provider compatible-anthropic-endpoint --model <model-name>
Verify:
nemoclaw <name> status # active provider, model, endpoint
nemoclaw <name> status --json # machine-readable
Runtime switching — cross-provider family
Two-phase operation. Say you're moving from NVIDIA → Anthropic:
# 1. Set the OpenShell gateway route on the host (skip validation since
# the sandbox isn't yet configured to speak Anthropic-messages)
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6 --no-verify
# 2. Set override env vars + recreate the sandbox
export NEMOCLAW_MODEL_OVERRIDE="anthropic/claude-sonnet-4-6"
export NEMOCLAW_INFERENCE_API_OVERRIDE="anthropic-messages"
nemoclaw onboard --resume --recreate-sandbox
The sandbox entrypoint picks up the overrides at startup and patches /sandbox/.openclaw/openclaw.json in place. No image rebuild — the same image boots into a different API style.
To revert: unset both env vars and nemoclaw onboard --resume --recreate-sandbox again.
Local inference — the four paths
Path 1: Ollama (the default local option)
The onboard wizard auto-detects Ollama if it's installed or running. On macOS it'll even offer to install it via Homebrew. If nothing's installed, it suggests starter models; otherwise it lists your existing ones, pulls whichever you pick, warms it, validates it.
nemoclaw onboard
# → select "Local Ollama", pick a model
The Linux + Docker Ollama gotcha — read this: when NemoClaw runs under Docker on Linux, the sandbox reaches Ollama via http://host.openshell.internal:11434, not localhost. If your Ollama binds 127.0.0.1 (Ollama's default), host-side detection during onboard passes but sandbox-side validation fails because containers can't reach it. Fix:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
Non-interactive:
NEMOCLAW_PROVIDER=ollama \
NEMOCLAW_MODEL=qwen2.5:14b \
nemoclaw onboard --non-interactive
NEMOCLAW_MODEL is optional — if omitted, NemoClaw picks a default based on available memory.
Path 2: Any OpenAI-compatible server
Works with vLLM, TensorRT-LLM, llama.cpp, LocalAI, OpenRouter, anything that implements /v1/chat/completions. Start your server, then onboard:
# Example with vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
nemoclaw onboard
# → select "Other OpenAI-compatible endpoint"
# → base URL: http://localhost:8000/v1
# → API key: any non-empty string (e.g. "dummy") if no auth
Non-interactive:
NEMOCLAW_PROVIDER=custom \
NEMOCLAW_ENDPOINT_URL=http://localhost:8000/v1 \
NEMOCLAW_MODEL=meta-llama/Llama-3.1-8B-Instruct \
COMPATIBLE_API_KEY=dummy \
nemoclaw onboard --non-interactive
Path 3: Any Anthropic-compatible server
Same idea but for servers implementing /v1/messages:
nemoclaw onboard
# → select "Other Anthropic-compatible endpoint"
Non-interactive:
NEMOCLAW_PROVIDER=anthropicCompatible \
NEMOCLAW_ENDPOINT_URL=http://localhost:8080 \
NEMOCLAW_MODEL=my-model \
COMPATIBLE_ANTHROPIC_API_KEY=dummy \
nemoclaw onboard --non-interactive
Path 4: vLLM or NIM auto-detection (experimental)
Gated behind NEMOCLAW_EXPERIMENTAL=1.
vLLM — if already running on localhost:8000, NemoClaw queries /v1/models and uses whatever is loaded:
NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard
# → "Local vLLM [experimental]"
NIM — on NIM-capable GPUs, NemoClaw will pull, start, and manage the NIM container itself, filtering models by your GPU's VRAM:
NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard
# → "Local NVIDIA NIM [experimental]"
Shared caveat: NemoClaw forces /v1/chat/completions for both vLLM and NIM — neither runs the --tool-call-parser on /v1/responses, so tool calls would come back as raw text. This is a known workaround, not a bug.
Timeouts for local inference
Default is 180 seconds for Ollama/vLLM/NIM (vs shorter for cloud APIs) because DGX Spark on big prompts needs the headroom. This is baked into the sandbox at build time — changing it needs a re-onboard:
export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300
nemoclaw onboard
Where this lands in the rest of the doc set
- The
nvidianetwork policy block that makes the default inference route work out of the box is in04-policies-and-guardrails.md. - The blueprint inference profiles (
default,ncp,nim-local,vllm) that NemoClaw uses as pre-built templates are described in05-nemoclaw.mdunder "What the blueprint actually says". - The general
openshell inference set+ rerun-onboard rule is in03-command-map.md. This doc is the detailed version of that row.
The mental model stays the same as doc 02: same-provider model switches stay at the OpenShell layer and are hot; cross-provider switches cross into OpenClaw's baked config and need a sandbox recreate. Doc 04's static-vs-hot-reloadable distinction still holds — network and inference layers are dynamic, everything else (including the OpenClaw-side API style preference) is static at the sandbox image level.
← Back to 00-INDEX.md