Learning NemoClaw

08 · Inference Options — Providers, Models, and Runtime Switching

Source: the nemoclaw-user-configure-inference skill. This doc is about which model answers your agent — how NemoClaw presents providers at onboarding, how runtime switching actually works, and the gotchas for local model servers.

The one thing to internalize first

Inside the sandbox, the agent always talks to inference.local. It never sees real URLs or real credentials. Everything below is about what that host-side label routes to.

This means "switching models" has two flavors, and they are not interchangeable:

| Kind of switch | What changes | How you do it | Lifecycle |
|---|---|---|---|
| Same provider, different model (Nemotron → GPT-OSS on NVIDIA) | Just the OpenShell gateway route | openshell inference set --provider <p> --model <m> | Hot: no restart |
| Different provider family (NVIDIA → Anthropic) | Gateway route and the baked openclaw.json API style inside the sandbox | openshell inference set ... --no-verify + NEMOCLAW_MODEL_OVERRIDE + NEMOCLAW_INFERENCE_API_OVERRIDE + nemoclaw onboard --resume --recreate-sandbox | Recreate: sandbox must bounce |
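
The two rows above reduce to a rule of thumb, sketched here as a small shell helper. The function name switch_kind is hypothetical and not part of NemoClaw or OpenShell. It conservatively treats any change of provider as a family change, so it may recommend a recreate in cases where a hot swap would also have worked.

```shell
# Hypothetical helper: classify a planned switch using the table above.
# Arguments are OpenShell provider names such as nvidia-prod or anthropic-prod.
switch_kind() {
  current_provider=$1
  target_provider=$2
  if [ "$current_provider" = "$target_provider" ]; then
    echo "hot"        # only the gateway route changes
  else
    echo "recreate"   # the baked openclaw.json API style may need to change too
  fi
}

switch_kind nvidia-prod nvidia-prod      # prints: hot
switch_kind nvidia-prod anthropic-prod   # prints: recreate
```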

Why the recreate? OpenClaw itself needs to know whether to speak openai-completions or anthropic-messages. That choice lives in its baked config: the entrypoint patches openclaw.json at container startup from the override env vars, so you have to bounce the sandbox to pick up the new values. NEMOCLAW_INFERENCE_API_OVERRIDE takes the API style name; the two styles named in this doc are openai-completions and anthropic-messages.

The provider matrix

| Provider | Status | Endpoint type | Env var | Curated models |
|---|---|---|---|---|
| NVIDIA Endpoints (default) | Tested | OpenAI-compat | NVIDIA_API_KEY | Nemotron 3 Super 120B, Kimi K2.5, GLM-5, MiniMax M2.5, GPT-OSS 120B, all on integrate.api.nvidia.com |
| OpenAI | Tested | Native OpenAI | OPENAI_API_KEY | gpt-5.4, gpt-5.4-mini, gpt-5.4-nano, gpt-5.4-pro-2026-03-05 |
| Anthropic | Tested | Native Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-6, claude-haiku-4-5, claude-opus-4-6 |
| Google Gemini | Tested | OpenAI-compat | GEMINI_API_KEY | gemini-3.1-pro-preview, gemini-3.1-flash-lite-preview, gemini-3-flash-preview, gemini-2.5-{pro,flash,flash-lite} |
| OpenAI-compatible | Tested | Custom OpenAI-compat | COMPATIBLE_API_KEY | You provide the model. Works with OpenRouter, LocalAI, llama.cpp, vLLM via this path, etc. |
| Anthropic-compatible | Tested | Custom Anthropic | COMPATIBLE_ANTHROPIC_API_KEY | You provide the model. For Claude proxies and compatible gateways. |
| Local Ollama | Caveated | Local Ollama API | n/a | Detected from local install; NemoClaw can install via Homebrew on macOS |
| Local NVIDIA NIM | Experimental | Local OpenAI-compat | NEMOCLAW_EXPERIMENTAL=1 | Filtered by GPU VRAM; NemoClaw pulls and manages the container |
| Local vLLM | Experimental | Local OpenAI-compat | NEMOCLAW_EXPERIMENTAL=1 | Server must already be running on localhost:8000 |
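
As a quick reference, the credential column of the matrix can be written as a lookup. This is only a sketch: provider_key_var is a hypothetical helper, and pairing the OpenShell provider names used later in this doc (nvidia-prod, openai-api, and so on) with these env vars is an assumption made by reading the two lists side by side, not something the source states directly.

```shell
# Hypothetical lookup: which credential env var goes with which provider.
# The provider-name-to-env-var pairing is an assumption, see the note above.
provider_key_var() {
  case "$1" in
    nvidia-prod)                   echo NVIDIA_API_KEY ;;
    openai-api)                    echo OPENAI_API_KEY ;;
    anthropic-prod)                echo ANTHROPIC_API_KEY ;;
    gemini-api)                    echo GEMINI_API_KEY ;;
    compatible-endpoint)           echo COMPATIBLE_API_KEY ;;
    compatible-anthropic-endpoint) echo COMPATIBLE_ANTHROPIC_API_KEY ;;
    *)                             return 1 ;;   # local or unknown provider: no key listed
  esac
}

provider_key_var anthropic-prod   # prints: ANTHROPIC_API_KEY
```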

Two things the wizard does that are easy to miss:

What the baseline network policy already allows

From openclaw-sandbox.yaml (see doc 04), the baseline nvidia policy block already allows POST /v1/chat/completions, POST /v1/completions, POST /v1/embeddings, GET /v1/models, GET /v1/models/** on integrate.api.nvidia.com and inference-api.nvidia.com — pinned to /usr/local/bin/openclaw and /usr/local/bin/claude. That's why NVIDIA is the default: the out-of-the-box policy already works without touching anything. Other cloud providers (OpenAI, Anthropic, Gemini) need their hosts added via presets or baseline edits before egress to them will succeed.
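
To make the allow-list concrete, here is a minimal sketch that encodes the method and path rules listed above as a shell pattern match. The function name nvidia_baseline_allows is hypothetical. It ignores the host and binary pinning, and it is not how OpenShell actually evaluates policy; it only mirrors the rule list as written.

```shell
# Hypothetical check: does METHOD PATH appear in the baseline nvidia rule list?
# Host and binary pinning from the real policy are not modeled here.
nvidia_baseline_allows() {
  case "$1 $2" in
    "POST /v1/chat/completions"|"POST /v1/completions"|"POST /v1/embeddings")
      return 0 ;;
    "GET /v1/models"|"GET /v1/models/"*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

nvidia_baseline_allows POST /v1/chat/completions && echo "allowed"   # prints: allowed
nvidia_baseline_allows POST /v1/responses || echo "denied"           # prints: denied
```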

Runtime switching — same provider, different model

Hot swap via OpenShell. Examples:

# NVIDIA — pick a different Nemotron / third-party hosted model
openshell inference set --provider nvidia-prod --model nvidia/nemotron-3-super-120b-a12b

# OpenAI
openshell inference set --provider openai-api --model gpt-5.4

# Anthropic
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6

# Gemini
openshell inference set --provider gemini-api --model gemini-2.5-flash

# Your custom compatible endpoint
openshell inference set --provider compatible-endpoint --model <model-name>
openshell inference set --provider compatible-anthropic-endpoint --model <model-name>

Verify:

nemoclaw <name> status              # active provider, model, endpoint
nemoclaw <name> status --json       # machine-readable

Runtime switching — cross-provider family

Two-phase operation. Say you're moving from NVIDIA → Anthropic:

# 1. Set the OpenShell gateway route on the host (skip validation since
#    the sandbox isn't yet configured to speak Anthropic-messages)
openshell inference set --provider anthropic-prod --model claude-sonnet-4-6 --no-verify

# 2. Set override env vars + recreate the sandbox
export NEMOCLAW_MODEL_OVERRIDE="anthropic/claude-sonnet-4-6"
export NEMOCLAW_INFERENCE_API_OVERRIDE="anthropic-messages"
nemoclaw onboard --resume --recreate-sandbox

The sandbox entrypoint picks up the overrides at startup and patches /sandbox/.openclaw/openclaw.json in place. No image rebuild — the same image boots into a different API style.

To revert: unset both env vars and run nemoclaw onboard --resume --recreate-sandbox again.
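
The two phases can be wrapped in one function so they are never run half-way. The sketch below is an illustration only: run and switch_provider_family are hypothetical names, and by default it only prints the commands (dry run). It passes the overrides through env for the single onboard call instead of exporting them, which is a deliberate difference from the steps above; whether later nemoclaw commands also need the variables in the environment is not covered by this doc.

```shell
# Dry-run wrapper: prints each command instead of running it unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the two-phase cross-provider switch.
# Args: provider, gateway model, NEMOCLAW_MODEL_OVERRIDE value, API style.
switch_provider_family() {
  provider=$1
  model=$2
  model_override=$3
  api_style=$4

  # Phase 1: repoint the gateway route, skipping validation.
  run openshell inference set --provider "$provider" --model "$model" --no-verify || return 1

  # Phase 2: recreate the sandbox with the override env vars set.
  run env \
    NEMOCLAW_MODEL_OVERRIDE="$model_override" \
    NEMOCLAW_INFERENCE_API_OVERRIDE="$api_style" \
    nemoclaw onboard --resume --recreate-sandbox
}

# Dry run of the NVIDIA to Anthropic example above.
switch_provider_family anthropic-prod claude-sonnet-4-6 \
  anthropic/claude-sonnet-4-6 anthropic-messages
```

Setting DRY_RUN=0 before the call runs the commands for real; stopping after a failed phase 1 avoids recreating the sandbox against a route that was never set.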

Local inference — the four paths

Path 1: Ollama (the default local option)

The onboard wizard auto-detects Ollama if it's installed or running. On macOS it can also offer to install it via Homebrew. If nothing's installed, it suggests starter models; otherwise it lists your existing ones. It then pulls whichever you pick, warms it, and validates it.

nemoclaw onboard
# → select "Local Ollama", pick a model

The Linux + Docker Ollama gotcha — read this: when NemoClaw runs under Docker on Linux, the sandbox reaches Ollama via http://host.openshell.internal:11434, not localhost. If your Ollama binds 127.0.0.1 (Ollama's default), host-side detection during onboard passes but sandbox-side validation fails because containers can't reach it. Fix:

OLLAMA_HOST=0.0.0.0:11434 ollama serve
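
Before onboarding, it can help to check which kind of address Ollama is configured to bind. The sketch below classifies an OLLAMA_HOST value; ollama_bind_scope is a hypothetical helper, it handles IPv4 and plain hostnames only, and it inspects the setting rather than probing the running server.

```shell
# Hypothetical helper: classify an OLLAMA_HOST value by how reachable it is.
# An empty argument is treated as Ollama's default bind of 127.0.0.1:11434.
ollama_bind_scope() {
  addr=${1:-127.0.0.1:11434}
  addr=${addr#http://}          # tolerate an optional scheme prefix
  host=${addr%%:*}              # drop the :port suffix if present
  case "$host" in
    127.0.0.1|localhost) echo "loopback-only" ;;     # sandbox-side validation will fail
    0.0.0.0)             echo "all-interfaces" ;;    # reachable from containers
    *)                   echo "specific-address" ;;  # depends on your network setup
  esac
}

ollama_bind_scope "${OLLAMA_HOST:-}"
```

Keep in mind that 0.0.0.0 listens on every network interface of the host, not only the one Docker uses.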

Non-interactive:

NEMOCLAW_PROVIDER=ollama \
NEMOCLAW_MODEL=qwen2.5:14b \
nemoclaw onboard --non-interactive

NEMOCLAW_MODEL is optional — if omitted, NemoClaw picks a default based on available memory.

Path 2: Any OpenAI-compatible server

Works with vLLM, TensorRT-LLM, llama.cpp, LocalAI, OpenRouter, anything that implements /v1/chat/completions. Start your server, then onboard:

# Example with vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

nemoclaw onboard
# → select "Other OpenAI-compatible endpoint"
# → base URL: http://localhost:8000/v1
# → API key: any non-empty string (e.g. "dummy") if no auth

Non-interactive:

NEMOCLAW_PROVIDER=custom \
NEMOCLAW_ENDPOINT_URL=http://localhost:8000/v1 \
NEMOCLAW_MODEL=meta-llama/Llama-3.1-8B-Instruct \
COMPATIBLE_API_KEY=dummy \
nemoclaw onboard --non-interactive
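
In scripts and CI, a non-interactive onboard is easier to debug if missing inputs are caught first. The following is a minimal sketch of such a guard; require_env is a hypothetical helper, not a NemoClaw feature, and it only checks that each named variable is non-empty.

```shell
# Hypothetical guard: fail early if any named env var is unset or empty.
# Names are expanded with eval, so pass literal variable names only.
require_env() {
  missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
}

# Usage (left commented out so this block has no side effects):
# require_env NEMOCLAW_PROVIDER NEMOCLAW_ENDPOINT_URL NEMOCLAW_MODEL COMPATIBLE_API_KEY \
#   && nemoclaw onboard --non-interactive
```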

Path 3: Any Anthropic-compatible server

Same idea but for servers implementing /v1/messages:

nemoclaw onboard
# → select "Other Anthropic-compatible endpoint"

Non-interactive:

NEMOCLAW_PROVIDER=anthropicCompatible \
NEMOCLAW_ENDPOINT_URL=http://localhost:8080 \
NEMOCLAW_MODEL=my-model \
COMPATIBLE_ANTHROPIC_API_KEY=dummy \
nemoclaw onboard --non-interactive

Path 4: vLLM or NIM auto-detection (experimental)

Gated behind NEMOCLAW_EXPERIMENTAL=1.

vLLM — if already running on localhost:8000, NemoClaw queries /v1/models and uses whatever is loaded:

NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard
# → "Local vLLM [experimental]"
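
To see roughly what "uses whatever is loaded" involves, the sketch below pulls the first model id out of an OpenAI-style model list, which has the shape {"object":"list","data":[{"id":...}]}. The helper name first_model_id and the sample response are made up for illustration, and this says nothing about how NemoClaw parses the response internally.

```shell
# Hypothetical helper: read an OpenAI-style /v1/models response on stdin
# and print the first "id" value. A text-level sketch, not a JSON parser.
first_model_id() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Made-up sample response for illustration.
sample='{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct","object":"model"}]}'
printf '%s\n' "$sample" | first_model_id   # prints: meta-llama/Llama-3.1-8B-Instruct

# Against a live server this would be:
# curl -fsS http://localhost:8000/v1/models | first_model_id
```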

NIM — on NIM-capable GPUs, NemoClaw will pull, start, and manage the NIM container itself, filtering models by your GPU's VRAM:

NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard
# → "Local NVIDIA NIM [experimental]"

Shared caveat: NemoClaw forces /v1/chat/completions for both vLLM and NIM. Neither applies its --tool-call-parser on /v1/responses, so tool calls sent through that endpoint would come back as raw text. This is a known workaround, not a bug.

Timeouts for local inference

The default timeout is 180 seconds for Ollama, vLLM, and NIM (cloud APIs use a shorter one) because big prompts on DGX Spark need the headroom. The value is baked into the sandbox at build time, so changing it needs a re-onboard:

export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300
nemoclaw onboard

Where this lands in the rest of the doc set

The mental model stays the same as doc 02: same-provider model switches stay at the OpenShell layer and are hot; cross-provider switches cross into OpenClaw's baked config and need a sandbox recreate. Doc 04's static-vs-hot-reloadable distinction still holds — network and inference layers are dynamic, everything else (including the OpenClaw-side API style preference) is static at the sandbox image level.

