Understanding AI Memory
The Understanding AI Memory series examines how memory behaves while AI models run, and how that behavior shapes platform design choices. The articles break down the main runtime memory components that drive capacity planning, concurrency limits, and infrastructure design.
Series overview
4 parts • Latest: Understanding Unified Memory on DGX Spark Running NemoClaw and Nemotron
- Explains how the KV cache and context length drive LLM runtime memory growth, and how this makes GPU concurrency during inference predictable (a sizing sketch follows this list).
- Explains how activation memory behaves in Mixture-of-Experts models, and why long-context and agentic inference introduce unpredictable activation peaks during the prefill phase (see the second sketch after this list).
- Shows how agentic AI workloads accumulate KV cache across reasoning steps and tool calls, and why this changes GPU memory planning for on-prem infrastructure.
- NemoClaw became the talk of GTC 2026 within hours of its announcement. It wraps OpenClaw in NVIDIA's OpenShell runtime, adds guardrails, and delivers an always-on AI agent with a single install. Jensen Huang called …
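
To make the capacity math behind the first and third parts concrete, here is a minimal sketch of the KV-cache arithmetic. The model dimensions are illustrative assumptions (a generic grouped-query-attention transformer with a Llama-3-8B-like shape, fp16 cache), not the published NemoClaw or Nemotron configuration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence: two tensors (K and V) per
    layer, each shaped [num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative GQA config: 32 layers, 8 KV heads, head_dim 128, fp16.
per_seq = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(f"KV cache per 8K-token sequence: {per_seq / 2**30:.2f} GiB")  # ~1.00 GiB

# Concurrency follows directly: divide the memory left over after
# weights and activations by the per-sequence cache.
free_gib = 40  # assumed GPU headroom, for illustration only
print(f"Max concurrent 8K sequences: {int(free_gib * 2**30 // per_seq)}")  # 40

# Agentic sessions accumulate KV cache: every reasoning step and tool
# call appends tokens to the shared context.
context = 2048  # initial prompt tokens
for step_tokens in (512, 1536, 4096):  # per-step growth, illustrative
    context += step_tokens
    gib = kv_cache_bytes(32, 8, 128, seq_len=context) / 2**30
    print(f"after step: {context:>5} tokens -> {gib:.2f} GiB KV cache")
```

The same formula covers both parts: because KV cache grows linearly with context length, per-GPU concurrency is a simple division for fixed-length serving, while agentic sessions push per-session memory up monotonically until the session ends or the cache is evicted.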
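For the activation-peak claim in the second part, a rough per-layer estimate shows why prefill dominates. The dimensions are again illustrative (Mixtral-8x7B-like top-2 routing), and the formula is a simplification that ignores attention activations and capacity-factor padding.

```python
def moe_prefill_ffn_activation_bytes(prompt_tokens: int, top_k: int,
                                     ffn_dim: int, dtype_bytes: int = 2) -> int:
    """Intermediate FFN activations live during one MoE layer's prefill:
    each token is routed to top_k experts, each producing an
    ffn_dim-wide intermediate activation."""
    return prompt_tokens * top_k * ffn_dim * dtype_bytes

# Prefill activation scales with prompt length, so peaks track whatever
# context the workload happens to send.
for prompt in (1024, 8192, 65536):
    gib = moe_prefill_ffn_activation_bytes(prompt, top_k=2, ffn_dim=14336) / 2**30
    print(f"{prompt:>6} prompt tokens -> ~{gib:.2f} GiB per MoE layer")

# Decode processes one token at a time, so the same layer needs only
# ~top_k * ffn_dim * 2 bytes (about 56 KiB) of intermediate activation.
```

Because prompt lengths in long-context and agentic workloads vary per request, this prefill term spikes unpredictably even when steady-state decode memory looks flat.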