Understanding AI Memory
The Understanding AI Memory series examines how memory behaves while AI models run, and how that behavior shapes platform design choices. The articles break down the main runtime memory components that drive capacity planning, concurrency limits, and infrastructure design.
Series overview
4 parts • Latest: Understanding Unified Memory on DGX Spark Running NemoClaw and Nemotron
- Explains how the KV cache and context length drive LLM runtime memory growth, and how this makes GPU concurrency during inference predictable (a sizing sketch follows this list).
- Explains how activation memory behaves in Mixture-of-Experts models, and why long-context and agentic inference introduce unpredictable activation peaks during the prefill phase (see the second sketch after this list).
- Shows how agentic AI workloads accumulate KV cache across reasoning steps and tool calls, and why this changes GPU memory planning for on-prem infrastructure.
- NemoClaw became the talk of GTC 2026 within hours of its announcement. It wraps OpenClaw in NVIDIA's OpenShell runtime, adds guardrails, and delivers an always-on AI agent with a single install. Jensen Huang called …
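
To make the capacity math behind the first and third parts concrete, here is a minimal sketch of the KV-cache arithmetic. The model dimensions are illustrative assumptions (a generic grouped-query-attention transformer with a Llama-3-8B-like shape, fp16 cache), not the published NemoClaw or Nemotron configuration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence: two tensors (K and V) per
    layer, each shaped [num_kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative GQA config: 32 layers, 8 KV heads, head_dim 128, fp16.
per_seq = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(f"KV cache per 8K-token sequence: {per_seq / 2**30:.2f} GiB")  # ~1.00 GiB

# Concurrency follows directly: divide the memory left over after
# weights and activations by the per-sequence cache.
free_gib = 40  # assumed GPU headroom, for illustration only
print(f"Max concurrent 8K sequences: {int(free_gib * 2**30 // per_seq)}")  # 40

# Agentic sessions accumulate KV cache: every reasoning step and tool
# call appends tokens to the shared context.
context = 2048  # initial prompt tokens
for step_tokens in (512, 1536, 4096):  # per-step growth, illustrative
    context += step_tokens
    gib = kv_cache_bytes(32, 8, 128, seq_len=context) / 2**30
    print(f"after step: {context:>5} tokens -> {gib:.2f} GiB KV cache")
```

The same formula covers both parts: because KV cache grows linearly with context length, per-GPU concurrency is a simple division for fixed-length serving, while agentic sessions push per-session memory up monotonically until the session ends or the cache is evicted.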
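For the activation-peak claim in the second part, a rough per-layer estimate shows why prefill dominates. The dimensions are again illustrative (Mixtral-8x7B-like top-2 routing), and the formula is a simplification that ignores attention activations and capacity-factor padding.

```python
def moe_prefill_ffn_activation_bytes(prompt_tokens: int, top_k: int,
                                     ffn_dim: int, dtype_bytes: int = 2) -> int:
    """Intermediate FFN activations live during one MoE layer's prefill:
    each token is routed to top_k experts, each producing an
    ffn_dim-wide intermediate activation."""
    return prompt_tokens * top_k * ffn_dim * dtype_bytes

# Prefill activation scales with prompt length, so peaks track whatever
# context the workload happens to send.
for prompt in (1024, 8192, 65536):
    gib = moe_prefill_ffn_activation_bytes(prompt, top_k=2, ffn_dim=14336) / 2**30
    print(f"{prompt:>6} prompt tokens -> ~{gib:.2f} GiB per MoE layer")

# Decode processes one token at a time, so the same layer needs only
# ~top_k * ffn_dim * 2 bytes (about 56 KiB) of intermediate activation.
```

Because prompt lengths in long-context and agentic workloads vary per request, this prefill term spikes unpredictably even when steady-state decode memory looks flat.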