Understanding Unified Memory on DGX Spark Running NemoClaw and Nemotron
NemoClaw became the talk of GTC 2026 within hours of its announcement. It wraps OpenClaw in NVIDIA’s OpenShell runtime, adds guardrails, and gives you an always-on AI agent with a single install. Jensen Huang called OpenClaw the operating system for personal AI. NemoClaw is what makes that usable.
This is part 4 of the AI Memory series and focuses on how memory behaves on real systems.
I installed NemoClaw on a DGX Spark and ran Nemotron models locally to understand what actually happens in memory. The most important takeaway is simple. Unified memory breaks the usual GPU mental model.
On a traditional system, the GPU has its own memory, tools like nvidia-smi show usage, and free memory roughly maps to what you can still use. On DGX Spark, CPU and GPU share one memory pool. The signals you are used to no longer tell the full story.
The models used by NemoClaw are Mixture of Experts models. Dense models activate all parameters for every token, so memory and compute scale together. MoE models behave differently. All parameters must be present in memory, but only a subset is used per token. That creates two separate budgets. Total parameters define the memory footprint. Active parameters define the compute cost. The earlier posts in this series explain this in detail:
Part 1 - The Dynamic World of LLM Runtime Memory
Part 2 - Understanding Activation Memory in Mixture of Experts Models
Part 3 - Durable Agentic AI Sessions in GPU Memory
On a unified memory system, this separation becomes very visible. The model either fits based on total parameters or it does not. Everything that matters operationally depends on what memory remains after that.
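The two budgets are easy to put in numbers. The sketch below is illustrative only: the parameter counts and FP16 precision are placeholder assumptions, not published figures for any Nemotron model, and the quantized builds Ollama ships are considerably smaller than an FP16 footprint.

```python
# Sketch of the two MoE budgets: total parameters set the memory
# footprint, active parameters set the per-token compute cost.
# Parameter counts below are illustrative, not real model specs.

def moe_budgets(total_params, active_params, bytes_per_param=2):
    """Return (memory footprint in GiB, fraction of params active per token)."""
    footprint_gib = total_params * bytes_per_param / 2**30
    active_fraction = active_params / total_params
    return footprint_gib, active_fraction

# Hypothetical 120B-total / 12B-active MoE at FP16 (2 bytes/param):
footprint, fraction = moe_budgets(120e9, 12e9)
print(f"memory footprint: {footprint:.0f} GiB")      # all experts resident
print(f"compute per token: {fraction:.0%} of params")  # only routed experts run
```

The point the numbers make: the footprint is fixed by the total, no matter how few experts fire per token, which is exactly why fit-or-not is decided by total parameters on a unified memory box.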
Installing NemoClaw and selecting a model
After I followed the DGX Spark NemoClaw playbook, the installer detected Ollama and suggested running locally:
Inference options:
1) NVIDIA Endpoint API (build.nvidia.com)
2) Local Ollama (localhost:11434) — running (suggested)
The default model is Nemotron 3 Nano. The larger Nemotron 3 Super is available but not selected by default. That choice already hints at what matters on this system: not just whether a model fits, but how much room is left after it does.
Loading Nano
frankdenneman@spark:~$ ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
nemotron-3-nano:30b b725f1117407 27 GB 100% GPU 262144 4 minutes from now
Nano downloads as 24 GB and becomes 27 GB in memory. The difference comes from decompression and preallocated context buffers. Ollama reserves space for the full context window up front.
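The gap between download size and resident size can be sanity-checked with a back-of-the-envelope KV-cache estimate. The layer count, KV-head count, and head size below are illustrative guesses, not the actual Nemotron 3 Nano architecture; a hybrid design with few attention layers, or a quantized cache, is what keeps the reservation in the low gigabytes at a 262K window.

```python
# Rough size of a preallocated KV cache for one context window.
# Architecture numbers are hypothetical placeholders.

def kv_cache_gib(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elt=2):
    # K and V tensors per layer, one entry per token position, FP16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * context_len / 2**30

# Hypothetical: 4 attention layers, 8 KV heads, head_dim 128, 262144-token window.
print(f"{kv_cache_gib(262144, 4, 8, 128):.1f} GiB reserved up front")
```

With these assumed numbers the reservation lands in the same ballpark as the ~3 GB growth observed here; a conventional dense-attention stack at the same window would reserve an order of magnitude more.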
frankdenneman@spark:~$ free -h
total used free shared buff/cache available
Mem: 121Gi 32Gi 66Gi 56Mi 23Gi 89Gi
Swap: 15Gi 120Ki 15Gi
About 32 GiB is in use and 89 GiB remains available. All model parameters are resident, which defines the memory footprint, and the remaining space is what you can use for everything else.
Loading Super
frankdenneman@spark:~$ ollama ps
NAME SIZE PROCESSOR CONTEXT
nemotron-3-super:120b 94 GB 100% GPU 262144
Super downloads as 86 GB and becomes 94 GB in memory.
frankdenneman@spark:~$ free -h
total used free shared buff/cache available
Mem: 121Gi 94Gi 861Mi 56Mi 27Gi 27Gi
Swap: 15Gi 120Ki 15Gi
At first glance this looks like the system is out of memory, but it is not. The available column shows 27 GiB of usable headroom. All parameters are loaded and ready, and what remains is the space available for runtime behavior.
Reading memory on DGX Spark
On DGX Spark there is no separate VRAM. CPU and GPU share one memory pool, which changes how you read the system. The nvidia-smi memory gauge is not useful here, and the real signal comes from Linux.
frankdenneman@spark:~$ nvidia-smi
Mon Mar 23 15:56:03 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142 Driver Version: 580.142 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 37C P0 10W / N/A | Not Supported | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2523 G /usr/lib/xorg/Xorg 43MiB |
| 0 N/A N/A 2950 G /usr/bin/gnome-shell 16MiB |
| 0 N/A N/A 881931 C /usr/local/bin/ollama 89709MiB |
+-----------------------------------------------------------------------------------------+
The usual memory bar is not available. The process view still shows allocations, but it does not reflect total system headroom. For that, you need to look at the OS.
frankdenneman@spark:~$ free -h
total used free shared buff/cache available
Mem: 121Gi 94Gi 861Mi 56Mi 27Gi 27Gi
Swap: 15Gi 120Ki 15Gi
Free memory looks low because Linux uses spare memory as page cache, holding recently read data such as the model file that was just loaded. That memory is not locked and is reclaimed on demand.
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
After dropping cache, free memory jumps to match available. Nothing changed for the model because the weights were already in CUDA managed memory. The key mental shift is that available memory is your real headroom, while free memory is simply what is unused at that moment.
frankdenneman@spark:~$ free -h
total used free shared buff/cache available
Mem: 121Gi 93Gi 28Gi 56Mi 1.0Gi 28Gi
Swap: 15Gi 120Ki 15Gi
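If you want this reading programmatically rather than eyeballing free -h, the same fields can be parsed from /proc/meminfo, where MemAvailable is the headroom signal and MemFree is only what happens to be unused right now. A minimal sketch, fed here with values approximating the Super snapshot above:

```python
# Parse the fields that matter from /proc/meminfo text.
# MemAvailable is the real headroom; MemFree is just unused-right-now.

def headroom_gib(meminfo_text):
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields[key] = int(rest.split()[0])  # /proc/meminfo values are in kiB
    return {k: fields[k] / 2**20 for k in ("MemTotal", "MemFree", "MemAvailable")}

# Sample approximating the Super snapshot (kiB values):
sample = """MemTotal:       126877696 kB
MemFree:          881664 kB
MemAvailable:   28311552 kB"""
print(headroom_gib(sample))
```

On the machine itself you would pass open("/proc/meminfo").read() instead of the sample string.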
What headroom really means
Headroom determines what you can actually run. The context window you see in ollama ps is allocated per parallel request. Each additional slot requires its own memory reservation. For Super at 262K context, that is roughly 7 GB per slot.
With about 27 GiB of headroom, running a single agent is straightforward. Running multiple agents is possible, but each additional slot reduces the margin for activation spikes and OS overhead.
Nano leaves about 89 GiB of headroom, which allows multiple agents, larger context windows, and more flexibility. This is why the installer defaults to Nano: not because Super cannot run, but because Nano leaves room for actual usage.
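The slot arithmetic is worth making explicit. Treat the numbers as ballpark values: the 7 GiB per-slot figure is the rough Super-at-262K reservation mentioned above, and the safety reserve for activation spikes and OS overhead is an assumption, not a measured threshold.

```python
# How many parallel request slots fit in the remaining headroom,
# keeping a safety margin. All figures are rough planning numbers.

def max_slots(headroom_gib, per_slot_gib, reserve_gib=8.0):
    usable = headroom_gib - reserve_gib          # hold back margin for spikes/OS
    return max(int(usable // per_slot_gib), 0)   # whole slots only

print(max_slots(27, 7))   # Super-sized headroom: tight
print(max_slots(89, 7))   # Nano-sized headroom: plenty of room
```

The asymmetry is the whole story of the default model choice: the same per-slot cost that caps Super at a couple of agents leaves Nano with room for many.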
Observing behavior under load
I ran a sustained agent workload for 27 minutes and monitored memory. Available memory stayed stable at around 27 GiB. Free memory fluctuated as the kernel reclaimed and reused page cache.

This is expected behavior. The kernel does not keep large amounts of memory unused. It reclaims what it needs when it needs it. The system never touched swap.
Swap on unified memory
Swap is disk space, not part of the memory pool. It does not increase headroom. On unified memory systems, swap introduces a different failure mode. If the kernel pages out model data and that data is needed again, inference stalls. With MoE models, where routing is dynamic, this can happen unpredictably. If swap usage increases, performance is already degraded.
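Because swap growth is a trailing indicator here, a cheap periodic check is worth more than a dashboard. The sketch below computes swap in use from /proc/meminfo fields; the sample values approximate the snapshot shown earlier, and any alerting threshold you bolt on is your own policy choice.

```python
# Health check: sustained swap usage on a unified-memory box means
# paging is already hurting inference. Field names are /proc/meminfo's.

def swap_in_use_kib(meminfo_text):
    fields = dict(
        (line.split(":")[0], int(line.split()[1]))
        for line in meminfo_text.splitlines() if ":" in line
    )
    return fields["SwapTotal"] - fields["SwapFree"]

# Sample approximating the earlier snapshot (120 kiB in use):
sample = "SwapTotal: 16252924 kB\nSwapFree: 16252804 kB"
print("swap in use:", swap_in_use_kib(sample), "kiB")
```

A persistently nonzero and growing value is the signal the section describes: by then, latency has already suffered.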
What to take away
Unified memory changes how you read GPU systems. The model fitting into memory is only the starting point. What matters is the headroom that remains.
Use free -h and focus on available memory. Do not rely on free memory alone. Do not rely on nvidia-smi for capacity planning. Treat swap usage as a signal that you are beyond safe operating conditions.
Most importantly, think in terms of headroom. That is what determines how many agents you can run, how much context you can support, and how stable the system will be under load.
Posts in this series
- Part 1 - The Dynamic World of LLM Runtime Memory
- Part 2 - Understanding Activation Memory in Mixture of Experts Models
- Part 3 - Durable Agentic AI Sessions in GPU Memory
- Part 4 - Understanding Unified Memory on DGX Spark Running NemoClaw and Nemotron