SYLEN
© 2026 Systems Leadership and Engineering Network. sylen.org.

Systems Architecture | Source: github.com | June 2, 2026

Bypassing GeForce Driver Restrictions: `nbd-vram` Leverages NBD and CUDA to Turn Idle VRAM into High-Priority Swap

The open-source `nbd-vram` daemon exposes idle NVIDIA GPU memory as a Linux block device using the Network Block Device protocol and the CUDA driver API. By sidestepping the driver-enforced P2P restrictions on consumer GPUs, the tool provides a low-latency swap target suited to memory-constrained hybrid laptops.

Architecture and Data Path

The `nbd-vram` utility implements a userspace daemon that allocates video memory (VRAM) via the CUDA driver API (`libcuda.so.1`) and exposes it as a Linux swap device. It routes block traffic through the kernel's native Network Block Device (NBD) module over a Unix socket. This design avoids the need for proprietary kernel modules or custom kernel symbols, allowing the implementation to survive kernel and driver updates without requiring recompiles.
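As a rough illustration of the CUDA side of this design, the following Python sketch loads `libcuda.so.1` through `ctypes`, allocates a device buffer, and copies one buffer to VRAM and back. It is not taken from the `nbd-vram` source: the function name `vram_roundtrip` is hypothetical, and the sketch assumes a 64-bit Linux system where the driver exports the `_v2` entry points. It returns `None` when no usable driver or GPU is present.

```python
import ctypes


def vram_roundtrip(payload: bytes, ordinal: int = 0):
    """Copy payload into VRAM and back (hypothetical helper, not nbd-vram code).

    Returns the bytes read back, or None if no CUDA driver or GPU is usable.
    """
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not installed

    def check(status):
        if status != 0:  # CUDA_SUCCESS is 0
            raise RuntimeError(f"CUDA driver call failed with code {status}")

    dev = ctypes.c_int()
    ctx = ctypes.c_void_p()
    dptr = ctypes.c_uint64()  # CUdeviceptr is a 64-bit handle on 64-bit Linux
    size = ctypes.c_size_t(len(payload))

    try:
        check(cuda.cuInit(0))
        check(cuda.cuDeviceGet(ctypes.byref(dev), ordinal))
        check(cuda.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev))
    except RuntimeError:
        return None  # driver present but no usable device

    try:
        # Allocate VRAM, then do the write path (host to device)
        # followed by the read path (device to host).
        check(cuda.cuMemAlloc_v2(ctypes.byref(dptr), size))
        check(cuda.cuMemcpyHtoD_v2(dptr, payload, size))
        out = ctypes.create_string_buffer(len(payload))
        check(cuda.cuMemcpyDtoH_v2(out, dptr, size))
        return out.raw
    finally:
        if dptr.value:
            cuda.cuMemFree_v2(dptr)
        cuda.cuCtxDestroy_v2(ctx)
```

Because only the user-space driver library is involved, nothing in this sketch depends on kernel headers or on building a module, which is the property the design relies on.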

The execution data path traverses several abstraction layers. When the kernel swap subsystem issues an I/O request, it targets the designated NBD block device (`/dev/nbdX`). The kernel's built-in `nbd` driver forwards the transaction payload over a local Unix socket to the `nbd-vram` daemon. The daemon receives the payload and issues `cuMemcpyHtoD` (Host to Device) for writes or `cuMemcpyDtoH` (Device to Host) for reads to commit or retrieve the swap pages from the physical VRAM.
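The request loop described above can be sketched in Python as follows. This is a minimal illustration of NBD transmission-phase framing, not the daemon's actual code: a `bytearray` stands in for VRAM, so the slice assignments mark the places where a real daemon would call `cuMemcpyHtoD` and `cuMemcpyDtoH`. The names `serve` and `recv_exact` are hypothetical, and setup steps such as attaching the socket to `/dev/nbdX` are omitted.

```python
import socket
import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_REPLY_MAGIC = 0x67446698
NBD_CMD_READ, NBD_CMD_WRITE, NBD_CMD_DISC = 0, 1, 2
EINVAL = 22

# Big-endian wire formats of the NBD simple request and reply headers.
REQUEST = struct.Struct(">IHHQQI")  # magic, flags, type, handle, offset, length
REPLY = struct.Struct(">IIQ")       # magic, error, handle


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes the socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the socket")
        buf += chunk
    return bytes(buf)


def serve(sock, backing):
    """Handle requests one at a time until a disconnect command arrives."""
    while True:
        header = recv_exact(sock, REQUEST.size)
        magic, _flags, cmd, handle, offset, length = REQUEST.unpack(header)
        if magic != NBD_REQUEST_MAGIC:
            raise ValueError("bad request magic")
        if cmd == NBD_CMD_DISC:
            return
        # A write carries its payload right after the header; always consume it.
        data = recv_exact(sock, length) if cmd == NBD_CMD_WRITE else b""
        in_range = offset + length <= len(backing)
        if cmd == NBD_CMD_WRITE and in_range:
            backing[offset:offset + length] = data  # stand-in for cuMemcpyHtoD
            sock.sendall(REPLY.pack(NBD_REPLY_MAGIC, 0, handle))
        elif cmd == NBD_CMD_READ and in_range:
            page = bytes(backing[offset:offset + length])  # stand-in for cuMemcpyDtoH
            sock.sendall(REPLY.pack(NBD_REPLY_MAGIC, 0, handle) + page)
        else:
            sock.sendall(REPLY.pack(NBD_REPLY_MAGIC, EINVAL, handle))
```

The loop answers each request before reading the next one, so requests are handled strictly in order.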

The Barrier to Direct P2P Access

The implementation of `nbd-vram` is explicitly structured to bypass hardcoded driver-level locks on consumer-grade NVIDIA GeForce hardware. The standard approach for mapping VRAM directly to system memory involves invoking the `nvidia_p2p_get_pages_persistent` API, which pins target VRAM pages in BAR1 to allow direct CPU access via write-combining physical maps (`ioremap_wc`). On GeForce GPUs, this interface returns `EINVAL`, as the NVIDIA Resource Manager restricts physical address space pinning to enterprise-class Quadro and datacenter SKUs.

Attempts to bypass the P2P API by directly mapping the physical BAR1 address space with `ioremap_wc` also fail. On consumer cards, the GPU's internal page tables restrict the mapped BAR1 space to roughly 16 MiB, just enough for the display framebuffer. Reads outside this window fail silently and return only zeros. Consequently, while standard tooling like `mkswap` appears to succeed by writing to the initial mapped area, subsequent `swapon` calls fail because the necessary swap headers cannot be read back. Using standard CUDA memory copy APIs behind an NBD socket avoids these BAR1 mapping constraints entirely.

Performance Profiles and Overhead

Synthetic benchmarks reveal distinct trade-offs between raw throughput and random I/O latency when comparing `nbd-vram` to a PCIe 4.0 NVMe drive using `dm-crypt` cryptswap. Under sequential loads, a 2 GiB direct I/O (`O_DIRECT`) write test drops from 2.7 GB/s on NVMe to 1.1 GB/s on VRAM. Sequential reads drop from 2.9 GB/s on NVMe to 2.3 GB/s on VRAM. This throughput penalty is attributed to the overhead of passing blocks across the Unix socket and to the user-to-kernel context switches incurred for each CUDA memory copy operation.

For random I/O workloads with high queue depths, NVMe maintains an advantage due to hardware-level parallel execution. A random 4K I/O benchmark using `fio` at an I/O depth of 32 demonstrates the following results:

  • NVMe cryptswap: 45.4k Read IOPS, 45.3k Write IOPS, 343 us average latency
  • VRAM NBD: 28.7k Read IOPS, 28.7k Write IOPS, 550 us average latency

Because the `nbd-vram` daemon serializes all incoming block requests before submitting them to CUDA, the system cannot take advantage of high parallel queue depths.
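The reported latencies can be sanity-checked against the IOPS figures with Little's law, which gives average latency as the number of requests in flight divided by throughput. The sketch below assumes the reads and writes came from a single mixed `fio` job sharing the queue depth of 32, which the article does not state; the function name is illustrative.

```python
def estimated_latency_us(queue_depth, read_iops, write_iops):
    """Little's law: average latency = requests in flight / total throughput."""
    return queue_depth / (read_iops + write_iops) * 1_000_000


# IOPS values taken from the benchmark list above.
nvme_us = estimated_latency_us(32, 45_400, 45_300)
vram_us = estimated_latency_us(32, 28_700, 28_700)

print(f"NVMe cryptswap: ~{nvme_us:.0f} us estimated, 343 us reported")
print(f"VRAM NBD:       ~{vram_us:.0f} us estimated, 550 us reported")
```

Under that assumption the estimates come out at roughly 353 us and 557 us, within a few percent of the reported averages, so at this queue depth the latency gap follows directly from the throughput gap.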

Power State Anomalies and Real-World Latency

While NVMe dominates high-throughput parallel workloads, the performance profile shifts during sporadic, low-frequency page faults. In a 4K read benchmark executing one request per second, a pattern typical of passive system swap usage, the average latency of NVMe rises to 9.05 ms, with a maximum of 10.1 ms. This degradation is caused by Autonomous Power State Transition (APST), which moves the NVMe controller into low-power states during idle periods, so each sporadic request first has to wait for the controller to wake up.

In contrast, `nbd-vram` completes the same one-request-per-second test with latencies between 133 us and 490 us, averaging 335 us. Because the GPU VRAM incurs no comparable wake-up penalty under this access pattern, it responds about 27 times faster than the sleeping NVMe drive (9.05 ms versus 335 us on average). To preserve laptop battery life when high performance is not required, the daemon includes a power-aware utility that automatically stops the swap service when the system is unplugged from AC power or when the battery charge falls below a designated threshold.
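The power-aware policy can be sketched as a small decision function fed by the standard Linux `power_supply` sysfs class. This is an illustration rather than the project's utility: the function names and the 30 percent default threshold are assumptions, and the sketch only recognizes supplies whose `type` is `Mains` or `Battery`, so a charger that reports a different type would need extra handling.

```python
from pathlib import Path


def read_power_state(root="/sys/class/power_supply"):
    """Return (on_ac, battery_percent); battery_percent is None without a battery."""
    on_ac, capacity = False, None
    for supply in Path(root).iterdir():
        type_file = supply / "type"
        if not type_file.is_file():
            continue
        kind = type_file.read_text().strip()
        if kind == "Mains" and (supply / "online").is_file():
            on_ac = on_ac or (supply / "online").read_text().strip() == "1"
        elif kind == "Battery" and (supply / "capacity").is_file():
            capacity = int((supply / "capacity").read_text().strip())
    return on_ac, capacity


def should_run_swap(on_ac, capacity, min_percent=30):
    """Keep VRAM swap active only on AC power with enough charge.

    min_percent is a hypothetical default, not a value from the project.
    """
    if not on_ac:
        return False
    return capacity is None or capacity >= min_percent
```

A supervising loop or udev-triggered script could call `should_run_swap(*read_power_state())` and start or stop the swap service when the answer changes.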

Read the original article at github.com.