© 2026 Systems Leadership and Engineering Network. sylen.org.

Infrastructure · Source: blog.tymscar.com · May 31, 2026

Integrating a Tesla V100 SXM2 into a Consumer PCIe Workstation for Low-Cost LLM Inference

An engineer integrated a secondhand £150 NVIDIA Tesla V100 SXM2 into a consumer workstation alongside an RTX 4080 using a third-party SXM2-to-PCIe adapter. After taming the adapter's fan by rerouting its PWM line and pinning a driver that supports both GPUs on NixOS, the hybrid setup runs a 27B-parameter model at about 32 tokens per second.

Hardware Integration and Memory Bandwidth Dynamics

Integrating enterprise-grade accelerators into consumer systems presents mechanical and electrical hurdles. To avoid the cost of high-VRAM consumer cards, a Tesla V100 SXM2 16GB GPU (originally engineered for NVIDIA DGX servers and hyperscaler racks) was retrofitted into a consumer chassis alongside an RTX 4080. Because the SXM2 form factor has no standard PCIe edge connector or display outputs, the deployment relies on a third-party, bare-PCB SXM2-to-PCIe adapter costing £50.

This hybrid configuration yields an aggregate VRAM pool of 32GB for approximately £200 in incremental hardware costs (£150 for the GPU plus £50 for the adapter). While the RTX 4080 uses GDDR6X memory with 736 GB/s of bandwidth, the V100's HBM2 memory sits on a 4096-bit bus delivering 900 GB/s. In raw bandwidth, the older enterprise card beats both the RTX 4080 and Apple's M5 Max (614 GB/s), and reaches about 94% of the AMD RX 7900 XTX's 960 GB/s at a fraction of the cost.
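The bandwidth comparison is plain division, and a short sketch makes the ratios easy to recheck. The figures are the ones quoted in this section; the dictionary keys and the function name are illustrative choices, not anything from the original article.

```python
# Peak memory bandwidth figures quoted in this section, in GB/s.
BANDWIDTH_GBPS = {
    "Tesla V100 SXM2": 900,
    "RTX 4080": 736,
    "Apple M5 Max": 614,
    "AMD RX 7900 XTX": 960,
}


def relative_bandwidth(card: str, baseline: str) -> float:
    """Return the card's bandwidth as a fraction of the baseline's bandwidth."""
    return BANDWIDTH_GBPS[card] / BANDWIDTH_GBPS[baseline]


if __name__ == "__main__":
    for other in BANDWIDTH_GBPS:
        if other != "Tesla V100 SXM2":
            ratio = relative_bandwidth("Tesla V100 SXM2", other)
            print(f"V100 vs {other}: {ratio:.1%}")
```

A ratio above 100% means the V100 has more peak bandwidth than the card it is compared with. These are peak figures only and say nothing about compute throughput.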

Thermal Regulation and PWM Fan Conversion

The SXM2-to-PCIe adapter includes an integrated high-static-pressure fan designed for 2U server enclosures. Unregulated, the fan runs at a constant 100% duty cycle and produces 82 dB of noise. Because software tools such as nvidia-smi or Afterburner cannot interface with the adapter's proprietary fan controller, a physical hardware modification was required.

The adapter's fan utilizes a 4-pin JST PH2.0 connector with a 2.0mm pitch. To enable dynamic speed control, the tachometer and PWM pins were routed to a standard 2.54mm (0.1 inch) motherboard fan header using a custom JST PH2.0-to-2.54mm jumper cable. Once connected directly to the motherboard, the fan was throttled down to a 10% duty cycle via PWM. Under full load, the modified cooling system keeps the V100 below 50°C while eliminating the server-grade acoustic footprint.
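Once the fan is wired to a motherboard header, its duty cycle can be set by whatever controls that header. The article does not say which tool was used, so the sketch below assumes the Linux hwmon sysfs interface, where a `pwmN` file takes a raw value from 0 to 255. The `PWM_NODE` path is hypothetical and must be looked up on the actual machine.

```python
from pathlib import Path

# Hypothetical sysfs node: the real hwmon index and PWM channel depend on
# the motherboard's fan controller and differ from machine to machine.
PWM_NODE = Path("/sys/class/hwmon/hwmon2/pwm1")


def duty_to_pwm(percent: int) -> int:
    """Map a 0-100% duty cycle onto the 0-255 range used by hwmon pwm files."""
    if not 0 <= percent <= 100:
        raise ValueError("duty cycle must be between 0 and 100")
    return (percent * 255 + 50) // 100  # integer round-to-nearest


def set_fan_duty(percent: int, node: Path = PWM_NODE) -> None:
    """Write the raw PWM value. Needs root, and the channel must be in
    manual mode (the matching pwmN_enable file set to 1)."""
    node.write_text(f"{duty_to_pwm(percent)}\n")


if __name__ == "__main__":
    # The 10% duty cycle mentioned above, expressed as a raw hwmon value.
    print(duty_to_pwm(10))
```

Keeping the conversion separate from the sysfs write means the arithmetic can be checked without touching hardware. A fixed low duty cycle is only safe if temperatures are watched under load, as the article reports doing.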

Driver and Toolchain Alignment on NixOS

Managing a heterogeneous GPU configuration spanning two microarchitectures, Ada Lovelace (RTX 4080) and Volta (Tesla V100), introduces driver and software compatibility constraints. NVIDIA deprecated Volta support starting with driver branch 560. Consequently, the deployment pins the legacy 535.x driver branch (packaged as nvidiaPackages.legacy_535 in NixOS), which still drives both architectures simultaneously.

This driver constraint ripples through the rest of the software stack:

  • The pinned driver caps CUDA support at version 12.2, whereas current NixOS package repositories default to CUDA 12.6+. The fix is an overlay that pulls cudaPackages_12_2 from the NixOS 24.05 channel.
  • The Linux kernel must be pinned to version 6.6 LTS, because the legacy NVIDIA driver's kernel module fails to build against newer kernels.
  • Although the system runs strictly as a headless inference node, the X.org server must be enabled (services.xserver.enable = true) to force initialization of the necessary NVIDIA kernel modules.
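The CUDA constraint in the first bullet comes down to a version comparison: the toolkit must not be newer than the highest CUDA version the installed driver supports. The sketch below is a deliberately strict check with illustrative function names. It ignores CUDA's minor-version compatibility rules, which can relax this in some cases.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '12.6' into (12, 6) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def toolkit_fits_driver(toolkit: str, driver_max_cuda: str) -> bool:
    """True if the CUDA toolkit is no newer than the driver's supported maximum."""
    return parse_version(toolkit) <= parse_version(driver_max_cuda)


if __name__ == "__main__":
    # Versions from the text: the driver supports up to CUDA 12.2,
    # while the default packages ship CUDA 12.6 or newer.
    print(toolkit_fits_driver("12.6", "12.2"))
    print(toolkit_fits_driver("12.2", "12.2"))
```

Comparing integer tuples rather than strings matters once minor versions reach two digits, since "12.10" sorts before "12.9" as text.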

Inference Performance

The two GPUs run local inference with llama.cpp, which splits the model across them using tensor splitting (-ts 1.0,1.0). The equal ratio distributes the model layers evenly between the RTX 4080 and the Tesla V100.
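The values passed to -ts are proportions, so 1.0,1.0 asks for an equal share per GPU. As a rough sketch, the weight footprint per card can be estimated by normalizing the ratios and scaling by the 19GB model size reported in this section. This is a simplification: the real placement is done per layer by llama.cpp, and KV cache and scratch buffers add to the total. The function names are illustrative.

```python
def split_shares(ratios: list[float]) -> list[float]:
    """Normalize tensor-split ratios so that the shares sum to 1."""
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive number")
    return [ratio / total for ratio in ratios]


def weights_per_gpu_gb(model_gb: float, ratios: list[float]) -> list[float]:
    """Approximate model weight footprint on each GPU, in GB."""
    return [model_gb * share for share in split_shares(ratios)]


if __name__ == "__main__":
    # 19 GB model, equal split across the RTX 4080 and the V100.
    print(weights_per_gpu_gb(19.0, [1.0, 1.0]))
```

With an equal split, each card holds roughly half of the weights, which leaves headroom on both 16GB GPUs for context and buffers.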

The setup runs Qwen3.6-27B-MTP quantized to Q5_K_M (19GB), with all 99 layers offloaded to VRAM. Reported performance:

  • Inference speed: ~32 tokens per second
  • Prompt processing speed: ~133–160 tokens per second
  • Peak V100 power draw: ~150W
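A back-of-the-envelope check puts the generation speed in context. Token generation is usually limited by memory bandwidth, so a rough ceiling is the time needed to read every weight once per token. The sketch assumes a layer split where the two GPUs work one after the other, an even 9.5GB of weights per card, and 1 GB = 10^9 bytes. It ignores compute, KV cache reads, PCIe transfers, and any gain from multi-token prediction, so treat it as an estimate rather than a measurement.

```python
def decode_ceiling_tps(weights_gb: list[float], bandwidth_gbps: list[float]) -> float:
    """Rough tokens-per-second ceiling when each GPU reads its share of the
    weights once per token and the GPUs run sequentially."""
    if len(weights_gb) != len(bandwidth_gbps):
        raise ValueError("need one bandwidth figure per GPU")
    seconds_per_token = sum(gb / bw for gb, bw in zip(weights_gb, bandwidth_gbps))
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Assumed even split of the 19 GB model; bandwidths as quoted for the
    # RTX 4080 (736 GB/s) and the V100 (900 GB/s).
    ceiling = decode_ceiling_tps([9.5, 9.5], [736.0, 900.0])
    print(f"estimated ceiling: {ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling lands a little above 40 tokens per second, so the reported ~32 tokens per second is a plausible fraction of what the memory systems allow.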

This setup demonstrates that legacy enterprise silicon, combined with pinned drivers and toolchains, can deliver local LLM throughput comparable to commercial API endpoints.

Read the original article at blog.tymscar.com.