Dual AMD R9700 Setup on ASUS X99-E WS/USB 3.1

Running two AMD Radeon AI PRO R9700 cards (gfx1201, RDNA4, 32GB each) on an ageing ASUS X99-E WS/USB 3.1 board is possible, but the platform fights you at every step. The board predates Resizable BAR, Fedora's repositories do not carry a current ROCm release, and the cards have a cold-boot clock-state quirk that keeps inference slow until the workload is restarted. This article documents the full setup: flashing a modded BIOS to enable ReBAR, installing the latest ROCm, and the small service required to get llama.cpp running at full speed on every boot.

Note: The BIOS flash is the only permanent change here, and it carries real risk. Everything else (ROCm, kernel parameters, the restart service) is reversible. Read the BIOS section carefully before flashing, and keep a copy of the stock BIOS for recovery. The modded BIOS file comes from the thread "ASUS X99-E WS and USB 3.1 BIOS mods with ReBAR support".

TL;DR

Flash the community modded BIOS via USB BIOS FlashBack to gain ReBAR support, disable CSM and enable Above 4G Decoding, install ROCm 7.2 from AMD's repository (userspace only, no DKMS, to preserve a working kernel), and add a systemd service that restarts the llama.cpp container around a minute after boot. The restart is needed because the R9700 comes up in a low clock state after a cold boot and only ramps to full speed once the Vulkan context is re-initialised. With all of this in place, a 35GB Q8 model split across both cards runs at roughly 78 tokens per second.

The Hardware

Board: ASUS X99-E WS/USB 3.1, latest stock BIOS 4001 (no official ReBAR)
CPU: Intel i7-6900K (Broadwell-E)
GPUs: two AMD Radeon AI PRO R9700 (Navi 48, gfx1201, 32GB GDDR6 each)
OS: Fedora 44 Server Edition
The two R9700s each sit behind their own onboard PCIe switch and negotiate PCIe 3.0 x16. This is an old platform: it has no native Resizable BAR, and its high MMIO window is small, which becomes important later. The OS is Fedora Server Edition specifically, which matters in Part 2: unlike the Workstation edition, Server does not include the amdgpu kernel driver by default, so it has to be installed before anything else works.

Part 1: The BIOS Patch (Resizable BAR)

The stock 4001 BIOS cannot expose Resizable BAR at all. A community modder built a patched BIOS for this exact board that inserts the ReBarDxe module from the ReBarUEFI project. The modded file comes from the Win-Raid / Level1Techs thread here:

ASUS X99-E WS and USB 3.1 BIOS mods with ReBAR support

For the USB 3.1 variant of the board, the file is X99EU31_Complete.rar, which contains X99EU31.CAP.

Step 1: Back up the stock BIOS

Before flashing anything, download the stock 4001 BIOS from ASUS's support page for the board and keep it somewhere safe. If the modded BIOS misbehaves, this is the file you flash back to recover.

Step 2: Prepare the USB stick

The standard EZ Flash utility inside the BIOS rejects modded files, so the flash has to go through USB BIOS FlashBack, which does not validate the image and works even if the board will not POST.

Format a USB stick as FAT32.
Extract X99EU31.CAP from the archive and place it in the root of the stick. The filename must be exactly X99EU31.CAP for FlashBack to recognise it.
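
A corrupted or misnamed file on the stick is an easy way to lose time here, so it can help to script the copy. The following is a minimal sketch, not part of the original procedure: stage_flashback_image is a hypothetical helper name, and the mount point of the stick is whatever your system assigned. It refuses any filename other than X99EU31.CAP and compares checksums after the copy.

```shell
# Sketch: copy the modded image to the FlashBack stick and verify the copy.
# stage_flashback_image is a hypothetical helper, not an ASUS tool.
stage_flashback_image() {
    local src="$1"      # path to the extracted X99EU31.CAP
    local stick="$2"    # mount point of the FAT32 stick (assumption: already mounted)
    local name="X99EU31.CAP"
    local want got

    # FlashBack looks for this exact name, so refuse anything else.
    if [ "$(basename "$src")" != "$name" ]; then
        echo "error: source must be named $name" >&2
        return 1
    fi

    cp -- "$src" "$stick/$name" || return 1
    sync    # flush pending writes before the stick is unplugged

    # Compare checksums of the source and of the copy on the stick.
    want=$(sha256sum < "$src") || return 1
    got=$(sha256sum < "$stick/$name") || return 1
    if [ "$want" != "$got" ]; then
        echo "error: checksum mismatch after copy" >&2
        return 1
    fi
    echo "ok: $name staged in $stick"
}

# Usage (both paths are examples):
#   stage_flashback_image ~/Downloads/X99EU31.CAP /run/media/$USER/FLASHBACK
```

Unmount the stick cleanly afterwards before moving it to the FlashBack port.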

Step 3: Flash via USB BIOS FlashBack

Plug the stick into the dedicated FlashBack USB port. On this board it is the bottom blue USB 3.0 port in the leftmost stack, directly next to the glowing BIOS FlashBack button. The red ports next to it are eSATA, not USB.
With the system powered off (PSU still switched on), hold the BIOS FlashBack button for around three seconds until the LED starts blinking.
Wait for the LED to finish blinking and switch off. Do not interrupt it. The flash is complete when the light stops.

Step 4: BIOS settings after flashing

Enter the BIOS and set:

CSM (Compatibility Support Module): Disabled. Required. Leaving CSM on results in a black screen.
Above 4G Decoding: Enabled. Required for large BARs.
Secure Boot: Disabled. With CSM off, the board enforces Secure Boot, which will block an unsigned or custom kernel from booting. Disabling it avoids that.
The BIOS version still reports 4001 after the flash. This is expected: the modder kept the original version string. It is not a sign the flash failed.
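
With CSM disabled the board can only boot UEFI installs, so it is worth confirming that the existing Fedora install boots via UEFI before changing the setting. A minimal sketch of that check follows; boot_mode is a hypothetical helper, and its optional argument exists only so the check can be pointed at a test directory instead of the real /sys.

```shell
# Sketch: report whether the running Linux system was booted via UEFI.
# The kernel exposes /sys/firmware/efi only on UEFI boots.
boot_mode() {
    local sysroot="${1:-/sys}"
    if [ -d "$sysroot/firmware/efi" ]; then
        echo "uefi"
    else
        echo "legacy"
    fi
}

# Usage on the real machine:
#   boot_mode             # "uefi" means the install can boot with CSM off
#   mokutil --sb-state    # if mokutil is installed: reports the Secure Boot state
```

If this prints legacy, the install was set up for BIOS/CSM boot and will not start once CSM is disabled.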

A note on what ReBAR actually gets you here

ReBAR works after this flash, which you can confirm because a card's BAR can now be resized at all. But the X99/Broadwell-E platform has a small high MMIO window, and it cannot fit two full 32GB BARs at once. Attempting it produces "can't assign; no space" in the kernel log and a fallback to a 256MB BAR. This is a hard platform limit, not a misconfiguration, and no kernel parameter works around it. The BIOS mod is still worth doing because it enables ReBAR capability for single large-BAR cases, and it is harmless to leave in place.
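
To see which BAR size each card actually ended up with, the sizes can be read from sysfs. This is a sketch under assumptions: bar_size_mib is a hypothetical helper, and it assumes BAR 0 is the VRAM aperture, which is the usual amdgpu layout. Each line of a device's resource file holds a start address, an end address, and flags in hex, so the size is end - start + 1. Two full BARs would need 2 x 32768 MiB = 65536 MiB of address space, which is what the platform cannot supply.

```shell
# Sketch: print the size of one PCI BAR in MiB from a sysfs "resource" file.
bar_size_mib() {
    local resource_file="$1"   # e.g. /sys/bus/pci/devices/<address>/resource
    local index="${2:-0}"      # BAR index; 0 is assumed to be the VRAM aperture
    local line start rest end

    # Line N+1 of the file describes BAR N: "start end flags", all in hex.
    line=$(sed -n "$((index + 1))p" "$resource_file")
    [ -n "$line" ] || { echo 0; return 0; }

    start=${line%% *}
    rest=${line#* }
    end=${rest%% *}

    # An all-zero entry means the BAR is unused or was never assigned.
    if [ "$((end))" -eq 0 ]; then
        echo 0
        return 0
    fi
    echo $(( (end - start + 1) / 1048576 ))
}

# Usage: list BAR 0 for every device bound to amdgpu, then look for failures.
#   for dev in /sys/bus/pci/drivers/amdgpu/0000:*; do
#       echo "$dev: $(bar_size_mib "$dev/resource") MiB"
#   done
#   sudo dmesg | grep -i "can't assign; no space"
```

A result of 256 means the card fell back to the small BAR; 32768 means the full 32GB aperture was assigned.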

If you were chasing GPU peer-to-peer (P2P) specifically, stop here: as of now, P2P between two R9700s is unavailable at the driver level on gfx1201 regardless of BAR size. hipDeviceCanAccessPeer returns false and the kernel establishes no P2P links. It is a known RDNA4 driver limitation, not something the board or BIOS can fix. This may be addressed in a future ROCm or kernel release, so it is worth re-testing after updates. For layer-split inference it does not matter either way, because that workload does not use P2P.

Part 2: Enabling the amdgpu Driver (Fedora Server)

Fedora Server Edition does not include the amdgpu kernel driver in its default install. On a fresh Server install the cards are detected by the PCI subsystem but no driver binds to them, so rocm-smi and any GPU workload will not see them. The Workstation edition includes the driver out of the box; Server does not. This is the first thing to fix, before installing ROCm or running anything.

The amdgpu module ships in the kernel-modules-extra package, which has to match your running kernel version. Install it:

sudo dnf install kernel-modules-extra

After installing, either reboot or load the module manually:

sudo modprobe amdgpu

Confirm the driver is loaded and bound to the cards:

lspci -k | grep -A3 -i "Radeon AI PRO"   # should list amdgpu as the kernel driver in use
lsmod | grep amdgpu                       # module loaded

The container that runs inference also needs access to the GPU device nodes. These are created by the driver once it is loaded:

/dev/kfd (the compute interface)
/dev/dri/renderD128, /dev/dri/renderD129 and /dev/dri/card0, /dev/dri/card1 (the DRI render and primary nodes for each card; the numbering can differ between systems)
Pass these into the container (in the Compose file, via devices:) so llama.cpp can reach the GPUs.
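
Because the node numbering can vary, it may be easier to generate the devices: entries from what actually exists than to type them by hand. The sketch below is illustrative only: compose_device_lines is a hypothetical helper, the indentation of its output is a guess at a typical Compose file, and the optional argument is a root prefix used purely for testing against a fake /dev tree.

```shell
# Sketch: print Compose "devices:" entries for the GPU nodes that exist.
compose_device_lines() {
    local root="${1:-}"
    local node
    for node in "$root"/dev/kfd "$root"/dev/dri/renderD* "$root"/dev/dri/card*; do
        # Unmatched globs stay literal, so skip anything that does not exist.
        [ -e "$node" ] || continue
        node="${node#"$root"}"          # strip the test prefix, if any
        printf '      - %s:%s\n' "$node" "$node"
    done
}

# Usage: paste the output under the service's "devices:" key.
#   compose_device_lines
```

Mapping the whole /dev/dri directory as a single entry is a simpler alternative when the container should see every render node.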

One thing to keep in mind on Fedora Server: every kernel update needs the matching kernel-modules-extra for that kernel version, or the driver will not load after the update. If you boot a new kernel and the cards disappear, an absent kernel-modules-extra for the new version is the usual cause.
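
A quick way to catch this before rebooting into a new kernel is to check every installed kernel's module tree for the amdgpu module. This is a sketch, assuming the standard /lib/modules layout; kernels_missing_amdgpu is a hypothetical helper name.

```shell
# Sketch: list installed kernel versions whose module tree lacks amdgpu.
kernels_missing_amdgpu() {
    local moddir="${1:-/lib/modules}"
    local kdir found
    for kdir in "$moddir"/*/; do
        [ -d "$kdir" ] || continue
        # The module is usually compressed (.ko.xz or .ko.zst), hence the glob.
        found=$(find "$kdir" -name 'amdgpu.ko*' -print -quit 2>/dev/null)
        [ -n "$found" ] || basename "$kdir"
    done
}

# Usage: no output means every installed kernel has the driver.
#   kernels_missing_amdgpu
# If the running kernel is listed, install the matching package:
#   sudo dnf install "kernel-modules-extra-$(uname -r)"
```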

Part 3: Installing the Latest ROCm

Fedora's repositories lag well behind on ROCm. To get a current version (7.2 at time of writing) you add AMD's own repository. The important part is installing the userspace components only and skipping the DKMS kernel module, so you do not disturb the working kernel and the amdgpu driver from Part 2. This reflects the state of the repositories as of now; the distribution may catch up or AMD may change the repository layout in future, so check for a newer version before pinning to 7.2.

Step 1: Add AMD's ROCm repository

sudo dnf install https://repo.radeon.com/amdgpu-install/7.2/rhel/10/amdgpu-install-7.2.70200-1.el10.noarch.rpm

The installer creates several repository definitions. On Fedora some of them point at paths that return 404. Disable the ones that do not resolve and keep only the main ROCm repository:

sudo dnf config-manager setopt amdgraphics.enabled=0
sudo dnf config-manager setopt amdgpu.enabled=0
sudo dnf config-manager setopt amdgpu-proprietary.enabled=0

The repository you keep enabled is [rocm], with a base URL of https://repo.radeon.com/rocm/el10/7.2/main.

Step 2: Install userspace ROCm only

Do not install amdgpu-dkms. The goal is the ROCm runtime and tools, not a replacement kernel driver. Install the components you need:

sudo dnf install libquadmath
sudo dnf install rocm-core hip-runtime-amd hsa-rocr comgr \
    rocm-device-libs rccl rocm-smi-lib rocminfo
sudo dnf install hipcc hip-devel

libquadmath is a dependency that has to go in first. Skip rocm-opencl, which conflicts with ocl-icd. The RCCL version that installs here is 2.27.7.

Step 3: Set up paths

The install symlinks /opt/rocm to /opt/rocm-7.2.0. Add the binaries to your path and register the libraries:

echo 'export PATH=/opt/rocm/bin:$PATH' >> ~/.bashrc
echo '/opt/rocm/lib' | sudo tee /etc/ld.so.conf.d/rocm.conf
sudo ldconfig

Step 4: Verify

rocminfo | grep -cE 'Name: +gfx1201'    # expect 2 (one agent line per card; a bare grep for gfx1201 also counts the ISA lines)
/opt/rocm/bin/rocm-smi        # both cards listed

If a shell does not have the path loaded, call the tool by its full path, /opt/rocm/bin/rocm-smi.

Part 4: Getting llama.cpp to Run at Full Speed

This is the part that actually matters day to day. The inference container runs llama.cpp with the Vulkan backend (RADV/Mesa), layer-split across both R9700s, serving a Qwen3.6-35B-A3B Q8 model. Warm, it runs at around 78 tokens per second. The problem is cold boot.

The symptom

After a fresh boot, the first inference runs at roughly 30 tokens per second. Restarting the container makes it jump to 78 and stay there. The model, the configuration, and the split are identical in both cases.

The diagnosis

Watching the cards with rocm-smi during inference in each state makes the cause clear.

Slow, straight after boot:

card0:  160MHz   GPU 18%
card1: 1093MHz   GPU  6%

Fast, after a container restart:

card0: 1628MHz   GPU 94%
card1: 1770MHz   GPU 99%

Same workload, same PCIe link (confirmed Gen3 x16 on both), same BAR size. The only difference is the GPU clock state. After a cold boot the R9700s sit in a low clock state and do not ramp for the workload. Restarting the container tears down and recreates the Vulkan context, which triggers the cards to clock up properly and stay there.

This was confirmed against the alternatives. It is not the BAR size, the IOMMU, the PCIe link width, the device enumeration order, the Mesa shader cache, or the warmup setting. The userspace clock controls on the R9700 do not help either: forcing the maximum DPM level with pp_dpm_sclk does not hold the clock up at idle, and the card does not expose pp_od_clk_voltage, so there is no minimum-clock floor to set. The container restart is the only reliable reset.
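
The same clock state can be read straight from sysfs without rocm-smi, which is handy for logging it across a reboot. The sketch below assumes the usual amdgpu format for pp_dpm_sclk, where each line is a level such as "1: 1093Mhz" and an asterisk marks the active one; active_sclk is a hypothetical helper name.

```shell
# Sketch: print the currently active shader clock level from pp_dpm_sclk.
active_sclk() {
    local dpm_file="$1"   # e.g. /sys/class/drm/card0/device/pp_dpm_sclk
    # The line carrying the asterisk is the active level; print its clock field.
    awk '/\*/ { print $2; exit }' "$dpm_file"
}

# Usage: sample both cards once per second while inference is running.
#   while sleep 1; do
#       for card in /sys/class/drm/card[0-9]; do
#           echo "$card $(active_sclk "$card/device/pp_dpm_sclk")" \
#                "$(cat "$card/device/gpu_busy_percent")%"
#       done
#   done
```

Comparing the output straight after boot with the output after a container restart should show the same split between low and high clocks as the rocm-smi readings above.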

The fix

Since the container restart is the proven reset, the clean solution is to do it automatically once per boot, after the GPU stack and the first container start have settled. A oneshot systemd service handles this:

[Unit]
Description=Restart llama container after boot to reset GPU clock state
After=docker.service
Requires=docker.service
RequiresMountsFor=/path/to/your/container

[Service]
Type=oneshot
WorkingDirectory=/path/to/your/container
ExecStartPre=/bin/sleep 60
ExecStart=/usr/bin/docker compose restart
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/llama-warmup.service and enable it:

sudo systemctl daemon-reload
sudo systemctl enable llama-warmup.service

NOTE: Change /path/to/your/container to match your container's path.

How it works:

After=docker.service and Requires=docker.service ensure Docker is up first.
RequiresMountsFor makes the service wait for the filesystem the container lives on, in case it mounts late during boot.
ExecStartPre=/bin/sleep 60 gives the GPU driver and the first (slow) container start time to come up before the reset fires.
docker compose restart from the working directory restarts the existing container without recreating it, which is what resets the GPU clock state.
The 60 second delay is a starting point. If a cold boot still comes up slow, the restart fired before the first container start finished loading. Increase the sleep to 90 or 120 seconds and reboot to test again.
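
An alternative to tuning the sleep by hand is to wait until the first container start has actually finished loading and only then restart. The following is an untested sketch of that idea, not the configuration used on this rig. wait_until is a hypothetical helper, and the usage line assumes the container publishes the llama.cpp server on port 8080 and that its /health endpoint only answers successfully once the model is loaded.

```shell
# Sketch: run a command repeatedly until it succeeds or a timeout expires.
wait_until() {
    local timeout="$1" interval="$2"   # both in seconds
    shift 2
    local waited=0
    until "$@" >/dev/null 2>&1; do
        # Give up once the accumulated wait reaches the timeout.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Usage in a wrapper script called from ExecStart (port and URL are assumptions):
#   wait_until 300 5 curl -fsS http://127.0.0.1:8080/health \
#       && docker compose restart
```

The fixed sleep remains the simpler option and is the one that has been confirmed to work here.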

Kernel parameters

The relevant kernel command line for this rig, set via GRUB_CMDLINE_LINUX in /etc/default/grub:

pcie_aspm=off amd_iommu=off intel_iommu=off pci=realloc pci=nocrs

pcie_aspm=off disables PCIe link power management to avoid link latency. The IOMMU is disabled (intel_iommu=off is the switch that applies on this Intel platform; amd_iommu=off is inert here), and pci=realloc and pci=nocrs let the kernel rebuild the PCI address map rather than trusting the firmware's. With this combination the cards come up cleanly at a 256MB BAR with no allocation errors in the log. None of these parameters are strictly required for inference once the ReBAR experiments are abandoned, but they are kept here as the known-good configuration.
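
Edits to /etc/default/grub only take effect once the GRUB configuration is regenerated and the machine is rebooted, so it is worth checking that the running kernel really received the parameters. A minimal sketch of that check follows; missing_kernel_params is a hypothetical helper that compares a command-line file against the parameters it is given.

```shell
# Sketch: print each expected kernel parameter that is absent from a cmdline file.
missing_kernel_params() {
    local cmdline_file="$1"   # normally /proc/cmdline
    shift
    local cmdline param
    # Pad with spaces so every parameter can be matched as a whole word.
    cmdline=" $(cat "$cmdline_file") "
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;          # present, nothing to report
            *) echo "$param" ;;       # absent from the running kernel
        esac
    done
}

# Usage after a reboot: no output means all parameters are active.
#   missing_kernel_params /proc/cmdline pcie_aspm=off intel_iommu=off pci=realloc pci=nocrs
# To apply changes made in /etc/default/grub on Fedora:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg
# Or set the arguments directly on all installed kernels:
#   sudo grubby --update-kernel=ALL --args="pcie_aspm=off intel_iommu=off pci=realloc pci=nocrs"
```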

Conclusion

The X99-E WS is a 2014-era board, and getting two modern RDNA4 cards running well on it takes work the platform was never designed for. The BIOS flash enables Resizable BAR the board never officially had. ROCm has to come from AMD directly because the distribution lags. And the cards have a cold-boot clock-state quirk that no amount of host-side clock pinning fixes, only a context re-initialisation, which a small restart service automates.