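Hacker News Top Stories (buzzing.cc Chinese translation)

Two-Node AMD Strix Halo RDMA Cluster Setup Guide

2026-06-28 14:50 · 3 hours ago · jakogut

AI Summary

This article explains how to configure a two-node AMD Strix Halo cluster interconnected over RDMA using Intel E810 (RoCE v2) NICs, for distributed vLLM inference with Tensor Parallelism. The hardware consists of two Framework Desktop mainboards (AMD Ryzen AI MAX+ "Strix Halo", 128GB unified memory) and E810-CQDA1 100GbE NICs, connected directly with no switch required. The software stack is based on Fedora 43, using Ray to orchestrate the cluster and RCCL for communication. RDMA latency is about 5µs (versus 70-100µs for TCP/IP). It covers BIOS settings, network configuration, toolbox installation, and detailed steps for starting the cluster.

Original text · untranslated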

kyuz0/amd-strix-halo-vllm-toolboxes (Public)

Donato Capitella

updated benchmarks including thunderbolt and configuratuion guides

b035bcb·Feb 25, 2026

AMD Strix Halo RDMA Cluster Setup Guide

This guide details how to configure a two-node AMD Strix Halo cluster linked via Intel E810 (RoCE v2) for distributed vLLM inference using Tensor Parallelism.

1. TL;DR (Quick Start)

On Both Nodes:

Preparation:
* Install/update Fedora 43 and install the E810 NICs (check firmware: ethtool -i <iface>).
* BIOS/Kernel: Set iGPU memory to 512MB and apply kernel params (iommu=pt, pci=realloc, etc.).
* SSH: Configure passwordless SSH between nodes.
Networking: Assign static IPs (192.168.100.1 and 192.168.100.2), set MTU 9000, and trust the interface in the firewall.
Install Toolbox: Run ./refresh_toolbox.sh (this automatically installs the container with RDMA support and the custom librccl.so patch).
Run Cluster:
* Run start-vllm-cluster.
* Select "2. Start Ray Cluster" (follow the prompts in the TUI).
* Select "4. Launch VLLM Serve" and choose your model. (Export HF_TOKEN first for gated models!)

Key Note: The refresh_toolbox.sh script detects your InfiniBand/RDMA devices and automatically configures the container to expose them.

2. Concepts & Architecture

To fully utilize the Strix Halo cluster, it is helpful to understand the technologies involved:

vLLM: A high-performance inference engine. To run models larger than a single GPU (or APU) can handle, it splits the model using Tensor Parallelism (TP).
Ray: A distributed computing framework. vLLM uses Ray to orchestrate the cluster, manage the "worker" processes on each node, and ensure they start up correctly. Ray handles the control plane (issuing commands).
RCCL (ROCm Collective Communication Library): The AMD equivalent of NVIDIA's NCCL. This library handles the data plane—specifically, the extremely fast synchronization of tensor data between GPUs. When TP=2, the two nodes must exchange partial results after every single layer of the neural network. This happens thousands of times per second.
RoCE v2 (RDMA over Converged Ethernet): The protocol that allows RCCL to write data directly from one node's memory to the other node's memory, bypassing the CPU and OS kernel.
- Without RDMA: Latency is ~70-100µs (TCP/IP overhead).
- With RDMA: Latency is ~5µs.
- Why it matters: For interactive token generation, high latency kills performance. RoCE makes the two nodes feel like a single machine.
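To make the latency argument concrete, here is a rough back-of-the-envelope sketch of the per-token synchronization cost. The layer count (80) and the two synchronizations per layer are illustrative assumptions, not figures from this guide, and only link latency is counted (transfer time is ignored).

```shell
#!/usr/bin/env bash
# Rough per-token communication cost for Tensor Parallelism.
# Assumptions (hypothetical, not from the guide): 80 transformer layers and
# 2 synchronizations per layer; only link latency is counted.
sync_overhead_us() {
  local layers=$1 syncs_per_layer=$2 latency_us=$3
  echo $(( layers * syncs_per_layer * latency_us ))
}

echo "RDMA (~5 us):  $(sync_overhead_us 80 2 5) us per token"
echo "TCP  (~70 us): $(sync_overhead_us 80 2 70) us per token"
```

Under these assumptions the TCP path spends over 11 ms per generated token just waiting on the link, while RDMA spends under 1 ms.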

3. Hardware Prerequisites

Nodes: 2x Framework Desktop Mainboards with AMD Ryzen AI MAX+ "Strix Halo", 128GB of Unified Memory.
Network Cards: Intel Ethernet Controller E810-CQDA1 (or similar 100GbE QSFP28).
Connection: Direct Attach Copper (DAC) cable (e.g., QSFPTEK 100G QSFP28 DAC). No switch required for 2 nodes.
PCIe Note: The Framework mainboard's PCIe slot is physically x4, so a riser is required to plug in an x16 card (e.g., CY PCI-E Express 4x to 16x Extender).
Test Setup Note: One of the boards in this setup has a modified PCIe slot (cut by Framework using an ultrasonic knife) to accept x16 cards directly. This is not recommended for users. Risers are the cheaper, safer, and easier solution. Performance is identical with either approach (~50Gbps bandwidth, ~5µs latency).
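The ~50Gbps figure on a 100GbE card is plausibly explained by the x4 slot rather than by the NIC. A minimal sketch of the arithmetic, assuming the slot runs at PCIe 4.0 speeds (16 GT/s per lane); the guide does not state the PCIe generation, so treat that as an assumption:

```shell
#!/usr/bin/env bash
# Usable PCIe bandwidth in Gbit/s after 128b/130b line encoding (PCIe 3.0 and later).
# Arguments: per-lane rate in GT/s, number of lanes.
pcie_gbps() {
  awk -v rate="$1" -v lanes="$2" 'BEGIN { printf "%.1f\n", rate * lanes * 128 / 130 }'
}

# Assumed: PCIe 4.0 (16 GT/s per lane) on the x4 slot.
pcie_gbps 16 4
```

This gives an upper bound of about 63 Gbit/s before PCIe packet overhead, so measured throughput in the 50-56Gbps range would be in line with an x4 link.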

4. Host Configuration (Fedora)

Perform these steps on the Host OS (Fedora 43) of both nodes.

Tested Host Configuration:

| Node | Kernel | OS | IP (RDMA Interface) |
| --- | --- | --- | --- |
| Node 1 | `6.18.5-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.1/30` |
| Node 2 | `6.18.6-200.fc43.x86_64` | Fedora Linux 43 | `192.168.100.2/30` |

Note: These specific kernel versions were verified to work. Fedora 43 is recommended.

4.1 Install Packages

Install the core RDMA userspace tools. You do not need proprietary Intel drivers; the in-kernel drivers work perfectly.

Ethernet Driver: ice
RDMA Driver: irdma (Unified driver for RoCE v2 & iWARP)

```shell
sudo dnf install rdma-core libibverbs-utils perftest
```

rdma-core: The userspace components for the RDMA subsystem (libraries, daemons, and configuration tools).
libibverbs-utils: Utilities for querying RDMA devices (e.g., ibv_devinfo).
perftest: A suite of benchmarks (e.g., ib_write_bw, ib_send_lat) to verify RDMA bandwidth and latency.
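After installing the packages, it can help to confirm that both in-kernel drivers are actually loaded. The helper below is a small sketch (the function name `missing_modules` is made up for illustration); it reads `lsmod` output on standard input and prints any requested module that is absent.

```shell
#!/usr/bin/env bash
# Reads `lsmod` output on stdin and prints each requested module that is not loaded.
missing_modules() {
  local loaded m
  # Skip the header line and keep only the module-name column.
  loaded=$(awk 'NR > 1 { print $1 }')
  for m in "$@"; do
    grep -qx "$m" <<<"$loaded" || echo "$m"
  done
}

# Usage on a node:
#   lsmod | missing_modules ice irdma
```

No output means both modules are present. `ibv_devinfo` (from libibverbs-utils) can then be used to list the RDMA device itself.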

4.2 Check Native Firmware

Use ethtool to check the current firmware version of your Intel E810 card.

```shell
ethtool -i enp194s0np0
```

Ensure your firmware is at least as new as the version shown in the example output below (firmware-version 4.91). If your firmware is older, update it using the standard Intel tools.

Example Output:

driver: ice
version: 6.18.5-200.fc43.x86_64
firmware-version: 4.91 0x800214b5 1.3909.0
expansion-rom-version: 
bus-info: 0000:c2:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
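The firmware comparison can be scripted. The following sketch parses the first field of the `firmware-version` line and compares it against a minimum using `sort -V` (GNU coreutils). The function name is hypothetical, and it assumes Intel's version numbers order the way `sort -V` orders them.

```shell
#!/usr/bin/env bash
# Reads `ethtool -i` output on stdin. Succeeds if the leading firmware version
# field is >= the minimum passed as $1; returns 2 if no version line is found.
firmware_at_least() {
  local min=$1 current
  current=$(awk '/^firmware-version:/ { print $2 }')
  [ -n "$current" ] || return 2
  # The smaller of the two versions sorts first; it must be the minimum.
  [ "$(printf '%s\n%s\n' "$min" "$current" | sort -V | head -n1)" = "$min" ]
}

# Usage:
#   ethtool -i enp194s0np0 | firmware_at_least 4.91 && echo "firmware OK"
```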

4.3 Network Configuration

This guide assumes a subnet of 192.168.100.0/30.

Identify your interface: Run ip link to find your 100GbE card (e.g., enp194s0np0).

Node 1 (Head - 192.168.100.1):

```shell
# Bring link up
sudo ip link set enp194s0np0 up

# Assign IP
sudo ip addr add 192.168.100.1/30 dev enp194s0np0

# Set MTU (Jumbo Frames)
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"
```

Node 2 (Worker - 192.168.100.2):

```shell
# Bring link up
sudo ip link set enp194s0np0 up

# Assign IP
sudo ip addr add 192.168.100.2/30 dev enp194s0np0

# Set MTU
sudo nmcli connection modify "rdma0" ethernet.mtu 9000
sudo nmcli connection up "rdma0"
```

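Verify Routing: Ensure the route exists on both nodes. The kernel normally installs it when the address is assigned, so add it only if it is missing:

```shell
# Add the route only if it is not already present
ip route show | grep -q '^192.168.100.0/30 ' || sudo ip route add 192.168.100.0/30 dev enp194s0np0
```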

Verify Link:

```shell
rdma link
# Output should show: state ACTIVE physical_state LINK_UP used_usec X ...
```
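Once both nodes are configured, a quick way to confirm that jumbo frames really pass end to end is a non-fragmentable ping sized to fill the MTU. This is a sketch using standard `ping` options from iputils; the helper names are made up for illustration.

```shell
#!/usr/bin/env bash
# Largest ICMP payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload() {
  echo $(( $1 - 20 - 8 ))
}

# Sends 3 pings with the "don't fragment" bit set, sized to fill the MTU exactly.
check_jumbo() {
  local peer=$1 mtu=${2:-9000}
  ping -M do -c 3 -s "$(icmp_payload "$mtu")" "$peer"
}

# Usage from Node 1:
#   check_jumbo 192.168.100.2
```

If the MTU is not 9000 on both ends, the ping fails instead of silently fragmenting.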

4.4 BIOS & Kernel Configuration

1. BIOS Settings: Set the iGPU Memory Allocation to the minimum possible (512MB). We will use the GTT (Graphics Translation Table) to dynamically allocate system memory as "Unified Memory" for the GPU.

2. Kernel Parameters: Update GRUB to enable unified memory, optimize RDMA performance, and fix PCI resource allocation.

Edit /etc/default/grub and append to GRUB_CMDLINE_LINUX:

iommu=pt pci=realloc pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856

Explanation of Parameters:

iommu=pt: Sets IOMMU to "Pass-Through" mode. This is critical for performance, reducing overhead for both the RDMA NIC and the iGPU unified memory access.
pci=realloc: Reallocates PCI BARs. Often needed on consumer platforms to properly map large address spaces for devices like the E810 or Strix Halo.
pcie_aspm=off: Disables PCIe Active State Power Management. Prevents latency spikes and link negotiation issues on the 100GbE connection.
amdgpu.gttsize=126976: Caps the GPU GTT size at 124GiB (126976MiB). This defines how much system RAM the GPU can address as its own "VRAM".
ttm.pages_limit=32505856: Limits the Translation Table Manager to 124GiB (32505856 pages of 4KiB each), matching the GTT size.
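The two memory parameters are derived from the same target size, so they can be recomputed if you want a different cap. A minimal sketch (the function name `gtt_params` is hypothetical), assuming 4KiB pages as stated above:

```shell
#!/usr/bin/env bash
# Prints the two memory-related kernel parameters for a GTT cap given in GiB.
gtt_params() {
  local gib=$1
  local mib=$(( gib * 1024 ))          # amdgpu.gttsize is expressed in MiB
  local pages=$(( mib * 1024 / 4 ))    # MiB -> KiB -> number of 4 KiB pages
  echo "amdgpu.gttsize=${mib} ttm.pages_limit=${pages}"
}

gtt_params 124
```

For 124GiB this reproduces the values used in this guide.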

3. Apply Changes:

```shell
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
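After the reboot, the running kernel's command line is available in /proc/cmdline, which makes it easy to confirm the parameters took effect. The helper below is an illustrative sketch (the function name is made up); it prints every expected parameter that is absent.

```shell
#!/usr/bin/env bash
# Prints each expected parameter that does not appear as a whole word
# in the given kernel command line.
missing_kernel_params() {
  local cmdline=$1 p
  shift
  for p in "$@"; do
    case " $cmdline " in
      *" $p "*) ;;          # present
      *) echo "$p" ;;       # absent
    esac
  done
}

# Usage:
#   missing_kernel_params "$(cat /proc/cmdline)" iommu=pt pci=realloc \
#     pcie_aspm=off amdgpu.gttsize=126976 ttm.pages_limit=32505856
```

No output means all parameters are active.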

4.5 Firewall Rules

Applications like Ray and NCCL use random high ports. It is easiest to trust the internal RDMA interface completely.

```shell
# Assign the interface to the trusted zone permanently
sudo firewall-cmd --permanent --zone=trusted --add-interface=enp194s0np0

# Reload firewall
sudo firewall-cmd --reload
```

5. Toolbox Installation & Network Verification

5.1 Prerequisites: Passwordless SSH

The cluster management and verification scripts rely on SSH to execute commands on remote nodes. You must configure passwordless SSH between both nodes (root or sudo-enabled user).

Guide: How to Set Up SSH Keys on Linux (DigitalOcean)
Quick Check: Run ssh <other-node-ip> date from each node. It should print the date without asking for a password.
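For reference, a typical key setup uses the standard OpenSSH tools. The sketch below shows the interactive steps as comments and wraps the quick check in a non-interactive helper (the function name is hypothetical); `BatchMode=yes` makes `ssh` fail instead of prompting when a password would be required.

```shell
#!/usr/bin/env bash
# One-time setup, run on each node (interactive, so shown as comments):
#   ssh-keygen -t ed25519
#   ssh-copy-id <user>@<other-node-ip>

# Succeeds only if the remote command runs without any password prompt.
check_passwordless_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" date
}

# Usage from Node 1:
#   check_passwordless_ssh 192.168.100.2 && echo "passwordless SSH works"
```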

5.2 Installation

The toolbox container provided in this repo includes a critical patch: a custom-built librccl.so that enables gfx1151 (Strix Halo) support for RDMA (https://github.com/kyuz0/rocm-systems/tree/gfx1151-rccl), which is currently missing in upstream ROCm packages. This library is automatically compiled using the build-rccl GitHub Action in this repository, which generates the artifact that is then bundled into the Docker container.

To install the toolbox on both nodes, run:

```shell
./refresh_toolbox.sh
```

What this does:

Pulls the latest kyuz0/vllm-therock-gfx1151 image.
Detects if /dev/infiniband exists on your host.
Creates the toolbox with flags to expose:
* iGPU Access: /dev/dri, /dev/kfd (Required for ROCm)
* RDMA Access: /dev/infiniband, --group-add rdma
* Memory Pinning: --ulimit memlock=-1 (Required for DMA)
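If the toolbox comes up without GPU or RDMA access, a first check is whether the device nodes listed above exist on the host. This is a small illustrative helper (not part of the repository's scripts):

```shell
#!/usr/bin/env bash
# Prints each given path that does not exist on this machine.
missing_paths() {
  local p
  for p in "$@"; do
    [ -e "$p" ] || echo "$p"
  done
}

# Usage on the host (and again inside the toolbox):
#   missing_paths /dev/dri /dev/kfd /dev/infiniband
```

Inside the toolbox, `ulimit -l` reporting `unlimited` indicates that the memlock setting was applied.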

5.3 Verify RDMA Connection

Before proceeding to run the cluster, verify that RDMA is active and providing low latency (~5µs vs ~70µs for Ethernet).

Run the provided verification script from the Head Node:

```shell
# Inside toolbox
/opt/compare_eth_vs_rdma.sh
```

Expected Results:

Path                 Latency      Bandwidth   
------------------------------------------------
Ethernet (1G LAN)    0.074 ms     0.94 Gbps   
Ethernet (RoCE NIC)  0.068 ms     55.70 Gbps  
RDMA (RoCE)          5.23 us      50.64 Gbps

Note the large latency drop for RDMA: from about 70µs over TCP (0.068-0.074 ms) to about 5µs, roughly a 13x reduction.

6. Running the Cluster

A TUI utility, start-vllm-cluster, is provided to manage the Ray cluster and vLLM.

6.1 Setup & Verify

Enter the toolbox:

```shell
toolbox enter vllm
```

Run the Cluster Manager:

```shell
start-vllm-cluster
```

Configure IPs (Option 1):
* Ensure Head is 192.168.100.1 and Worker is 192.168.100.2.
Start Ray Cluster (Option 2):
* On Node 1: Select "Head" when prompted.
* On Node 2: Select "Worker" when prompted.
* The script effectively runs:

```shell
# Head
export NCCL_SOCKET_IFNAME=
ray start --head --node-ip-address=192.168.100.1 ...

# Worker
ray start --address=192.168.100.1:6379 ...
```

Check Status (Option 3):
* Ensure you see 2 nodes and adequate GPU resources (e.g., 2.0 GPU).

6.2 Launching vLLM

Once the cluster is active (checked via Option 3):

Select "4. Launch VLLM Serve" in the TUI.
Choose a model (e.g., Meta-Llama-3.1-8B-Instruct).
Configuration Menu:
* Tensor Parallelism: Set to 2 (one GPU per node).
* Context Length: Auto or custom (e.g., 131072).
* Erase vLLM Cache: Select YES if you are restarting after a crash.
* Force Eager Mode: Select YES.
  - Why? CUDA Graphs can be unstable on distributed APU clusters and cause deadlocks. Eager mode is safer, though disabling it may yield 1-3% more performance if you accept the risk.
Launch: Select "LAUNCH SERVER".
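Once the server is up, it can be smoke-tested over vLLM's OpenAI-compatible HTTP API. The sketch below assumes the server listens on the head node at vLLM's default port 8000; the port the TUI actually uses is not stated in this guide, so adjust it if needed. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Builds an OpenAI-compatible endpoint URL: host, port (default 8000), path under /v1.
vllm_url() {
  echo "http://${1}:${2:-8000}/v1/${3}"
}

# Lists the models the server is currently serving.
list_models() {
  curl -s "$(vllm_url "$1" "${2:-8000}" models)"
}

# Usage (assumed host and port):
#   list_models 192.168.100.1
```

A JSON response naming the chosen model indicates that both nodes loaded their shard and the server is accepting requests.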

Important Gotchas:

First Run Download: When running a model for the first time, each node in the cluster must download the weights independently. This may take some time depending on your internet connection.
Gated Models (e.g., Gemma):
- Models like google/gemma-2-27b-it are "gated" and require you to request access on Hugging Face.
- You must export your Hugging Face token before running the cluster script:

```shell
export HF_TOKEN=your_token_here
start-vllm-cluster
```

- If you don't provide a token or haven't accepted the license on Hugging Face, the download will fail.
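Because a missing token only surfaces as a failed download later, a small guard in front of the launcher can catch it early. This is an illustrative wrapper, not part of the repository's scripts:

```shell
#!/usr/bin/env bash
# Fails with a message if HF_TOKEN is unset or empty in the current environment.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; gated model downloads will fail" >&2
    return 1
  fi
}

# Usage:
#   require_hf_token && start-vllm-cluster
```

The check only confirms that the variable is set; it cannot tell whether the token is valid or whether the model license was accepted.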

7. Troubleshooting

vLLM Deadlocks / Hangs

Cause: CUDA Graph capture can freeze on distributed APU nodes.
Fix: Enable "Force Eager Mode" in the start menu.

Firmware

If you see link issues, ensure your Intel E810 firmware is up to date using the Intel standard tools.

8. References & Acknowledgements

Reddit - Strix Halo Batching with Tensor Parallel: Thread by Hungry_Elk_3276
- Special thanks to user Hungry_Elk_3276 for their initial experiments with vLLM RDMA, which highlighted the missing gfx1151 support in upstream RCCL.

9. Alternative: Thunderbolt Networking

If you do not have dedicated 100GbE RDMA network cards, you can directly connect the two nodes using a high-quality Thunderbolt 4 / USB4 cable. This will create a thunderbolt0 network interface.

While it lacks the microsecond-level latency of RDMA, it provides significantly more bandwidth than standard 1GbE/5GbE Ethernet and is easier to configure.

Note: thunderbolt-net relies on standard OS kernel TCP/IP stacks.

9.1 Thunderbolt Configuration

1. Establish Connection: Connect the nodes directly using a certified Thunderbolt 4 or USB4 cable. Verify the link is active:

```shell
ip link show thunderbolt0
```

2. Network Configuration (Head - Node 1): Configure a persistent connection using nmcli with a static IP and Jumbo Frames (reduces CPU overhead). Note: Jumbo Frames may be unsupported on some Thunderbolt host controllers.

```shell
sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.1/24 mtu 9000
sudo nmcli connection up thunderbolt0
```

3. Network Configuration (Worker - Node 2):

```shell
sudo nmcli connection add type ethernet ifname thunderbolt0 con-name thunderbolt0 ipv4.method manual ipv4.addresses 192.168.2.2/24 mtu 9000
sudo nmcli connection up thunderbolt0
```

4. Firewall Rules: To ensure Ray and NCCL can communicate freely over this link:

```shell
# Assign the interface to the trusted zone permanently
sudo firewall-cmd --permanent --zone=trusted --add-interface=thunderbolt0
sudo firewall-cmd --reload
```

9.2 Running vLLM over Thunderbolt

Our cluster scripts dynamically detect the network interface based on the provided IPs. There is no need to manually export environment variables!

Open the Toolbox: toolbox enter vllm
Launch the cluster manager: start-vllm-cluster
Select Option 1 (Configure IPs).
Set the Head IP explicitly to 192.168.2.1 and the Worker IP to 192.168.2.2.
Start the cluster normally (Option 2). The script will automatically discover and utilize thunderbolt0 as the backend network for Ray orchestration and GPU synchronization.

9.3 Validating the Link

I have added Thunderbolt support to the compare_eth_vs_rdma.sh script. Run it from inside the toolbox to see the latency and bandwidth of your Thunderbolt link compared to your other network interfaces.

You can use the -t flag to ONLY benchmark the Thunderbolt connection (or -e, -r, -i for the others):

```shell
/opt/compare_eth_vs_rdma.sh -t
```
