Introduction

Following my previous guides on vLLM setup and SGLang with MiniMax M2, this guide covers a very different kind of hardware: the AMD Ryzen AI Max+ 395 β€” codenamed β€œStrix Halo.”

Unlike discrete NVIDIA GPUs, this is an APU with up to 128GB of unified memory, meaning the integrated Radeon 8060S GPU can access a massive memory pool. This makes it possible to run large models locally without needing a dedicated GPU with massive VRAM. With the right kernel parameters, you can allocate up to 124GB of unified memory to the GPU.

In this guide, I’ll walk through setting up vLLM on Fedora using the amd-strix-halo-vllm-toolboxes project by kyuz0, which provides a pre-configured Toolbx container built on TheRock nightly ROCm builds for the gfx1151 architecture.

Why Strix Halo for LLMs?

The Ryzen AI Max+ 395 is a unique chip:

Specification     Value
CPU Cores         16 Zen 5 cores, up to 5.1 GHz
GPU               Radeon 8060S (40 RDNA 3.5 CUs, 2560 shaders)
NPU               XDNA 2 (50 TOPS)
Memory            Up to 128GB LPDDR5X-8000 unified
GPU Allocatable   Up to ~124GB via kernel parameters
Architecture      gfx1151 (Strix Halo)

The key advantage is unified memory: unlike discrete GPUs with fixed VRAM, the GPU shares system RAM. With 128GB total, you can allocate 124GB to the GPU and still have 4GB reserved for the OS. This is enough to run models that would otherwise require multiple high-end GPUs.


Prerequisites

Hardware

  • AMD Ryzen AI Max+ 395 (Strix Halo) machine
  • 128GB RAM (LPDDR5X-8000)

Software

  • Fedora 42/43 with the toolbox package
  • Podman (ships with Fedora)
  • SSH access to the machine

Step 1: Configure Kernel Parameters

The most critical step is configuring the kernel to allocate enough unified memory to the GPU. By default, the GPU only gets a small fraction of system RAM. These parameters tell the AMD GPU driver to reserve up to 124GB.

Edit /etc/default/grub and add these parameters to GRUB_CMDLINE_LINUX:

iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856
Parameter                  Purpose
iommu=pt                   Sets the IOMMU to pass-through mode, reducing overhead for iGPU unified memory access
amdgpu.gttsize=126976      Caps GPU unified memory at 124 GiB (126976 MiB ÷ 1024 = 124 GiB)
ttm.pages_limit=32505856   Caps pinned memory at 124 GiB (32505856 × 4 KiB = 126976 MiB = 124 GiB)
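The arithmetic behind those two values is easy to verify: amdgpu.gttsize is given in MiB, and ttm.pages_limit in 4 KiB pages.

```shell
# Derive both kernel parameter values from the 124 GiB target
gtt_mib=$((124 * 1024))        # amdgpu.gttsize is in MiB -> 126976
pages=$((gtt_mib * 1024 / 4))  # ttm.pages_limit is in 4 KiB pages -> 32505856
echo "amdgpu.gttsize=${gtt_mib} ttm.pages_limit=${pages}"
```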

Apply and reboot:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

Source: These parameters come from community findings on Reddit r/LocalLLaMA and are documented in the amd-strix-halo-vllm-toolboxes repository.


Step 2: Connect and Update

SSH into your Strix Halo machine:

ssh-copy-id johannes@mini-ai
ssh johannes@mini-ai

Update the system and install prerequisites:

sudo dnf update
sudo dnf install git
sudo dnf install toolbox

Step 3: Clone the Toolbox Repository

The amd-strix-halo-vllm-toolboxes project provides a pre-built Fedora Toolbx container with vLLM and ROCm pre-installed, built on TheRock nightly builds:

git clone https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes.git
cd amd-strix-halo-vllm-toolboxes/

What is Toolbx?

Toolbx is Fedora's tool for containerized CLI environments. A toolbox shares your HOME directory and user identity with the host, so models and configs persist outside the container. This means:

  • Downloaded models stay in ~/.cache/huggingface on the host
  • No need to manage volumes or mounts
  • Easy to iterate while keeping the host system clean
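As an illustration of the shared HOME, once the toolbox exists (Step 4) you can list the Hugging Face cache through it from the host (the directory only appears after a first model download):

```
toolbox run -c vllm ls ~/.cache/huggingface
```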

What is TheRock?

TheRock is AMD’s open-source project providing nightly builds of the ROCm stack. The toolbox uses TheRock builds specifically compiled for the gfx1151 (Strix Halo) architecture, ensuring you get the latest ROCm support without waiting for official releases.


Step 4: Create the vLLM Toolbox

Run the included script to pull the latest image and create the toolbox:

./refresh_toolbox.sh

You should see output similar to:

ℹ️  No InfiniBand devices detected.
πŸ”„ Refreshing vllm (image: docker.io/kyuz0/vllm-therock-gfx1151:latest)
⬇️ Pulling latest image: docker.io/kyuz0/vllm-therock-gfx1151:latest
Trying to pull docker.io/kyuz0/vllm-therock-gfx1151:latest...
Getting image source signatures
Copying blob 08b1c086214c done   |
Copying blob 4fe0fbde88ef done   |
...
Copying config 86d3955926 done   |
Writing manifest to image destination
86d39559264a71e316215bbfce11b14e027ebe923185873220476f766235eca4
πŸ“¦ Recreating toolbox: vllm
   Options: --device /dev/dri --device /dev/kfd --group-add video --group-add render --security-opt seccomp=unconfined
Created container: vllm
Enter with: toolbox enter vllm
βœ… vllm refreshed

What the Script Does

Action                                                     Purpose
Pulls kyuz0/vllm-therock-gfx1151:latest                    Gets the latest vLLM + ROCm container for gfx1151
Creates toolbox with --device /dev/dri --device /dev/kfd   Passes GPU devices into the container
--group-add video --group-add render                       Grants GPU access permissions
--security-opt seccomp=unconfined                          Relaxes security confinement for ROCm compute operations
Auto-detects InfiniBand                                    If IB devices are present, exposes them for RDMA clustering

Manual Creation: If you prefer to create the toolbox manually:

toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined

Step 5: Enter the Toolbox

toolbox enter vllm

You’ll be greeted with the vLLM splash screen:

β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•—β–ˆβ–ˆβ•—  β–ˆβ–ˆβ•—      β–ˆβ–ˆβ•—  β–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•—      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—
β–ˆβ–ˆβ•”β•β•β•β•β•β•šβ•β•β–ˆβ–ˆβ•”β•β•β•β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘β•šβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•”β•      β–ˆβ–ˆβ•‘  β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•”β•β•β•β–ˆβ–ˆβ•—
β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—   β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘ β•šβ–ˆβ–ˆβ–ˆβ•”β•       β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘
β•šβ•β•β•β•β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β–ˆβ–ˆβ•—       β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘
β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘  β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β• β–ˆβ–ˆβ•—      β–ˆβ–ˆβ•‘  β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘  β–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β•šβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•
β•šβ•β•β•β•β•β•β•   β•šβ•β•   β•šβ•β•  β•šβ•β•β•šβ•β•β•šβ•β•  β•šβ•β•      β•šβ•β•  β•šβ•β•β•šβ•β•  β•šβ•β•β•šβ•β•β•β•β•β•β• β•šβ•β•β•β•β•β•

                               v L L M

AMD STRIX HALO β€” vLLM Toolbox (gfx1151, ROCm via TheRock)
ROCm nightly: 7.13.60800

Machine: Default string H1
GPU    : AMD RYZEN AI MAX+ 395 w/ Radeon 8060S

Repo   : https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes
Image  : docker.io/kyuz0/vllm-therock-gfx1151:latest

Included:
  - start-vllm (TUI) β†’ Interactive launcher: Model select, Multi-GPU & Cache handling
  - start-vllm-cluster β†’ Cluster launcher: Setup Ray Head/Worker & Launch vLLM RCCL
  - vllm-cluster-bench β†’ Cluster Benchmark: TP=2, Auto-detected Env, JSON Results
  - vLLM server      β†’ vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
  - API test         β†’ curl localhost:8000/v1/chat/completions

SSH tip: ssh -L 8000:localhost:8000 user@host

Included Tools

Tool                 Description
start-vllm           Interactive TUI wizard for launching models with pre-configured options
start-vllm-cluster   Cluster launcher for multi-node setups with Ray and RCCL
vllm-cluster-bench   Automated cluster benchmarking with JSON results

Step 6: Serve a Model

Option A: Using the TUI Wizard (Easiest)

The toolbox includes a TUI wizard called start-vllm with pre-configured models and flags:

start-vllm

This presents an interactive menu for model selection, cache handling, and launch configuration.

Option B: Manual Launch

For full control, launch vLLM directly. Here I’m serving Qwen3.6-35B-A3B β€” a Mixture-of-Experts model with 35B total parameters but only 3B active per forward pass, making it extremely efficient for the unified memory architecture:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-num-seqs 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --dtype auto \
  --trust-remote-code \
  --attention-backend TRITON_ATTN

Launch Parameters Explained

Parameter                          Description
--host 0.0.0.0                     Listen on all interfaces for remote access
--port 8000                        Standard OpenAI-compatible API port
--tensor-parallel-size 1           Single GPU (unified memory, no TP needed)
--max-num-seqs 1                   One concurrent request (optimal for iGPU)
--max-model-len 32768              32K context window
--gpu-memory-utilization 0.95      Use 95% of GPU memory for model + KV cache
--dtype auto                       Auto-detect best dtype for the model
--trust-remote-code                Required for Qwen models with custom code
--attention-backend TRITON_ATTN    Use Triton attention backend (optimized for ROCm)

Why Qwen3.6-35B-A3B?

Feature             Detail
Architecture        Mixture-of-Experts (MoE)
Total Parameters    35B
Active Parameters   3B per forward pass
Context Length      Up to 262,144 tokens native
Model Size          ~24GB (BF16)
Capabilities        Multimodal (vision + language), reasoning, coding, agents

The MoE architecture is ideal for the Strix Halo: the full 35B model fits comfortably in unified memory, but only 3B parameters are active during inference. This means you get near-large-model quality at small-model inference speed.
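A back-of-envelope budget shows why it fits comfortably. The 124 GiB and ~24 GB figures come from the tables above; the split below is a rough estimate that ignores activations and runtime overhead:

```shell
# Rough memory budget for the launch above (estimates, not measurements)
gpu_mem_gib=124    # GPU-allocatable unified memory
util_pct=95        # --gpu-memory-utilization 0.95
weights_gib=24     # ~24 GB of BF16 weights
budget_gib=$((gpu_mem_gib * util_pct / 100))  # what vLLM is allowed to use
kv_gib=$((budget_gib - weights_gib))          # roughly what remains for KV cache
echo "vLLM budget: ${budget_gib} GiB, ~${kv_gib} GiB left for KV cache"
```

With ~93 GiB of headroom for KV cache, a 32K context at --max-num-seqs 1 is nowhere near the limit.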


Step 7: Test the API

Once the server is running, test it with a curl request:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [
      {"role": "user", "content": "Hello! Test the performance."}
    ]
  }'

Remote Access via SSH

If the Strix Halo machine is remote, forward port 8000 via SSH:

ssh -L 8000:localhost:8000 johannes@mini-ai

Then access the API at http://localhost:8000 from your local machine.
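With the tunnel up, you can also try token-by-token streaming; the stream flag is part of the OpenAI-compatible API that vLLM exposes:

```
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "stream": true,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

The -N flag disables curl's output buffering so chunks appear as they arrive.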

Auto-detect the Active Model

If you don’t want to specify the model name:

MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]
  }"

Step 8: Add a Web UI (Optional)

For a chat interface, use HuggingFace ChatUI with port forwarding:

# On your local machine (with SSH tunnel active)
docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v chat-ui-data:/data \
  ghcr.io/huggingface/chat-ui-db

Then open http://localhost:3000 in your browser.


Tested Models

The amd-strix-halo-vllm-toolboxes project maintains a benchmark table of tested models. Here are some highlights:

Model                                    TP   Max Context   GPU Util
meta-llama/Meta-Llama-3.1-8B-Instruct    1    128K          0.95
google/gemma-3-12b-it                    1    128K          0.95
Qwen/Qwen3-14B-AWQ                       1    40K           0.95
Qwen/Qwen3.6-35B-A3B                     1    32K+          0.95
openai/gpt-oss-120b                      1    128K          0.95

Full benchmarks are available at kyuz0.github.io/amd-strix-halo-vllm-toolboxes.


Distributed Clustering (RDMA/RoCE)

The toolbox also supports clustering multiple Strix Halo nodes using InfiniBand or RoCE v2 (e.g., Intel E810). This enables Tensor Parallelism across machines with extremely low latency (~5Β΅s), effectively creating a single 256GB unified memory GPU from two nodes.

Key features:

  • Custom-built librccl.so for RDMA on gfx1151
  • refresh_toolbox.sh auto-detects and exposes RDMA devices
  • start-vllm-cluster TUI for managing Ray and vLLM

See the RDMA Cluster Setup Guide in the repository for details.


Troubleshooting

GPU Not Detected

Verify the GPU is visible inside the toolbox:

rocm-smi

If it fails, check:

  1. Kernel parameters are applied: grep amdgpu.gttsize /proc/cmdline
  2. The toolbox was created with --device /dev/dri --device /dev/kfd
  3. Your user is in the video and render groups
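Those checks can be scripted as a quick host-side sanity pass. This is an illustrative sketch (it covers the kernel parameters, the /dev/kfd device node, and render group membership; it is safe to run on any host):

```shell
# Each check yields "OK" or "missing"; the summary prints all three
kparams=$(grep -q amdgpu.gttsize /proc/cmdline 2>/dev/null && echo OK || echo missing)
kfd=$([ -e /dev/kfd ] && echo OK || echo missing)
rgrp=$(id -nG | grep -qw render && echo OK || echo missing)
printf 'kernel params: %s\n/dev/kfd: %s\nrender group: %s\n' "$kparams" "$kfd" "$rgrp"
```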

Out of Memory

If vLLM crashes with OOM:

  • Reduce --gpu-memory-utilization to 0.90 or 0.85
  • Reduce --max-model-len to 16384 or 8192
  • Verify kernel parameters: amdgpu.gttsize=126976 ttm.pages_limit=32505856
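Applying the first two reductions, a more conservative launch might look like this (same flags as Step 6, only the memory-related values lowered):

```
vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-num-seqs 1 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --dtype auto \
  --trust-remote-code \
  --attention-backend TRITON_ATTN
```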

Slow First Inference

The first inference is slow because vLLM compiles Triton kernels. Compiled kernels are cached in ~/.cache/vllm/ and subsequent runs will be faster.


Summary

Here’s the complete setup:

# 1. Configure kernel parameters
# Edit /etc/default/grub, add to GRUB_CMDLINE_LINUX:
#   iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# 2. Install prerequisites
sudo dnf update
sudo dnf install git toolbox

# 3. Clone and create toolbox
git clone https://github.com/kyuz0/amd-strix-halo-vllm-toolboxes.git
cd amd-strix-halo-vllm-toolboxes/
./refresh_toolbox.sh

# 4. Enter toolbox and serve
toolbox enter vllm
vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --max-num-seqs 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --dtype auto \
  --trust-remote-code \
  --attention-backend TRITON_ATTN

The AMD Ryzen AI Max+ 395 with unified memory is a compelling platform for local LLM inference. With 124GB of GPU-accessible memory, you can run models that would otherwise require expensive multi-GPU setups β€” all in a single machine.

Key takeaways:

  • Configure kernel parameters to unlock unified memory for the GPU
  • Use the Strix Halo toolbox for a pre-configured vLLM + ROCm environment
  • MoE models like Qwen3.6-35B-A3B are ideal β€” large total params, small active params
  • Use TRITON_ATTN backend for best ROCm performance
  • For remote access, use SSH port forwarding

Happy serving!

