Streaming-MoE Stack for AMD — From MI300X to RX 9070 XT

The problem

Frontier MoE models are gated by VRAM. The only hardware that fits the weights costs $15K or more, well outside consumer budgets.

Model	Q4 size	Fits in
Mixtral 8×22B	80 GB	RTX PRO 6000 (96 GB GDDR7)
Qwen3.5-397B-A17B	227 GiB	2× MI300X ($15K each)
DeepSeek-V3 671B	377 GiB	2× MI300X

A $700 RX 9070 XT has 16 GB GDDR6. The full model doesn't fit. But the active weights (K of N experts per layer) are a small fraction of the total. That's the opening.

The thesis

Stream the experts from NVMe. Each token activates only K of the several hundred experts in each layer; load just those K from the SSD on demand.
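For a rough sense of scale, the expert counts and packed sizes reported in the Numbers section can be turned into bytes touched per token. This is a back-of-envelope sketch that assumes every expert in a model has the same size, so the touched share is simply K / N; the helper name is illustrative.

```python
# Back-of-envelope: expert bytes touched per token when K of N experts fire per layer.
# Assumption: all experts are the same size, so the touched share is K / N.

def active_expert_gib(total_expert_gib: float, k: int, n: int) -> float:
    """Expert weights read for one token, in GiB."""
    return total_expert_gib * k / n

models = {
    # name: (packed expert GiB, K active, N experts per layer), from this document's tables
    "Qwen3.5-397B-A17B": (218.0, 10, 512),
    "DeepSeek-V3 671B": (367.0, 8, 256),
}

for name, (gib, k, n) in models.items():
    share = 100.0 * k / n
    print(f"{name}: ~{active_expert_gib(gib, k, n):.1f} GiB per token ({share:.1f}% of experts)")
```

Under that assumption a token touches about 2% (Qwen) or 3% (DSV3) of the packed experts, which is the slice the cache tiers have to serve.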

The key move is putting the OS page cache between VRAM and SSD as a third memory tier:

Tier	Where	Bandwidth	Capacity	Managed by
1	VRAM (LRU expert cache)	HBM / GDDR (~0.6–5 TB/s)	`--moe-cache-mb N`	fork (explicit)
2	OS page cache (host RAM)	DDR5 (~80 GB/s)	whatever's free	kernel (free)
3	NVMe SSD (per-expert files)	5–17 GB/s	terabytes	filesystem

On RX 9070 XT, this makes streaming nearly invisible: a 13% gap between cold-cache and warm-cache steady state. The rest of the gap to a workstation GPU is memory bandwidth, not streaming overhead. NVMe is on the order of 100× cheaper per GB than DRAM, and the same scheme scales to the full 671B-parameter frontier on a single MI300X.
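Why the middle tier matters can be seen from how bandwidths blend. A minimal sketch, assuming expert bytes are split across the three tiers in fixed fractions; the fractions and the specific bandwidth picks below are hypothetical, chosen inside the ranges of the tier table.

```python
# Effective read bandwidth when expert fetches are split across tiers.
# Time per byte adds up tier by tier, so the blend is harmonic, not arithmetic.

def effective_bandwidth(fractions_and_gbps):
    """fractions_and_gbps: list of (fraction_of_bytes, bandwidth_in_GB_per_s)."""
    total = sum(f for f, _ in fractions_and_gbps)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    seconds_per_gb = sum(f / bw for f, bw in fractions_and_gbps)
    return 1.0 / seconds_per_gb

# Hypothetical split: 80% VRAM hits, 15% page-cache hits, 5% cold NVMe reads.
mix = [(0.80, 1000.0), (0.15, 80.0), (0.05, 10.0)]
print(f"{effective_bandwidth(mix):.0f} GB/s effective")

# Same misses, but with no page cache: everything that misses VRAM goes to NVMe.
no_page_cache = [(0.80, 1000.0), (0.20, 10.0)]
print(f"{effective_bandwidth(no_page_cache):.0f} GB/s without the page-cache tier")
```

The slowest tier dominates even at a small share, so every miss the page cache absorbs before it reaches the SSD has an outsized effect.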

The stack

Four layers, all MIT-licensed.

Layer 1 — Kernel primitive (open upstream as ROCm/aiter#3034)

A scattered-pointer fused MoE matvec for GGUF Q4_K_M. Stock AITER assumes all experts live in one contiguous (E, N, K) device buffer — true when all experts fit in VRAM. Streaming MoE has K active experts in K different LRU cache slots fed from SSD. The new kernel takes per-expert pointers and a token-to-slot remap directly, no per-dispatch slab copy.
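The addressing idea can be modeled in a few lines. This is a pure-Python sketch of the pointer-plus-remap scheme only, not the Triton kernel: it uses unquantized floats, and the function and variable names are illustrative.

```python
# Pure-Python model of scattered-pointer expert dispatch.
# Each cache slot holds one expert's weight matrix; slots are neither
# contiguous nor ordered by expert id.

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def moe_matvec_scattered(slots, slot_of_expert, routed, x):
    """slots: list of weight matrices, one per cache slot.
    slot_of_expert: dict expert_id -> slot index (the remap).
    routed: list of (expert_id, router_weight) chosen for this token.
    Returns the router-weighted sum of expert outputs, reading each expert in place."""
    out = [0.0] * len(slots[0])
    for expert_id, weight in routed:
        w = slots[slot_of_expert[expert_id]]  # follow the pointer, no copy into a slab
        for i, y in enumerate(matvec(w, x)):
            out[i] += weight * y
    return out

# Three slots currently holding experts 7, 2 and 5 (hypothetical cache state).
slots = [
    [[1.0, 0.0], [0.0, 1.0]],   # slot 0 holds expert 7
    [[2.0, 0.0], [0.0, 2.0]],   # slot 1 holds expert 2
    [[0.0, 1.0], [1.0, 0.0]],   # slot 2 holds expert 5
]
slot_of_expert = {7: 0, 2: 1, 5: 2}
result = moe_matvec_scattered(slots, slot_of_expert, [(2, 0.75), (5, 0.25)], [1.0, 3.0])
print(result)
```

The contiguous layout would first have to gather experts 2 and 5 into one buffer; here the only per-dispatch work is the dictionary lookup.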

Same Triton source compiles and runs unchanged on four architectures. 18/18 correctness tests pass on all four.

Architecture	GPU	PR1 peak (GB/s)	PR2 peak, q4_flat (GB/s)
CDNA 3	MI300X	695	525
Blackwell	RTX PRO 6000	529	333
RDNA 4	RX 9070 XT	270	187
RDNA 3.5 (gfx1151)	Strix Halo APU	121	77

Layer 2 — Format diversification + activation fusion

A companion kernel that adds a simpler 4-bit format ("q4_flat") and fuses gate + up + SwiGLU + down into a single dispatch. It reads ~2× the bytes per dispatch in only ~1.5× the wall time of the Layer 1 matvec: the shared load of x and the single launch pay off consistently.
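For reference, the math being fused is the standard SwiGLU expert feed-forward. The sketch below is an unfused, unquantized pure-Python version intended only as a correctness reference; it does not reflect the q4_flat layout or the kernel's tiling.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def expert_ffn(w_gate, w_up, w_down, x):
    """One expert: down( silu(gate @ x) * (up @ x) )."""
    g = matvec(w_gate, x)                        # gate projection
    u = matvec(w_up, x)                          # up projection, same input x
    h = [silu(gi) * ui for gi, ui in zip(g, u)]  # SwiGLU activation
    return matvec(w_down, h)                     # down projection

identity = [[1.0, 0.0], [0.0, 1.0]]
out = expert_ffn(identity, identity, identity, [0.0, 1.0])
print(out)
```

Run unfused, this is three matvecs plus an elementwise step, each a separate pass over memory; the fused kernel loads x once and issues one launch.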

Layer 3 — Engine: license-clean llama.cpp fork

MIT-licensed, bolted onto stock llama.cpp without touching its core. New files in ggml/src/ggml-cuda/:

buft-moe-stream.{cu,cuh} — MoE-stream buffer type
moe-stream-source.{cu,cuh} — SSD-backed miss source with pinned-host staging
fused-moe-amd.{cu,cuh} — MUL_MAT_ID dispatcher into the AITER kernel
topk-moe.{cu,cuh} — router fusion

Plus ggml-moe-cache.{cpp,h} in ggml/src/ (Phase-1 LRU cache keyed by (layer_id, expert_id)) and tools/moe-pack/ for GGUF → per-(layer, projection) extraction.
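The cache policy itself is small. Below is a minimal sketch of an LRU keyed by (layer_id, expert_id) with a byte budget, in the spirit of `--moe-cache-mb`. The class name, the `load_fn` callback, and the eviction details are assumptions for illustration, not the fork's implementation.

```python
from collections import OrderedDict

class ExpertLRU:
    """LRU cache keyed by (layer_id, expert_id) with a byte budget."""

    def __init__(self, capacity_bytes, load_fn):
        self.capacity = capacity_bytes
        self.load_fn = load_fn          # called on a miss; stands in for the SSD read
        self.entries = OrderedDict()    # key -> blob, least recently used first
        self.used = 0
        self.hits = 0
        self.misses = 0

    def get(self, layer_id, expert_id):
        key = (layer_id, expert_id)
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        blob = self.load_fn(layer_id, expert_id)
        # Evict least recently used entries until the new blob fits.
        while self.entries and self.used + len(blob) > self.capacity:
            _, old = self.entries.popitem(last=False)
            self.used -= len(old)
        self.entries[key] = blob
        self.used += len(blob)
        return blob

# Budget of two 4-byte experts; access pattern 1, 2, 1, 3, 2 within layer 0.
cache = ExpertLRU(capacity_bytes=8, load_fn=lambda layer, expert: bytes([expert]) * 4)
for expert in (1, 2, 1, 3, 2):
    cache.get(0, expert)
print(cache.hits, cache.misses, list(cache.entries))
```

The hit counter is the same quantity the multi-tenant run logs as its expert-cache hit rate.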

Discovery — gfx1151 UMA fast-path. On Strix Halo's unified LPDDR5x, hipMalloc returns a CPU-accessible buffer with larger TLB pages than hipMallocHost, giving +27% GPU read bandwidth (220 vs 173 GB/s). We couldn't find this in AMD's HIP programming guide. Runtime-detected in the fork.

Layer 4 — Agent surface (zero glue)

Stock llama-server already speaks OpenAI-compatible /v1/chat/completions with SSE streaming, tool_calls, multi-turn role: "tool" results, and Anthropic Messages. Point any OpenAI client at llama-server --moe-stream-dir DIR and you have a tool-calling agent on a streaming MoE.
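Consuming the SSE stream needs little code. This is a minimal sketch assuming the OpenAI-style chunk format (`data:` lines, `choices[].delta.content`, and a final `[DONE]` sentinel); the function name is illustrative and the sample lines are hand-written, not captured output.

```python
import json

def collect_sse_text(lines):
    """Concatenate delta.content from OpenAI-style chat-completion SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print(collect_sse_text(sample))
```

In practice the lines would come from iterating over the HTTP response body of a request sent with `"stream": true`.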

Numbers

All measured on the published fork. Qwen3-30B-A3B-Q4_K_M for cross-arch validation; frontier models on a DigitalOcean MI300X droplet.

Cross-arch validation — same model, same fork, same flags

Tier	GPU	Memory	tg64 (tok/s)	tg128 (tok/s)	pp64 (tok/s)	Streams?
Datacenter	MI300X VF	192 GB HBM3	33.5*	38.4	44.1	yes¹
Datacenter (full cache)	MI300X VF	192 GB HBM3	77.3	80.1	304	no¹
Workstation	RTX PRO 6000	96 GB GDDR7	42.5	42.6	49.6	no
Consumer NV	RTX 4090	24 GB GDDR6X	40.5	46.1	42.7	no
Consumer AMD	RX 9070 XT	16 GB GDDR6	21.7	25.8	12.9	required
APU	Strix Halo	32 GB UMA LPDDR5x	27.0	29.5	38.1	no

¹ MI300X cells measured on a DigitalOcean GPU droplet (virtio NVMe, ROCm 7.2). The forced-streaming row (--moe-cache-mb 8192) bottlenecks on the virtio SSD; the full-cache row (--moe-cache-mb 18000) is HBM3-bound, at roughly 1.9× the RTX 4090 on tg64 (77.3 vs 40.5) and 1.7× on tg128 (80.1 vs 46.1), consistent in direction with HBM3's bandwidth advantage over GDDR6X.

Frontier scale, single MI300X

Model	Q4_K_M size	Active params	tg64 / tg128 (tok/s)	pp64 (tok/s)	cache (MB)
Qwen3.5-397B-A17B	227 GiB	17 B (K=10/512)	7.4 / 8.0	7.6	100 000
Qwen3.5-397B-A17B (forced stream)	227 GiB	17 B	1.9 / 1.9	4.0	8 192
DeepSeek-V3 671B	377 GiB	37 B (K=8/256)	0.36 (tg32)	0.86	8 192

Streamed expert footprint: 218 GiB (Qwen) and 367 GiB (DSV3) of expert tensors. Without the streaming buffer type, neither model fits a -ngl 99 config on this hardware: the non-expert weights alone fit, but the full expert table does not fit in HBM3.
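The fit claim can be checked with unit conversion alone. A minimal sketch, assuming the 192 GB figure is decimal gigabytes and ignoring non-expert weights and KV cache, which would only make the fit worse:

```python
GIB = 2**30   # binary gibibyte, used for the model sizes above
GB = 10**9    # decimal gigabyte, assumed for the 192 GB HBM3 figure

def fits_in_vram(size_gib, vram_gb, cards=1):
    """True if size_gib of weights fit in the combined VRAM of `cards` devices."""
    return size_gib * GIB <= vram_gb * GB * cards

for name, gib in (("Qwen3.5-397B experts", 218), ("DeepSeek-V3 experts", 367)):
    print(f"{name}: {gib * GIB / GB:.0f} GB, fits one MI300X: {fits_in_vram(gib, 192)}")
```

Both checks come out false, which is why the expert tensors have to live behind the cache rather than in HBM3.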

Multi-tenant on one MI300X

One model load. Eight concurrent users via --parallel 8 -c 8192 on Qwen3.5-397B with --moe-cache-mb 80000:

Metric	Value
Aggregate throughput (8 sessions)	17.5 tok/s
Per-session steady-state	2.2 tok/s
Per-request p50 / p95 (256 tok)	116.5 s / 116.9 s
Expert-cache hit rate (logged)	80.4%
VRAM watermark	94 GB / 192 GB

Single-session tg128 is 8.0 tok/s, so the 8-session aggregate of 17.5 tok/s is about 2.2× a single session. Multiple sessions amortize the same expert-stream I/O over more output tokens; the 80% cache hit rate is what makes 8-way batching pay off so cleanly.
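The figures in the table are mutually consistent, which a few divisions confirm. The sketch below uses only the rounded values reported here, so its results are approximate; the function name is illustrative.

```python
def multi_tenant_summary(aggregate_tps, sessions, single_tps, tokens_per_request):
    """Derive per-session rate, scaling over one session, and time per request."""
    per_session = aggregate_tps / sessions
    return {
        "per_session_tps": per_session,
        "scaling_vs_single": aggregate_tps / single_tps,
        "request_seconds": tokens_per_request / per_session,
    }

summary = multi_tenant_summary(
    aggregate_tps=17.5, sessions=8, single_tps=8.0, tokens_per_request=256
)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

The derived time for a 256-token request lands within a second of the logged p50, as expected when all eight sessions run at the same steady-state rate.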

Demo — tool-calling agent on streaming MoE

Stock llama-server with one extra flag (--moe-stream-dir) plus --jinja. Two API round-trips: model picks the tool, we feed back a fake JSON result, model writes a natural-language answer.

RX 9070 XT, Qwen3-30B-A3B-Q4_K_M, --moe-cache-mb 8192. Real terminal session, asciinema-rendered.
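The two round-trips reduce to how the message list grows. A minimal sketch using the OpenAI tool-calling message shapes; the `get_weather` tool, its schema, and the assistant message are hand-written stand-ins, not output captured from the demo.

```python
import json

def tool_request(messages, tools, max_tokens=256):
    """JSON body for POST /v1/chat/completions with tool definitions attached."""
    return json.dumps({"messages": messages, "tools": tools, "max_tokens": max_tokens})

def append_tool_result(messages, assistant_message, result):
    """Second round trip: echo the assistant's tool_calls, then answer each with role 'tool'."""
    out = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls", []):
        out.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return out

# Hypothetical tool; the name and schema are illustrative only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
first_body = tool_request(messages, tools)          # round trip 1: model picks the tool

# Stand-in for the assistant message returned by round trip 1.
assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'},
    }],
}
followup = append_tool_result(messages, assistant, {"temp_c": 4, "sky": "overcast"})
second_body = tool_request(followup, tools)         # round trip 2: model writes the answer
print([m["role"] for m in followup])
```

Each body would be POSTed to `/v1/chat/completions` on the running server, the same endpoint the curl example further down uses.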

Run it on your hardware

No live demo backend on this Space — by design. Instead, paste these into a shell on your AMD or NVIDIA GPU box. Total time from clone to first token: ~10 min on a fast SSD (excluding model download).

Prerequisites (AMD / HIP)

ROCm 7.2+ (the rocm/dev-ubuntu-24.04:latest container works for gfx1151; bare metal works for gfx942/gfx1201)
cmake, ninja-build, git, libcurl4-openssl-dev
Free SSD for the GGUF plus packed experts: ~40 GB (Qwen3 30B), ~600 GB (Qwen3.5 397B), ~800 GB (DSV3 671B)

Build the fork (~5 min)

# clone the MIT-licensed streaming-MoE fork
git clone -b feature/moe-expert-gpu-cache --depth=1 \
    https://github.com/ssubbotin/llama.cpp.git
cd llama.cpp

# pick your AMD arch:
#   gfx942  = MI300X
#   gfx1100 = RX 7900 XTX
#   gfx1201 = RX 9070 XT
#   gfx1151 = Strix Halo APU
cmake -B build -G Ninja \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx942 \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON \
  -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
  -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++

cmake --build build -j $(nproc)

Download a GGUF + pack experts (~5–30 min depending on model size)

# 30B example (smallest, fits on RX 9070 XT)
pip install -U "huggingface_hub[hf_xet]" hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
    unsloth/Qwen3-30B-A3B-GGUF \
    --include "Q4_K_M/*" --local-dir ./models/qwen30b

# extract per-(layer, projection) expert files for streaming
mkdir -p ./experts/qwen30b
./build/bin/moe-pack \
    ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    ./experts/qwen30b
# → 144 files, ~17 GiB
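A quick sanity check on the pack output is to count files and sum their sizes. This sketch assumes moe-pack writes its files flat into the target directory, which is an assumption about the tool's layout; the helper name is illustrative.

```python
import os

def summarize_pack_dir(path):
    """Return (file_count, total_bytes) for regular files directly inside path."""
    count, total = 0, 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                total += entry.stat().st_size
    return count, total

# Directory from the pack step above; the check is skipped if it is absent.
pack_dir = "./experts/qwen30b"
if os.path.isdir(pack_dir):
    n_files, n_bytes = summarize_pack_dir(pack_dir)
    print(f"{n_files} files, {n_bytes / 2**30:.1f} GiB")
```

For the 30B model the result should match the 144 files and ~17 GiB noted above; a lower count suggests the pack step was interrupted.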

Bench it

# Forced-streaming bench (cache < experts, OS page cache helps)
./build/bin/llama-bench \
    -m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    --moe-stream-dir ./experts/qwen30b \
    --moe-cache-mb 8192 \
    -ngl 99 \
    -p 64 -n 64,128 -r 3

# Cache-fits-all bench (HBM-bound upper bound)
./build/bin/llama-bench \
    -m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    --moe-stream-dir ./experts/qwen30b \
    --moe-cache-mb 18000 \
    -ngl 99 \
    -p 64 -n 64,128 -r 3
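To compare the two runs numerically, the bench output can be parsed instead of read by eye. This sketch assumes the fork keeps upstream llama-bench's `-o json` output with its `n_prompt`, `n_gen`, and `avg_ts` fields; that is an assumption about the fork, and the sample rows below are built from the MI300X numbers in the cross-arch table rather than real bench output.

```python
import json

def throughput_by_test(bench_json_text):
    """Map 'pp64' / 'tg128' style labels to average tokens per second."""
    results = {}
    for row in json.loads(bench_json_text):
        if row.get("n_gen", 0) > 0:
            label = f"tg{row['n_gen']}"
        else:
            label = f"pp{row['n_prompt']}"
        results[label] = row["avg_ts"]
    return results

def relative_gap(streamed, cached):
    """Fraction of throughput lost by streaming relative to the full-cache run."""
    return {k: 1.0 - streamed[k] / cached[k] for k in streamed if k in cached}

streamed_json = json.dumps([
    {"n_prompt": 64, "n_gen": 0, "avg_ts": 44.1},
    {"n_prompt": 0, "n_gen": 64, "avg_ts": 33.5},
    {"n_prompt": 0, "n_gen": 128, "avg_ts": 38.4},
])
cached_json = json.dumps([
    {"n_prompt": 64, "n_gen": 0, "avg_ts": 304.0},
    {"n_prompt": 0, "n_gen": 64, "avg_ts": 77.3},
    {"n_prompt": 0, "n_gen": 128, "avg_ts": 80.1},
])
gap = relative_gap(throughput_by_test(streamed_json), throughput_by_test(cached_json))
for label, value in sorted(gap.items()):
    print(f"{label}: {value:.0%} slower when streaming")
```

On real hardware, each JSON string would be replaced by the text captured from one of the two bench commands.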

Run it as a server

# OpenAI-compatible chat-completions on port 8080
./build/bin/llama-server \
    -m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    --moe-stream-dir ./experts/qwen30b \
    --moe-cache-mb 8192 \
    -ngl 99 \
    --host 0.0.0.0 --port 8080 \
    --jinja \
    -c 8192

# in another shell:
curl http://localhost:8080/v1/chat/completions \
    -H 'content-type: application/json' \
    -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":64}'

Frontier-scale models (MI300X-class hardware)

# Qwen3.5-397B (~244 GB GGUF, 218 GiB packed experts)
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
    unsloth/Qwen3.5-397B-A17B-GGUF \
    --include "Q4_K_M/*" --local-dir ./models/qwen397b

# moe-pack must run once per shard. The loop starts at shard 2: in these split GGUFs the metadata sits in shard 1, which is skipped here.
mkdir -p ./experts/qwen397b
for s in 2 3 4 5 6; do
    ./build/bin/moe-pack \
        ./models/qwen397b/Q4_K_M/Qwen3.5-397B-A17B-Q4_K_M-0000${s}-of-00006.gguf \
        ./experts/qwen397b
done

./build/bin/llama-bench \
    -m ./models/qwen397b/Q4_K_M/Qwen3.5-397B-A17B-Q4_K_M-00001-of-00006.gguf \
    --moe-stream-dir ./experts/qwen397b \
    --moe-cache-mb 100000 \
    -ngl 99 \
    -p 64 -n 64,128 -r 3
# expected: ~7.4 / 8.0 tok/s (tg64 / tg128) on MI300X
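The shell loop hard-codes `0000${s}`, which only works for single-digit shard numbers. A small sketch that generates zero-padded shard paths for any shard count, assuming the `<stem>-NNNNN-of-NNNNN.gguf` naming seen above; the function name is illustrative.

```python
def shard_paths(directory, stem, total, skip_first=True):
    """Yield split-GGUF shard paths named <stem>-NNNNN-of-NNNNN.gguf."""
    start = 2 if skip_first else 1
    for index in range(start, total + 1):
        yield f"{directory}/{stem}-{index:05d}-of-{total:05d}.gguf"

paths = list(shard_paths("./models/qwen397b/Q4_K_M", "Qwen3.5-397B-A17B-Q4_K_M", 6))
for path in paths:
    print(path)
```

Each path can then be passed to `moe-pack` with the same output directory, mirroring the loop above.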

Prerequisites (NVIDIA / CUDA)

CUDA 12.4+ with NVCC (12.8+ for Blackwell targets), an Ada/Hopper/Blackwell GPU
cmake, ninja-build, git, libcurl4-openssl-dev

Build for CUDA

git clone -b feature/moe-expert-gpu-cache --depth=1 \
    https://github.com/ssubbotin/llama.cpp.git
cd llama.cpp

# pick your CUDA arch:
#   89 = Ada (RTX 4090 / RTX 6000 Ada)
#   90 = Hopper (H100)
#   100 = Blackwell datacenter
#   120 = Blackwell consumer and workstation (RTX 50-series, RTX PRO 6000)
cmake -B build -G Ninja \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON

cmake --build build -j $(nproc)

The download, pack, bench, and serve commands are the same as in the HIP instructions above.

The streaming buffer type, expert-cache, and --moe-stream-dir flag all work identically on NVIDIA — the AITER Triton kernel from PR #3034 also runs on Blackwell unchanged.

Artifacts

Layer	Artifact	License
Kernel — open upstream	ROCm/aiter#3034 (scattered-pointer Q4_K_M MoE matvec)	MIT
Kernel — companion (PR2 ready)	ssubbotin/aiter @ streaming-moe-mlx-flat4 (q4_flat fused gate+up+SwiGLU+down)	MIT
Engine — primary deliverable	ssubbotin/llama.cpp @ feature/moe-expert-gpu-cache	MIT
Agent surface	Upstream `llama-server` — no glue	MIT

Every artifact is MIT-licensed with explicit SPDX headers. The fork has not been opened upstream as a PR — llama.cpp's AGENTS.md blocks AI-assisted PRs, so the fork stands as the public deliverable. Anyone can clone, build, and run it.