Streaming-MoE Stack for AMD

Kernel → Engine → Agent. From MI300X to RX 9070 XT.

A 397-billion-parameter model on a 16 GB consumer GPU.

The problem

Frontier MoE models are gated by VRAM. The only consumer hardware that fits the weights costs $15K+.

ModelQ4 sizeFits in
Mixtral 8×22B80 GBRTX PRO 6000 (97 GB HBM3e)
Qwen3.5-397B-A17B227 GiBonly MI300X ($15K)
DeepSeek-V3 671B377 GiB2× MI300X

A $700 RX 9070 XT has 16 GB GDDR6. The full model doesn't fit. But the active weights — K of N experts per layer — are tiny. That's the opening.

The thesis

Stream the experts from NVMe. Each token activates K of several hundred experts; load just those K from the SSD on demand.

The key move is putting the OS page cache between VRAM and SSD as a third memory tier:

TierWhereBandwidthCapacityManaged by
1VRAM (LRU expert cache)HBM / GDDR (~1–5 TB/s)--moe-cache-mb Nfork (explicit)
2OS page cache (host RAM)DDR5 (~80 GB/s)whatever's freekernel (free)
3NVMe SSD (per-expert files)5–17 GB/sterabytesfilesystem

On RX 9070 XT, this makes streaming nearly invisible — 13% gap between cold-cache and warm-cache steady state. The rest of the gap to a workstation GPU is bandwidth, not streaming overhead. NVMe is 100× cheaper per GB than DRAM and scales to the full 671B-parameter frontier on a single MI300X.

The stack

Four layers, all MIT-licensed.

Layer 1 — Kernel primitive (open upstream as ROCm/aiter#3034)

A scattered-pointer fused MoE matvec for GGUF Q4_K_M. Stock AITER assumes all experts live in one contiguous (E, N, K) device buffer — true when all experts fit in VRAM. Streaming MoE has K active experts in K different LRU cache slots fed from SSD. The new kernel takes per-expert pointers and a token-to-slot remap directly, no per-dispatch slab copy.

Same Triton source compiles and runs unchanged on four architectures. 18/18 correctness tests pass on all four.

ArchitectureGPUPR1 peakPR2 peak (q4_flat)
CDNA 3MI300X695 GB/s525 GB/s
BlackwellRTX PRO 6000529333
RDNA 4RX 9070 XT270187
gfx1151Strix Halo APU12177

Layer 2 — Format diversification + activation fusion

A companion kernel that adds a simpler 4-bit format ("q4_flat") and fuses gate + up + SwiGLU + down into a single dispatch. Reads ~2× bytes per dispatch in only ~1.5× the wall-time vs the matvec — fusion's shared x-load and single launch consistently pay off.

Layer 3 — Engine: license-clean llama.cpp fork

MIT-licensed, bolted onto stock llama.cpp without touching its core. New files in ggml/src/ggml-cuda/:

Plus ggml-moe-cache.{cpp,h} in ggml/src/ (Phase-1 LRU cache keyed by (layer_id, expert_id)) and tools/moe-pack/ for GGUF → per-(layer, projection) extraction.

Discovery — gfx1151 UMA fast-path. On Strix Halo's unified LPDDR5x, hipMalloc returns a CPU-accessible buffer with larger TLB pages than hipMallocHost, giving +27% GPU read bandwidth (220 vs 173 GB/s). We couldn't find this in AMD's HIP programming guide. Runtime-detected in the fork.

Layer 4 — Agent surface (zero glue)

Stock llama-server already speaks OpenAI-compatible /v1/chat/completions with SSE streaming, tool_calls, multi-turn role: "tool" results, and Anthropic Messages. Point any OpenAI client at llama-server --moe-stream-dir DIR and you have a tool-calling agent on a streaming MoE.

Numbers

All measured on the published fork. Qwen3-30B-A3B-Q4_K_M for cross-arch validation; frontier models on a DigitalOcean MI300X droplet.

Cross-arch validation — same model, same fork, same flags

TierGPUMemorytg64tg128pp64Streams?
DatacenterMI300X VF192 GB HBM333.5*38.444.1yes¹
Datacenter (full cache)MI300X VF192 GB HBM377.380.1304no¹
WorkstationRTX PRO 600097 GB HBM3e42.542.649.6no
Consumer NVRTX 409024 GB GDDR6X40.546.142.7no
Consumer AMDRX 9070 XT16 GB GDDR621.725.812.9required
APUStrix Halo32 GB UMA LPDDR5x27.029.538.1no

¹ MI300X cells on DigitalOcean GPU droplet (virtio NVMe, ROCm 7.2). Forced-streaming row (--moe-cache-mb 8192) bottlenecks on virtio SSD; full-cache row (--moe-cache-mb 18000) is HBM3-bound — roughly 1.9× the RTX 4090, in line with HBM3's bandwidth advantage over GDDR6X.

Frontier scale, single MI300X

ModelQ4_K_M sizeActive paramstgpp64cache (MB)
Qwen3.5-397B-A17B227 GiB17 B (K=10/512)7.4 / 8.07.6100 000
Qwen3.5-397B-A17B (forced stream)227 GiB17 B1.9 / 1.94.08 192
DeepSeek-V3 671B377 GiB37 B (K=8/256)0.36 (tg32)0.868 192

Active expert footprint: 218 GiB (Qwen) and 367 GiB (DSV3) of streamed expert tensors. Without the streaming buffer type, neither model fits a -ngl 99 config on this hardware — non-expert weights alone fit, but loading the expert table into HBM3 doesn't.

Multi-tenant on one MI300X

One model load. Eight concurrent users via --parallel 8 -c 8192 on Qwen3.5-397B with --moe-cache-mb 80000:

MetricValue
Aggregate throughput (8 sessions)17.5 tg/s
Per-session steady-state2.2 tg/s
Per-request p50 / p95 (256 tok)116.5 s / 116.9 s
Expert-cache hit rate (logged)80.4%
VRAM watermark94 GB / 192 GB

Single-session tg128 = 8.0 → 8-session aggregate = 2.18× single. Multiple sessions amortize the same expert-stream I/O into more output tokens; the 80% cache hit rate is what makes 8-way batching pay back so cleanly.

Demo — tool-calling agent on streaming MoE

Stock llama-server with one extra flag (--moe-stream-dir) plus --jinja. Two API round-trips: model picks the tool, we feed back a fake JSON result, model writes a natural-language answer.

RX 9070 XT, Qwen3-30B-A3B-Q4_K_M, --moe-cache-mb 8192. Real terminal session, asciinema-rendered.

Run it on your hardware

No live demo backend on this Space — by design. Instead, paste these into a shell on your AMD or NVIDIA GPU box. Total time from clone to first token: ~10 min on a fast SSD (excluding model download).

Prerequisites

  • ROCm 7.2+ (the rocm/dev-ubuntu-24.04:latest container works for gfx1151; bare metal works for gfx942/gfx1201)
  • cmake, ninja-build, git, libcurl4-openssl-dev
  • ~250 GB free SSD (per Qwen 30B), ~600 GB (per Qwen 397B), ~750 GB (per DSV3 671B)

Build the fork (~5 min)

# clone the MIT-licensed streaming-MoE fork
git clone -b feature/moe-expert-gpu-cache --depth=1 \
    https://github.com/ssubbotin/llama.cpp.git
cd llama.cpp

# pick your AMD arch:
#   gfx942  = MI300X
#   gfx1100 = RX 7900 XTX
#   gfx1201 = RX 9070 XT
#   gfx1151 = Strix Halo APU
cmake -B build -G Ninja \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx942 \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON \
  -DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
  -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++

cmake --build build -j $(nproc)

Download a GGUF + pack experts (~5–30 min depending on model size)

# 30B example (smallest, fits on RX 9070 XT)
pip install -U "huggingface_hub[hf_xet]" hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
    unsloth/Qwen3-30B-A3B-GGUF \
    --include "Q4_K_M/*" --local-dir ./models/qwen30b

# extract per-(layer, projection) expert files for streaming
mkdir -p ./experts/qwen30b
./build/bin/moe-pack \
    ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    ./experts/qwen30b
# → 144 files, ~17 GiB

Bench it

# Forced-streaming bench (cache < experts, OS pagecache helps)
./build/bin/llama-bench \
    -m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    --moe-stream-dir ./experts/qwen30b \
    --moe-cache-mb 8192 \
    -ngl 99 \
    -p 64 -n 64,128 -r 3

# Cache-fits-all bench (HBM-bound upper bound)
./build/bin/llama-bench \
    -m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    --moe-stream-dir ./experts/qwen30b \
    --moe-cache-mb 18000 \
    -ngl 99 \
    -p 64 -n 64,128 -r 3

Run it as a server

# OpenAI-compatible chat-completions on port 8080
./build/bin/llama-server \
    -m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
    --moe-stream-dir ./experts/qwen30b \
    --moe-cache-mb 8192 \
    -ngl 99 \
    --host 0.0.0.0 --port 8080 \
    --jinja \
    -c 8192

# in another shell:
curl http://localhost:8080/v1/chat/completions \
    -H 'content-type: application/json' \
    -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":64}'

Frontier-scale models (MI300X-class hardware)

# Qwen3.5-397B (~244 GB GGUF, 218 GiB packed experts)
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
    unsloth/Qwen3.5-397B-A17B-GGUF \
    --include "Q4_K_M/*" --local-dir ./models/qwen397b

# moe-pack must run per-shard (split GGUFs hold metadata in shard 1 only)
mkdir -p ./experts/qwen397b
for s in 2 3 4 5 6; do
    ./build/bin/moe-pack \
        ./models/qwen397b/Q4_K_M/Qwen3.5-397B-A17B-Q4_K_M-0000${s}-of-00006.gguf \
        ./experts/qwen397b
done

./build/bin/llama-bench \
    -m ./models/qwen397b/Q4_K_M/Qwen3.5-397B-A17B-Q4_K_M-00001-of-00006.gguf \
    --moe-stream-dir ./experts/qwen397b \
    --moe-cache-mb 100000 \
    -ngl 99 \
    -p 64 -n 64,128 -r 3
# expected: ~7.4 / 8.0 tg/s on MI300X

Prerequisites

  • CUDA 12.4+ with NVCC, an Ada/Hopper/Blackwell GPU
  • cmake, ninja-build, git, libcurl4-openssl-dev

Build for CUDA

git clone -b feature/moe-expert-gpu-cache --depth=1 \
    https://github.com/ssubbotin/llama.cpp.git
cd llama.cpp

# pick your CUDA arch:
#   89 = Ada (RTX 4090 / RTX 6000 Ada)
#   100 = Blackwell datacenter
#   120 = Blackwell consumer (RTX 50-series)
cmake -B build -G Ninja \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON

cmake --build build -j $(nproc)

Same download / pack / bench / serve commands as the HIP tab.

The streaming buffer type, expert-cache, and --moe-stream-dir flag all work identically on NVIDIA — the AITER Triton kernel from PR #3034 also runs on Blackwell unchanged.

Artifacts

LayerArtifactLicense
Kernel — open upstreamROCm/aiter#3034 (scattered-pointer Q4_K_M MoE matvec)MIT
Kernel — companion (PR2 ready)ssubbotin/aiter @ streaming-moe-mlx-flat4 (q4_flat fused gate+up+SwiGLU+down)MIT
Engine — primary deliverablessubbotin/llama.cpp @ feature/moe-expert-gpu-cacheMIT
Agent surfaceUpstream llama-server — no glueMIT

Every artifact is MIT-licensed with explicit SPDX headers. The fork has not been opened upstream as a PR — llama.cpp's AGENTS.md blocks AI-assisted PRs, so the fork stands as the public deliverable. Anyone can clone, build, and run it.