The problem
Frontier MoE models are gated by VRAM. The only consumer hardware that fits the weights costs $15K+.
| Model | Q4 size | Fits in |
|---|---|---|
| Mixtral 8×22B | 80 GB | RTX PRO 6000 (97 GB HBM3e) |
| Qwen3.5-397B-A17B | 227 GiB | only MI300X ($15K) |
| DeepSeek-V3 671B | 377 GiB | 2× MI300X |
A $700 RX 9070 XT has 16 GB GDDR6. The full model doesn't fit. But the active weights — K of N experts per layer — are tiny. That's the opening.
The thesis
Stream the experts from NVMe. Each token activates K of several hundred experts; load just those K from the SSD on demand.
The key move is putting the OS page cache between VRAM and SSD as a third memory tier:
| Tier | Where | Bandwidth | Capacity | Managed by |
|---|---|---|---|---|
| 1 | VRAM (LRU expert cache) | HBM / GDDR (~1–5 TB/s) | --moe-cache-mb N | fork (explicit) |
| 2 | OS page cache (host RAM) | DDR5 (~80 GB/s) | whatever's free | kernel (free) |
| 3 | NVMe SSD (per-expert files) | 5–17 GB/s | terabytes | filesystem |
On RX 9070 XT, this makes streaming nearly invisible — 13% gap between cold-cache and warm-cache steady state. The rest of the gap to a workstation GPU is bandwidth, not streaming overhead. NVMe is 100× cheaper per GB than DRAM and scales to the full 671B-parameter frontier on a single MI300X.
The stack
Four layers, all MIT-licensed.
Layer 1 — Kernel primitive (open upstream as ROCm/aiter#3034)
A scattered-pointer fused MoE matvec for GGUF Q4_K_M. Stock AITER assumes all experts live in one contiguous (E, N, K) device buffer — true when all experts fit in VRAM. Streaming MoE has K active experts in K different LRU cache slots fed from SSD. The new kernel takes per-expert pointers and a token-to-slot remap directly, no per-dispatch slab copy.
Same Triton source compiles and runs unchanged on four architectures. 18/18 correctness tests pass on all four.
| Architecture | GPU | PR1 peak | PR2 peak (q4_flat) |
|---|---|---|---|
| CDNA 3 | MI300X | 695 GB/s | 525 GB/s |
| Blackwell | RTX PRO 6000 | 529 | 333 |
| RDNA 4 | RX 9070 XT | 270 | 187 |
| gfx1151 | Strix Halo APU | 121 | 77 |
Layer 2 — Format diversification + activation fusion
A companion kernel that adds a simpler 4-bit format ("q4_flat") and fuses gate + up + SwiGLU + down into a single dispatch. Reads ~2× bytes per dispatch in only ~1.5× the wall-time vs the matvec — fusion's shared x-load and single launch consistently pay off.
Layer 3 — Engine: license-clean llama.cpp fork
MIT-licensed, bolted onto stock llama.cpp without touching its core. New files in ggml/src/ggml-cuda/:
buft-moe-stream.{cu,cuh}— MoE-stream buffer typemoe-stream-source.{cu,cuh}— SSD-backed miss source with pinned-host stagingfused-moe-amd.{cu,cuh}—MUL_MAT_IDdispatcher into the AITER kerneltopk-moe.{cu,cuh}— router fusion
Plus ggml-moe-cache.{cpp,h} in ggml/src/ (Phase-1 LRU cache keyed by (layer_id, expert_id)) and tools/moe-pack/ for GGUF → per-(layer, projection) extraction.
Discovery — gfx1151 UMA fast-path. On Strix Halo's unified LPDDR5x, hipMalloc returns a CPU-accessible buffer with larger TLB pages than hipMallocHost, giving +27% GPU read bandwidth (220 vs 173 GB/s). We couldn't find this in AMD's HIP programming guide. Runtime-detected in the fork.
Layer 4 — Agent surface (zero glue)
Stock llama-server already speaks OpenAI-compatible /v1/chat/completions with SSE streaming, tool_calls, multi-turn role: "tool" results, and Anthropic Messages. Point any OpenAI client at llama-server --moe-stream-dir DIR and you have a tool-calling agent on a streaming MoE.
Numbers
All measured on the published fork. Qwen3-30B-A3B-Q4_K_M for cross-arch validation; frontier models on a DigitalOcean MI300X droplet.
Cross-arch validation — same model, same fork, same flags
| Tier | GPU | Memory | tg64 | tg128 | pp64 | Streams? |
|---|---|---|---|---|---|---|
| Datacenter | MI300X VF | 192 GB HBM3 | 33.5* | 38.4 | 44.1 | yes¹ |
| Datacenter (full cache) | MI300X VF | 192 GB HBM3 | 77.3 | 80.1 | 304 | no¹ |
| Workstation | RTX PRO 6000 | 97 GB HBM3e | 42.5 | 42.6 | 49.6 | no |
| Consumer NV | RTX 4090 | 24 GB GDDR6X | 40.5 | 46.1 | 42.7 | no |
| Consumer AMD | RX 9070 XT | 16 GB GDDR6 | 21.7 | 25.8 | 12.9 | required |
| APU | Strix Halo | 32 GB UMA LPDDR5x | 27.0 | 29.5 | 38.1 | no |
¹ MI300X cells on DigitalOcean GPU droplet (virtio NVMe, ROCm 7.2). Forced-streaming row (--moe-cache-mb 8192) bottlenecks on virtio SSD; full-cache row (--moe-cache-mb 18000) is HBM3-bound — roughly 1.9× the RTX 4090, in line with HBM3's bandwidth advantage over GDDR6X.
Frontier scale, single MI300X
| Model | Q4_K_M size | Active params | tg | pp64 | cache (MB) |
|---|---|---|---|---|---|
| Qwen3.5-397B-A17B | 227 GiB | 17 B (K=10/512) | 7.4 / 8.0 | 7.6 | 100 000 |
| Qwen3.5-397B-A17B (forced stream) | 227 GiB | 17 B | 1.9 / 1.9 | 4.0 | 8 192 |
| DeepSeek-V3 671B | 377 GiB | 37 B (K=8/256) | 0.36 (tg32) | 0.86 | 8 192 |
Active expert footprint: 218 GiB (Qwen) and 367 GiB (DSV3) of streamed expert tensors. Without the streaming buffer type, neither model fits a -ngl 99 config on this hardware — non-expert weights alone fit, but loading the expert table into HBM3 doesn't.
Multi-tenant on one MI300X
One model load. Eight concurrent users via --parallel 8 -c 8192 on Qwen3.5-397B with --moe-cache-mb 80000:
| Metric | Value |
|---|---|
| Aggregate throughput (8 sessions) | 17.5 tg/s |
| Per-session steady-state | 2.2 tg/s |
| Per-request p50 / p95 (256 tok) | 116.5 s / 116.9 s |
| Expert-cache hit rate (logged) | 80.4% |
| VRAM watermark | 94 GB / 192 GB |
Single-session tg128 = 8.0 → 8-session aggregate = 2.18× single. Multiple sessions amortize the same expert-stream I/O into more output tokens; the 80% cache hit rate is what makes 8-way batching pay back so cleanly.
Demo — tool-calling agent on streaming MoE
Stock llama-server with one extra flag (--moe-stream-dir) plus --jinja. Two API round-trips: model picks the tool, we feed back a fake JSON result, model writes a natural-language answer.
--moe-cache-mb 8192. Real terminal session, asciinema-rendered.Run it on your hardware
No live demo backend on this Space — by design. Instead, paste these into a shell on your AMD or NVIDIA GPU box. Total time from clone to first token: ~10 min on a fast SSD (excluding model download).
Prerequisites
- ROCm 7.2+ (the
rocm/dev-ubuntu-24.04:latestcontainer works for gfx1151; bare metal works for gfx942/gfx1201) cmake,ninja-build,git,libcurl4-openssl-dev- ~250 GB free SSD (per Qwen 30B), ~600 GB (per Qwen 397B), ~750 GB (per DSV3 671B)
Build the fork (~5 min)
# clone the MIT-licensed streaming-MoE fork
git clone -b feature/moe-expert-gpu-cache --depth=1 \
https://github.com/ssubbotin/llama.cpp.git
cd llama.cpp
# pick your AMD arch:
# gfx942 = MI300X
# gfx1100 = RX 7900 XTX
# gfx1201 = RX 9070 XT
# gfx1151 = Strix Halo APU
cmake -B build -G Ninja \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx942 \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=ON \
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/clang \
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++
cmake --build build -j $(nproc)
Download a GGUF + pack experts (~5–30 min depending on model size)
# 30B example (smallest, fits on RX 9070 XT)
pip install -U "huggingface_hub[hf_xet]" hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
unsloth/Qwen3-30B-A3B-GGUF \
--include "Q4_K_M/*" --local-dir ./models/qwen30b
# extract per-(layer, projection) expert files for streaming
mkdir -p ./experts/qwen30b
./build/bin/moe-pack \
./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
./experts/qwen30b
# → 144 files, ~17 GiB
Bench it
# Forced-streaming bench (cache < experts, OS pagecache helps)
./build/bin/llama-bench \
-m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
--moe-stream-dir ./experts/qwen30b \
--moe-cache-mb 8192 \
-ngl 99 \
-p 64 -n 64,128 -r 3
# Cache-fits-all bench (HBM-bound upper bound)
./build/bin/llama-bench \
-m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
--moe-stream-dir ./experts/qwen30b \
--moe-cache-mb 18000 \
-ngl 99 \
-p 64 -n 64,128 -r 3
Run it as a server
# OpenAI-compatible chat-completions on port 8080
./build/bin/llama-server \
-m ./models/qwen30b/Q4_K_M/Qwen3-30B-A3B-Q4_K_M.gguf \
--moe-stream-dir ./experts/qwen30b \
--moe-cache-mb 8192 \
-ngl 99 \
--host 0.0.0.0 --port 8080 \
--jinja \
-c 8192
# in another shell:
curl http://localhost:8080/v1/chat/completions \
-H 'content-type: application/json' \
-d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":64}'
Frontier-scale models (MI300X-class hardware)
# Qwen3.5-397B (~244 GB GGUF, 218 GiB packed experts)
HF_HUB_ENABLE_HF_TRANSFER=1 hf download \
unsloth/Qwen3.5-397B-A17B-GGUF \
--include "Q4_K_M/*" --local-dir ./models/qwen397b
# moe-pack must run per-shard (split GGUFs hold metadata in shard 1 only)
mkdir -p ./experts/qwen397b
for s in 2 3 4 5 6; do
./build/bin/moe-pack \
./models/qwen397b/Q4_K_M/Qwen3.5-397B-A17B-Q4_K_M-0000${s}-of-00006.gguf \
./experts/qwen397b
done
./build/bin/llama-bench \
-m ./models/qwen397b/Q4_K_M/Qwen3.5-397B-A17B-Q4_K_M-00001-of-00006.gguf \
--moe-stream-dir ./experts/qwen397b \
--moe-cache-mb 100000 \
-ngl 99 \
-p 64 -n 64,128 -r 3
# expected: ~7.4 / 8.0 tg/s on MI300X
Prerequisites
- CUDA 12.4+ with NVCC, an Ada/Hopper/Blackwell GPU
cmake,ninja-build,git,libcurl4-openssl-dev
Build for CUDA
git clone -b feature/moe-expert-gpu-cache --depth=1 \
https://github.com/ssubbotin/llama.cpp.git
cd llama.cpp
# pick your CUDA arch:
# 89 = Ada (RTX 4090 / RTX 6000 Ada)
# 100 = Blackwell datacenter
# 120 = Blackwell consumer (RTX 50-series)
cmake -B build -G Ninja \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=89 \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=ON
cmake --build build -j $(nproc)
Same download / pack / bench / serve commands as the HIP tab.
The streaming buffer type, expert-cache, and --moe-stream-dir flag all work identically on NVIDIA — the AITER Triton kernel from PR #3034 also runs on Blackwell unchanged.
Artifacts
| Layer | Artifact | License |
|---|---|---|
| Kernel — open upstream | ROCm/aiter#3034 (scattered-pointer Q4_K_M MoE matvec) | MIT |
| Kernel — companion (PR2 ready) | ssubbotin/aiter @ streaming-moe-mlx-flat4 (q4_flat fused gate+up+SwiGLU+down) | MIT |
| Engine — primary deliverable | ssubbotin/llama.cpp @ feature/moe-expert-gpu-cache | MIT |
| Agent surface | Upstream llama-server — no glue | MIT |
Every artifact is MIT-licensed with explicit SPDX headers. The fork has not been opened upstream as a PR — llama.cpp's AGENTS.md blocks AI-assisted PRs, so the fork stands as the public deliverable. Anyone can clone, build, and run it.