Local AI

Q1_0 CUDA kernels were merged in b8806, a follow-up to the CPU-only backend (#21273). Bonsai throughput, all on an RTX 5090: 8B (1.07 GiB) at 374 tok/s tg128; 4B (540 MB) at 485 tok/s; 1.7B (231 MB) at 626 tok/s. Requires Turing MMA or newer. Works on some AMD GPUs. KL divergence (KLD) vs FP16: 0.0005, with the same top token 98.7% of the time.
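For readers unfamiliar with the two quality figures, here is a minimal sketch of how a mean KL divergence and a top-token agreement rate can be computed from two sets of per-position logits. It is a toy illustration only: the post does not say how its numbers were measured, and the function names (`softmax`, `kl_divergence`, `compare_logits`) and the tiny 3-token vocabulary are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats; p is the reference (e.g. FP16) distribution.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare_logits(ref_logits, test_logits):
    # Returns (mean KL divergence, fraction of positions with the same top token).
    kls = []
    same_top = 0
    for ref, test in zip(ref_logits, test_logits):
        kls.append(kl_divergence(softmax(ref), softmax(test)))
        if argmax(ref) == argmax(test):
            same_top += 1
    n = len(kls)
    return sum(kls) / n, same_top / n

# Hypothetical logits for two positions over a 3-token vocabulary.
reference = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
quantized = [[2.1, 0.9, 0.0], [0.0, 1.1, 1.9]]
mean_kld, top_match = compare_logits(reference, quantized)
print(f"mean KLD: {mean_kld:.6f}, same top token: {top_match:.1%}")
```

A lower mean KLD means the quantized model's output distribution stays closer to the reference, while the top-token rate only checks whether the single most likely token is unchanged.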

Apr 15, 2026, 11:18 PM