llama.cpp: Q1_0 1-Bit Quantization Gets CUDA Backend

Q1_0 CUDA kernels were merged in b8806 as a follow-up to the CPU-only backend (#21273). On an RTX 5090, Bonsai 8B (1.07 GiB) reaches 374 tok/s (tg128); 4B (540 MB): 485 tok/s; 1.7B (231 MB): 626 tok/s. Requires Turing-generation MMA support or newer. Works on some AMD GPUs. Quality vs FP16: KL divergence (KLD) of 0.0005, with the same top token 98.7% of the time.
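
For readers unfamiliar with the two quality figures, here is a minimal sketch of how a mean KL divergence and a same-top-token rate can be computed from two sets of logits. It is an illustration only: the source does not say how the reported numbers were measured, and the function names (`log_softmax`, `kld_and_top1`) and the toy logits are hypothetical, not part of llama.cpp.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kld_and_top1(ref_logits, test_logits):
    """Return (mean KL(ref || test), fraction of positions with the same argmax).

    ref_logits / test_logits: one list of vocabulary logits per token position,
    e.g. from an FP16 reference model and a quantized model on the same text.
    (Hypothetical helper for illustration; not llama.cpp's implementation.)
    """
    if not ref_logits or len(ref_logits) != len(test_logits):
        raise ValueError("need two non-empty logit sequences of equal length")

    total_kld = 0.0
    same_top = 0
    for ref, test in zip(ref_logits, test_logits):
        lp = log_softmax(ref)   # reference log-probabilities
        lq = log_softmax(test)  # quantized-model log-probabilities
        # KL(P || Q) = sum_i P_i * (log P_i - log Q_i)
        total_kld += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
        ref_top = max(range(len(ref)), key=ref.__getitem__)
        test_top = max(range(len(test)), key=test.__getitem__)
        same_top += ref_top == test_top

    n = len(ref_logits)
    return total_kld / n, same_top / n


# Toy data: two positions, three-token vocabulary (made-up values).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
test = [[1.9, 1.1, 0.1], [2.6, 2.5, 0.3]]
mean_kld, top1_rate = kld_and_top1(ref, test)
print(f"mean KLD = {mean_kld:.4f}, same top token = {top1_rate:.1%}")
```

A KLD near zero means the quantized model's next-token distribution is almost identical to the reference, while the top-token rate only checks whether the single most likely token matches.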