llama.cpp WebGPU: K-Quant Prefill up to 3.78x Faster

Merged June 8: ggml-webgpu now extracts 4 quant values per u32 instead of 1 for k-quants. pp512 on an M2 Pro: Q2_K 817→1991 t/s (2.44x), Q3_K 92→302 t/s (3.27x), Q4_K 243→327 t/s (1.34x), Q6_K 216→311 t/s (1.44x). Tested with Qwen3.5 4B and Gemma 4 E4B.
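
To make the "4 values per u32" idea concrete, here is a minimal Python sketch. It is not the ggml-webgpu shader code and does not follow the real k-quant block layouts: it assumes a simplified, hypothetical packing in which each byte of a 32-bit word carries one 4-bit quant in its low nibble, and the function names are made up for illustration. The point is only that a single mask over the whole word yields four values, where the baseline pulls out one value per call.

```python
def extract_one(word: int, i: int) -> int:
    """Baseline: pull a single 4-bit value (low nibble of byte i) from the word."""
    return (word >> (8 * i)) & 0x0F


def extract_four(word: int) -> tuple[int, int, int, int]:
    """Mask all four low nibbles at once, then split the word into its bytes."""
    masked = word & 0x0F0F0F0F  # hypothetical layout: one quant per byte
    return (
        masked & 0xFF,
        (masked >> 8) & 0xFF,
        (masked >> 16) & 0xFF,
        (masked >> 24) & 0xFF,
    )


word = 0xA4B3C2D1  # bytes, lowest first: 0xD1, 0xC2, 0xB3, 0xA4
print(extract_four(word))  # (1, 2, 3, 4)
```

Both functions return the same values; the second simply does the masking once per word rather than once per value, which is the general shape of the change described above.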