Local AI

The latest and greatest in open-source AI models of all kinds.

North Mini Code 1.0: Cohere's First Coding MoE

huggingface.co · Jun 9, 2026

30B total / 3B active MoE (128 experts, 8 active), 256K ctx. SWE-Bench Verified 67.6%, SWE-Bench Pro 40.2%, LiveCodeBench v6 70.3%. Beats Gemma4 26B-A4B and Devstral Small 2 on agentic coding. Two-stage SFT + RLVR on real repos. Apache 2.0. vLLM + Transformers.
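To make the "128 experts, 8 active" figure concrete, here is a minimal sketch of top-k expert routing for a single token. It is a generic illustration, not Cohere's router: the softmax-over-selected-experts weighting, the random logits, and the name `route_top_k` are all assumptions made for the example.

```python
import math
import random


def route_top_k(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    # Indices of the k largest router logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router output for one token: 128 experts, 8 selected.
random.seed(0)
router_logits = [random.gauss(0.0, 1.0) for _ in range(128)]
chosen = route_top_k(router_logits, k=8)
print(len(chosen), "of", len(router_logits), "experts run for this token")
```

With 8 of 128 experts selected, only 6.25% of the expert blocks run per token, which is how a 30B-parameter model can have roughly 3B active parameters once shared layers are counted.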

4 · Jun 9, 2026, 6:22 PM

SCAIL-2: End-to-End Character Animation Without Skeleton Maps (Z.ai)

github.com · Jun 9, 2026

Z.ai open-sources SCAIL-2 (Wan 2.1, 14B, Apache 2.0): end-to-end character animation with no skeleton maps or inpainting masks. Human2Any, Any2Any (animals, cartoons), cross-identity replacement, multi-character. Emergent: animal-driving, SAM3D-Body mesh zero-shot. ComfyUI day-0 (Comfy-Org/SCAIL-2).

7 · Jun 9, 2026, 3:59 PM

Apple CoreAI: Open-Source On-Device Inference for Apple Silicon

github.com · Jun 8, 2026

Apple open-sources CoreAI (BSD 3-Clause) at WWDC 2026 as a capable alternative to Core ML for on-device inference. Supports Qwen3, Qwen3 MoE, Gemma 3, Mistral, FLUX.2 klein, Whisper, SAM 3. Python export → .aimodel. INT4 quant, dynamic KV cache (macOS). Requires macOS/iOS 27+.

4 · Jun 9, 2026, 1:33 PM

llama.cpp WebGPU: K-Quant Prefill up to 3.78x Faster

github.com · Jun 8, 2026

Merged June 8: ggml-webgpu now extracts 4 quant values per u32 instead of 1 for k-quants. M2 Pro pp512: Q2_K 817→1991 t/s (2.44x), Q3_K 92→302 (3.27x), Q4_K 243→327 (1.34x), Q6_K 216→311 (1.44x). Qwen3.5 4B, Gemma 4 E4B tested.
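The "4 quant values per u32" idea can be pictured with a small bit-unpacking sketch. This is an illustration in Python, not the ggml-webgpu shader code: it assumes four 8-bit lanes stored low byte first, whereas real k-quant blocks pack narrower fields alongside scales. The function names are hypothetical.

```python
def unpack_one(word, lane):
    """Extract a single 8-bit lane (0..3) from a 32-bit word."""
    return (word >> (8 * lane)) & 0xFF


def unpack_four(word):
    """Extract all four 8-bit lanes from one 32-bit word, low byte first."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


packed = 0x04030201  # lanes 1, 2, 3, 4 from least to most significant byte

# One value per loaded word: three of the four lanes go unused.
print(unpack_one(packed, 0))

# Four values per loaded word: the same load feeds four multiply-adds.
print(unpack_four(packed))
```

The point of the sketch is that the memory load is the shared cost; pulling every lane out of a word that has already been fetched spreads that cost over four values instead of one.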

6 · Jun 9, 2026, 3:38 AM

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

www.lucebox.com · Jun 8, 2026

Pins the experts your traffic actually hits on GPU, swaps the cold tail async from RAM. Qwen3.6 35B-A3B: 13.3 GiB (was ~20.5); Laguna XS.2 33B-A3B: 14.6 GiB (was 18.8) — both under 16 GiB. ~100 tok/s (92% of 24GB ceiling). Self-tunes from live traffic. Apache 2.0: dflash_server <model.gguf> --spark
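The pinning idea in the Luce Spark item can be sketched as a simple frequency-based selection: count which experts recent traffic hits, then keep the most-hit ones in GPU memory up to a byte budget. This is a toy model under stated assumptions (uniform expert size, static counts, hypothetical names `pick_pinned_experts` and `hit_rate`), not Luce Spark's actual policy, and it leaves out the asynchronous swap of cold experts.

```python
from collections import Counter


def pick_pinned_experts(hits, expert_bytes, budget_bytes):
    """Choose the most frequently hit experts that fit in the GPU budget.

    hits maps expert id -> number of times recent traffic routed to it.
    """
    capacity = budget_bytes // expert_bytes
    # Most hits first; ties broken by expert id so the result is deterministic.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return {expert for expert, _ in ranked[:capacity]}


def hit_rate(hits, pinned):
    """Fraction of routed tokens served by experts already on the GPU."""
    total = sum(hits.values())
    if total == 0:
        return 0.0
    return sum(count for expert, count in hits.items() if expert in pinned) / total


# Hypothetical routing trace: expert ids hit by ten recent tokens.
trace = [0, 0, 0, 1, 1, 2, 3, 0, 1, 0]
hits = Counter(trace)

# Room for two experts of 100 bytes each in a 250-byte budget.
pinned = pick_pinned_experts(hits, expert_bytes=100, budget_bytes=250)
print(sorted(pinned), hit_rate(hits, pinned))
```

When routing is skewed toward a few experts, a small pinned set can cover most requests, which is the intuition behind staying close to the full-VRAM throughput while holding fewer experts on the GPU.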

6 · Jun 8, 2026, 5:46 PM

llama.cpp: Video Input Support Merged

github.com · Jun 8, 2026

PR #24269 merged: feed local video files to Qwen3-VL or Gemma 4 via llama-server and mtmd-cli. Decoding runs in an ffmpeg subprocess (ffmpeg must be installed separately). Web UI auto-enables via /chat/completions; --video flag on CLI. Lazy bitmap expands video into timestamped frames. Audio support is planned.

5 · Jun 8, 2026, 3:21 PM

llama.cpp: Gemma 4 MTP Merged into Main

github.com · Jun 7, 2026

PR #23398 merged: Gemma 4 12B and 31B (dense) get MTP in llama.cpp main. DGX Spark 31B: 6→15 tok/s avg (2.5x, up to 5x on translation). RTX 4070 Super + QAT: 140 tok/s on 12GB. ~0.58 draft accept rate. --spec-type draft-mtp --spec-draft-n-max 4. MoE (26B-A4B) less uplift; E2B/E4B not yet.
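A rough way to read the accept rate together with the draft length is the standard expected-tokens estimate for speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability, which may not match how the ~0.58 figure was measured, and it says nothing about wall-clock speedup because draft and verification costs are ignored. The function name is hypothetical.

```python
def expected_tokens_per_step(accept_rate, draft_n):
    """Expected tokens emitted per verification pass.

    Assumes i.i.d. acceptance: the pass yields the accepted draft prefix
    plus one token from the main model, i.e. sum of accept_rate**i for
    i = 0..draft_n.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(draft_n + 1)
    return (1.0 - accept_rate ** (draft_n + 1)) / (1.0 - accept_rate)


# Figures quoted in the post: ~0.58 accept rate, --spec-draft-n-max 4.
estimate = expected_tokens_per_step(0.58, 4)
print(f"{estimate:.2f} tokens per verification pass (idealized)")
```

Under these assumptions the estimate is a little over two tokens per pass, which is in the same range as the reported average uplift; the larger gains on translation would correspond to a higher effective accept rate on that workload.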

6 · Jun 7, 2026, 2:39 PM

Magenta RealTime 2: Google DeepMind's Open On-Device Streaming Music Model

huggingface.co · Jun 3, 2026

Google DeepMind open weights (Apache 2.0 / CC-BY-4.0): the only open model for real-time, continuous on-device music generation at ~200ms latency. Text + audio + MIDI control. SpectroStream 48kHz stereo codec + decoder-only LLM. Base: 2.4B, Small: 230M. github.com/magenta/magenta-realtime

8 · Jun 6, 2026, 9:32 PM

NVIDIA Nemotron 3.5 ASR: Multilingual Streaming Speech Recognition

huggingface.co · Jun 4, 2026

NVIDIA open-weight multilingual streaming ASR: 600M, 40 locales from one model. Cache-aware encoder recomputes nothing; 17x more concurrent streams vs Parakeet RNNT 1.1B at 80ms. Native punctuation + capitalization, auto language detect. 80ms to 1.12s latency. NeMo. OpenMDW-1.1.

8 · Jun 6, 2026, 7:10 PM

dots.tts: 2B Fully Continuous AR TTS from RedNote

github.com · Jun 5, 2026

2B fully continuous AR TTS from RedNote (Apache 2.0): Qwen2.5-1.5B backbone + AR flow-matching head, no discrete tokens, 48kHz. SOTA on Seed-TTS-Eval (avg WER 2.95%, SIM 79.2%). Best avg SIM (83.9) across 24 languages. Zero-shot voice cloning. Fine-tuning code included.

9 · Jun 6, 2026, 2:14 PM