Local AI

The latest and greatest in open-source AI models of all kinds.

North Mini Code 1.0: Cohere's First Coding MoE

30B total / 3B active MoE (128 experts, 8 active), 256K ctx. SWE-Bench Verified 67.6%, SWE-Bench Pro 40.2%, LiveCodeBench v6 70.3%. Beats Gemma4 26B-A4B and Devstral Small 2 on agentic coding. Two-stage SFT + RLVR on real repos. Apache 2.0. vLLM + Transformers.
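
As a rough illustration of what "128 experts, 8 active" means per token, here is a minimal top-k routing sketch. It is not taken from the model's code: the function name `route_top_k`, the softmax-over-selected-logits weighting, and the toy logits are all assumptions for illustration.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Hypothetical sketch: real MoE routers differ in details such as
    normalization order, bias terms, and load-balancing losses.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (shifted by the max for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy router output for one token: 128 experts, 8 of them active.
logits = [((i * 37) % 128) / 10 for i in range(128)]
active = route_top_k(logits, k=8)
```

Only the selected experts run for that token, which is why the active parameter count (3B) is far below the total (30B).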

Jun 9, 2026, 6:22 PM
SCAIL-2: End-to-End Character Animation Without Skeleton Maps (Z.ai)

Z.ai open-sources SCAIL-2 (Wan 2.1, 14B, Apache 2.0): end-to-end character animation with no skeleton maps or inpainting masks. Human2Any, Any2Any (animals, cartoons), cross-identity replacement, multi-character. Emergent: animal-driving, SAM3D-Body mesh zero-shot. ComfyUI day-0 (Comfy-Org/SCAIL-2).

Jun 9, 2026, 3:59 PM
Apple CoreAI: Open-Source On-Device Inference for Apple Silicon

Apple open-sources CoreAI (BSD 3-Clause) at WWDC 2026 — a capable alternative to CoreML for on-device inference. Supports Qwen3, Qwen3 MoE, Gemma 3, Mistral, FLUX.2 klein, Whisper, SAM 3. Python export → .aimodel. INT4 quant, dynamic KV cache (macOS). Requires macOS/iOS 27+.

Jun 9, 2026, 1:33 PM

Merged June 8: ggml-webgpu now extracts 4 quant values per u32 instead of 1 for k-quants. M2 Pro pp512: Q2_K 817→1991 t/s (2.44x), Q3_K 92→302 (3.27x), Q4_K 243→327 (1.34x), Q6_K 216→311 (1.44x). Tested on Qwen3.5 4B and Gemma 4 E4B.
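
The idea of pulling several values out of one 32-bit load can be sketched as below. This is a simplified illustration, not the ggml-webgpu shader: it assumes four 8-bit lanes per word, whereas real k-quant blocks pack narrower fields together with scales. The function name `unpack_four_per_word` is hypothetical.

```python
def unpack_four_per_word(words):
    """Extract all four 8-bit lanes from each 32-bit word.

    Illustrative only: one word read yields four values, instead of
    re-reading the same word once per value.
    """
    out = []
    for w in words:
        # Lanes are taken low byte first (little-endian order).
        for shift in (0, 8, 16, 24):
            out.append((w >> shift) & 0xFF)
    return out

packed = [0x04030201, 0x08070605]
values = unpack_four_per_word(packed)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Two word reads produce eight values here; extracting one value per read would need eight reads for the same output.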

Jun 9, 2026, 3:38 AM
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax

Pins the experts your traffic actually hits on GPU, swaps the cold tail async from RAM. Qwen3.6 35B-A3B: 13.3 GiB (was ~20.5); Laguna XS.2 33B-A3B: 14.6 GiB (was 18.8) — both under 16 GiB. ~100 tok/s (92% of 24GB ceiling). Self-tunes from live traffic. Apache 2.0: dflash_server <model.gguf> --spark
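
A minimal sketch of the "pin the hot experts" idea, assuming a simple frequency count over a routing log. Luce Spark's actual policy is not described here, so the names `choose_pinned_experts` and `gpu_hit_rate`, the log format, and the slot budget are hypothetical.

```python
from collections import Counter

def choose_pinned_experts(routing_log, gpu_slots):
    """Return the expert ids that were routed to most often.

    Hypothetical policy: keep the `gpu_slots` most frequent experts on
    the GPU and leave the rest (the cold tail) in host RAM.
    """
    counts = Counter(routing_log)
    return {expert for expert, _ in counts.most_common(gpu_slots)}

def gpu_hit_rate(routing_log, pinned):
    """Fraction of routed tokens whose expert is already on the GPU."""
    if not routing_log:
        return 0.0
    hits = sum(1 for expert in routing_log if expert in pinned)
    return hits / len(routing_log)

# Toy routing log: expert id chosen for each token.
log = [0, 3, 3, 7, 3, 0, 5, 3, 0, 7]
pinned = choose_pinned_experts(log, gpu_slots=2)  # {0, 3}
rate = gpu_hit_rate(log, pinned)                  # 0.7
```

When routing is skewed like this, a small pinned set already serves most requests, and only the misses pay for a transfer from RAM.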

Jun 8, 2026, 5:46 PM

PR #24269 merged: feed local video files to Qwen3-VL or Gemma 4 via llama-server and mtmd-cli. Decoding runs through an ffmpeg subprocess (ffmpeg must be installed separately). Web UI auto-enables via /chat/completions; --video flag on CLI. Lazy bitmap expands video into timestamped frames. Audio support is planned.

Jun 8, 2026, 3:21 PM
llama.cpp: Gemma 4 MTP Merged into Main

PR #23398 merged: Gemma 4 12B and 31B (dense) get MTP in llama.cpp main. DGX Spark 31B: 6→15 tok/s avg (2.5x, up to 5x on translation). RTX 4070 Super + QAT: 140 tok/s on 12GB. ~0.58 draft accept rate. --spec-type draft-mtp --spec-draft-n-max 4. MoE (26B-A4B) less uplift; E2B/E4B not yet.
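
For intuition on how an accept rate and a draft length combine, here is a back-of-the-envelope sketch. It assumes the ~0.58 figure is a per-token acceptance probability, that acceptances are independent, and that drafting stops at the first rejection. None of this is confirmed by the PR, and drafting overhead is ignored. The function name `expected_tokens_per_pass` is hypothetical.

```python
def expected_tokens_per_pass(accept_prob, draft_len):
    """Expected tokens emitted per target-model verification pass.

    Simplified model: the target always contributes one token, and the
    i-th drafted token counts only if all drafts before it were accepted,
    giving 1 + a + a^2 + ... + a^n.
    """
    return sum(accept_prob ** i for i in range(draft_len + 1))

# Values from the item above: ~0.58 accept rate, --spec-draft-n-max 4.
estimate = expected_tokens_per_pass(0.58, 4)  # about 2.22
```

Under these assumptions the estimate grows slowly with longer drafts, since the later draft positions are rarely reached.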

Jun 7, 2026, 2:39 PM
NVIDIA Nemotron 3.5 ASR: Multilingual Streaming Speech Recognition

NVIDIA open-weight multilingual streaming ASR: 600M, 40 locales from one model. Cache-aware encoder recomputes nothing; 17x more concurrent streams vs Parakeet RNNT 1.1B at 80ms. Native punctuation + capitalization, auto language detect. 80ms to 1.12s latency. NeMo. OpenMDW-1.1.

Jun 6, 2026, 7:10 PM
dots.tts: 2B Fully Continuous AR TTS from RedNote

2B fully continuous AR TTS from RedNote (Apache 2.0): Qwen2.5-1.5B backbone + AR flow-matching head, no discrete tokens, 48kHz. SOTA on Seed-TTS-Eval (avg WER 2.95%, SIM 79.2%). Best avg SIM (83.9) across 24 languages. Zero-shot voice cloning. Fine-tuning code included.

Jun 6, 2026, 2:14 PM