this post was submitted on 16 Mar 2026

Key architectural details

Mixture of Experts (MoE): 128 experts, with 4 active per token, enabling efficient scaling and specialization.

119B total parameters, with 6B active parameters per token (8B including embedding and output layers).

256k context window, supporting long-form interactions and document analysis.

Configurable reasoning effort: Toggle between fast, low-latency responses and deep, reasoning-intensive outputs.

Native multimodality: Accepts both text and image inputs, unlocking use cases from document parsing to visual analysis.
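
To make the "128 experts, 4 active per token" line concrete, here is a minimal sketch of top-k expert routing in plain Python. The post does not describe the actual router, so the gating scheme below (softmax over the selected experts' logits, one common choice) and the names `route` and `softmax` are assumptions for illustration only.

```python
import math
import random

NUM_EXPERTS = 128  # from the post
TOP_K = 4          # from the post


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and normalize their weights.

    Assumption: weights are a softmax over the selected logits only.
    The real model's gating function may differ.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


# Toy router scores for a single token (random stand-ins, not real data).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route(logits)

active_expert_fraction = TOP_K / NUM_EXPERTS  # 4 / 128
active_param_fraction = 6 / 119               # 6B active of 119B total

print(chosen)
print(f"experts used per token: {active_expert_fraction:.1%}")
print(f"parameters used per token: {active_param_fraction:.1%}")
```

Only about 3% of the experts and roughly 5% of the weights do work for any given token, which is why a 119B model can have the per-token compute cost of a much smaller one while still needing memory for all 119B parameters.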

top 8 comments
[–] panda_abyss@lemmy.ca 0 points 2 months ago* (last edited 2 months ago) (2 children)

Looks a little underwhelming with Qwen3.5 and Haiku beating it.

However, with 6B active parameters and training that favors short responses, this could be useful as a local stand-in for Qwen. Overall I've found Mistral models better to talk with, though the Devstral small models were kinda janky last time I used them (stuff like infinite loops and getting confused by less common programming languages). Qwen models are by far the most verbose out of the box, and happily burn a ton of tokens on useless thought. It's an over-emphasis on reinforcement learning.

Also weird that they use GPT 4.1 as the judge model. That's a year-old model, not nearly SOTA, and IIRC it underwhelmed on most metrics, so it feels like a poor choice of judge.

Edit: we have a GPT5 -- some of the charts are labelled wrong

Not mentioned in the blog post, but on HF: they created a small speculative decoding model to go with it -- https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-eagle

That should accelerate inference speeds on some setups.
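
For anyone unfamiliar with why a tiny draft model speeds things up, here is a rough sketch of greedy speculative decoding. It is a toy: `target_next` and `draft_next` are hypothetical stand-in functions, not the Mistral or EAGLE APIs, and as far as I understand it EAGLE-style drafters work from the target model's hidden features rather than acting as a fully separate language model.

```python
def speculative_step(target_next, draft_next, prefix, num_draft=4):
    """One round of greedy speculative decoding.

    target_next / draft_next map a token list to the next token.
    Returns the tokens accepted this round (always at least one).
    """
    # 1. The cheap draft model proposes num_draft tokens one by one.
    proposed = []
    ctx = list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal. A real system does this
    #    in a single batched forward pass; the loop here is for clarity.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft matched, so the target contributes one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(target_next, draft_next, prefix, max_new):
    """Repeat speculative steps until max_new tokens have been added."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(target_next, draft_next, out))
    return out[:len(prefix) + max_new]


# Toy "models": the target counts upward mod 10; the draft agrees
# except that it guesses wrong whenever the last token is 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(target_next, draft_next, [0]))
print(generate(target_next, draft_next, [0], 12))
```

The output is identical to what the target model would produce on its own; the gain comes from the target verifying several tokens per pass instead of generating one at a time, so the speedup depends on how often the draft guesses right.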

[–] MalReynolds@slrpnk.net 0 points 2 months ago

For certain values of small...

That said, Mistral is strong in world knowledge, and something this big is likely even more so. With only 6B parameters active per token, the experts can live in a reasonable amount of system RAM (Q4_K_M is ~72 GB, so it'd likely run in 64 GB system RAM plus 24 GB VRAM) and run at reasonable if not spectacular speeds. Speculative decoding could help too (but that eagle is 392MB, which is scary tiny).
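
The ~72 GB figure can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch: the 4.85 bits per weight is an assumed average for Q4_K_M (the real figure varies with the tensor mix), and the function name is made up for illustration.

```python
TOTAL_PARAMS = 119e9     # from the post
BITS_PER_WEIGHT = 4.85   # assumption: approximate average for Q4_K_M


def quantized_size_gb(params, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes), ignoring file metadata."""
    return params * bits_per_weight / 8 / 1e9


size_gb = quantized_size_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)

ram_gb, vram_gb = 64, 24  # the setup mentioned above
headroom_gb = ram_gb + vram_gb - size_gb

print(f"estimated weights: {size_gb:.1f} GB")
print(f"left over for KV cache, OS, etc.: {headroom_gb:.1f} GB")
```

The leftover has to cover the KV cache, the OS, and everything else running on the machine, so long contexts would eat into it quickly.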

[–] Eyekaytee@aussie.zone 0 points 2 months ago

"Qwen models are by far the most verbose out of the box, and happily burn a ton of tokens on useless thought. It’s an over-emphasis on reinforcement learning."

I now have a system prompt just to say "please stop talking, Qwen" 😭

Even a hello can result in 3 paragraphs by default

[–] fubarx@lemmy.world 0 points 2 months ago (2 children)

At this point, these small models should add explicit minimum hardware requirements just so they can stand out. STM32 w xxGB of PSRAM. Android phone w this much RAM, how many TOPS, and minimum OS version. ESP32-S3 or S4? That sort of thing.

If you just say 'small,' you get lost in the noise.

[–] Eyekaytee@aussie.zone 0 points 2 months ago* (last edited 2 months ago)

tbh that's the main thing I took away from this: since when did small equal 119B?!

Does that mean they've got large models lined up approaching 1tb?

[–] SuspciousCarrot78@lemmy.world 0 points 2 months ago* (last edited 2 months ago) (1 children)

Also: when the fuck did a 120B parameter model become "small"? I feel like I'm being gaslit here LOL.

Under 20B? Legit small.

EDIT to add: I have been thinking of running TTS on an ESP32... but that madness is competing side by side with wiring this up to my local LLM. https://github.com/poboisvert/GPTARS_Interstellar

[–] fubarx@lemmy.world 0 points 2 months ago

We are being gaslit. From the article:

Recommended setup: 4x NVIDIA HGX H100, 4x NVIDIA HGX H200, or 2x NVIDIA DGX B200 for optimal performance.

No big. Your typical homelab setup. 🙄

Also: https://github.com/jahrulnr/esp32-picoTTS

[–] TheFrirish@tarte.nuage-libre.fr 0 points 1 month ago

I have a 7900XTX and a Ryzen 9 7950X3D with 96GB RAM, which I humbly believe is already way above 95% of people's setups.

I don't think I can run this. Not with ollama, that's for sure.