Looks a little underwhelming with Qwen3.5 and Haiku beating it.
However, 6B active parameters and being trained to return short results could make this useful as a Qwen alternative for running locally. I've overall found Mistral models to be better to talk with, but the devstral small models were also kinda janky last I used them (stuff like infinite loops and getting confused by less common programming languages). Qwen models are by far the most verbose out of the box, and happily burn a ton of tokens on useless thought. I think that comes from an over-emphasis on reinforcement learning.
Also weird that they use GPT 4.1 as the judge model. That's a year-old model, not nearly SOTA, and IIRC it underwhelmed on most metrics. So it feels like a poor choice of judge.
Edit: we have a GPT5 -- some of the charts are labelled wrong
Not mentioned in the blog post, but on HF: they created a small speculative decoding model to go with it -- https://huggingface.co/mistralai/Mistral-Small-4-119B-2603-eagle
That should accelerate inference speeds on some setups.