this post was submitted on 22 Apr 2026

LocalLLaMA


Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks on community members, i.e. no name-calling, no generalizing entire groups of people who make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain/mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. no statements such as "LLMs are basically just simple text predictors like what your phone keyboard autocorrect uses, and they're still using the same algorithms as <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago

Recently made a post about the 35B MoE. Now the dense 27B variant has been released.


[–] SuspciousCarrot78@lemmy.world 0 points 5 days ago

Rules of thumb

  • For a 27B: if you want it to run entirely on your GPU, you'll need a quantisation that fits, plus room for the KV cache. So (for example), if your model GGUF is 10GB, I'd leave another 2GB for KV cache, meaning you'd need 12GB to run it with a reasonable context length (see the rough sketch after this list). I haven't looked at the quants for Qwen3.6 27B yet...I imagine the "good baseline" quant is what...12? 15GB?

Having said that, remember that 1) you can split the model between CPU and GPU, and 2) you can drop to a lower quant. So if you have "just" 12GB, a lower quant (I dunno...IQ3_XS?) might get you over the line.

  • You can run it however you want :) For someone brand new, the best all-in-one is Ollama or Jan.ai.

  • Yes. Jan.ai has MCP tooling (I imagine Ollama does as well), so you can follow the how-tos to set that up. Read their docs? What do you need to do with MCP?

  • What you should know: you'll reach a point where "more parameters = better performance" needs to be balanced against cost and smarter tooling. Don't be tempted to drop $$$ on something thinking you can just throw money at the problem to make it go away.
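
To make the VRAM arithmetic from the first bullet concrete, here's a back-of-envelope check in Python. The numbers are just the example figures from above plus a hypothetical overhead buffer; actual KV cache size depends on context length and model architecture, so treat this as a rough sanity check, not a measurement:

```python
# Rough VRAM estimate for running a GGUF fully on GPU.
# Figures are illustrative, not measured.
gguf_size_gb = 10.0   # size of the quantised model file
kv_cache_gb = 2.0     # headroom for KV cache at a "reasonable" context length
overhead_gb = 0.5     # hypothetical buffer for compute scratch / driver overhead

total_gb = gguf_size_gb + kv_cache_gb + overhead_gb
print(f"You'd want roughly {total_gb:.1f} GB of VRAM")  # ~12.5 GB for this example
```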

[–] venusaur@lemmy.world 0 points 4 days ago

Thanks! I’m experimenting on my laptop with 16GB RAM and no GPU/VRAM. I installed llama.cpp and am testing Gemma 7B Q5, but it’s not answering prompts correctly. It’s analyzing the prompt instead of answering the question, or it gives me a poem haha. Trying to figure it out.

Any lightweight model you recommend for just chat experimenting for now? Can they connect to the internet?

[–] SuspciousCarrot78@lemmy.world 0 points 4 days ago* (last edited 4 days ago)

I'll never not recommend Qwen3-4B 2507 Instruct...because despite being ancient in AI terms (so, 8 months lol), it's solid. Notably, the base models in Jan are all Qwen3-4B variants.
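
By the way, the behaviour you're describing (analysing the prompt, random poems) is typically what you get when an instruct model is run as raw text completion without its chat template, or when you're using a base (non-instruct) model. If you're driving llama.cpp from Python, here's a minimal sketch using the llama-cpp-python bindings; the GGUF filename is hypothetical, so point it at whatever instruct-tuned model you actually downloaded:

```python
from llama_cpp import Llama

# Hypothetical filename -- use whichever instruct GGUF you downloaded.
llm = Llama(model_path="Qwen3-4B-Instruct-2507-Q4_K_M.gguf", n_ctx=4096)

# create_chat_completion applies the model's chat template for you,
# unlike raw completion, which just continues the text.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain KV cache in two sentences."}]
)
print(out["choices"][0]["message"]["content"])
```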

Most models can search the web if they have access to a web-search tool. Rough sketch of the pattern below.
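
Here's what that looks like with the ollama Python client, assuming a model that supports tool calling. The model tag and the web_search helper are placeholders, and you should check the Ollama docs for the exact tool-result message shape:

```python
import ollama  # pip install ollama

def web_search(query: str) -> str:
    """Placeholder -- wire this up to whatever search API you actually use."""
    return f"(stub) search results for: {query}"

response = ollama.chat(
    model="qwen3:4b",  # hypothetical tag; use whatever model you've pulled
    messages=[{"role": "user", "content": "What's new in llama.cpp this week?"}],
    tools=[web_search],  # recent clients build the tool schema from the function
)

# If the model asked to use the tool, run it. To get a final answer, feed the
# result back as a role="tool" message -- see the Ollama docs for the exact shape.
for call in response.message.tool_calls or []:
    print(web_search(**call.function.arguments))
```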