this post was submitted on 11 Apr 2026

In search for a new self-hosted LLM (lemmy.ml)

submitted 3 months ago* (last edited 3 months ago) by tanka@lemmy.ml to c/selfhosted@lemmy.world

Hey :) For a while now I've been using gpt-oss-20b on my home lab for lightweight coding tasks and some automation. I'm not really up to date with the current self-hosted LLMs, and since the model I'm using was released at the beginning of August 2025 (from an LLM development perspective, that feels like an eternity to me), I wanted to tap the collective wisdom of Lemmy and maybe replace my model with something better.

Edit:

Specs:

GPU: RTX 3060 (12GB vRAM)

RAM: 64 GB

gpt-oss-20b does not fit into the vRAM completely, but it is partially offloaded and is reasonably fast (enough for me).

[–] Jozzo@lemmy.world 35 points 3 months ago (1 children)

I find Qwen3.5 is the best at tool calling and agent use; otherwise Gemma4 is a very solid all-rounder and should be the first one you try. Tbh gpt-oss is still good to this day, are you running into any problems w/ it?

[–] tanka@lemmy.ml 7 points 3 months ago (1 children)

No problems per se. I just realized I hadn't checked for an update in a long time.

[–] jacksilver@lemmy.world 2 points 3 months ago

You're probably aware, but updating the model periodically is a good idea just because things do change over time.

A model from two years ago was trained on data that is at least two years old, meaning any changes since then in technology, code, or world events won't be reflected in the model.

[–] ejs@piefed.social 21 points 3 months ago

I suggest looking at llm arena leaderboards filtered by open weight models. They offer very complete and statistically detailed benchmarks, and are usually quite up to date when new models come out. The new Gemma that just came out might be the best for 1x GPU, and if you have a bunch of VRAM, check out the larger Chinese models.

[–] Gumus@lemmy.dbzer0.com 13 points 3 months ago

I'd say Qwen 3.5 and Gemma 4 beat GPT OSS in every aspect.

[–] iceberg314@slrpnk.net 11 points 3 months ago (1 children)

I also recommend gemma4 or qwen3.5. Both super solid in my experience for how lightweight they are

[–] NoFun4You@lemmy.world 2 points 3 months ago (1 children)

Still can't get my gemma to give me complete unbuggy components

[–] iceberg314@slrpnk.net 1 points 3 months ago

I guess I have been using gemma4 more for role playing games. Qwen3.5 seems to be the better coder, actually.

[–] tal@lemmy.today 9 points 3 months ago

I'm not on there, but you might have more luck in !localllama@sh.itjust.works

You might also want to list the hardware that you plan to use, since that'll constrain what you can reasonably run.

[–] cron@feddit.org 6 points 3 months ago

The latest open weights model from Google might be a good fit for you. The 26B model works pretty well on my machine, though the performance isn't great (6 tokens per second, CPU only).

[–] zorflieg@lemmy.world 5 points 3 months ago* (last edited 3 months ago)

Gemma4 e4b quant8 will fit in 12gb and is good

[–] jaschen306@sh.itjust.works 4 points 3 months ago

I'm running gemma4 26b MoE for most of my agent calls. I use glm5:cloud for my development agent because 26b struggles when the context window gets too big.

[–] carzian@lemmy.ml 3 points 3 months ago

I'm in the same boat. You'll get better responses if you post your machine specs. I

[–] SuspciousCarrot78@lemmy.world 3 points 3 months ago* (last edited 3 months ago)

What sort of coding and what sort of automation tasks? The latter is an easier ticket to fill than the former, though I might have an idea for you on that end if coding is a must

[–] DieserTypMatthias@lemmy.ml 3 points 3 months ago

Qwen is pretty good. Also try LFM models.

[–] nutbutter@discuss.tchncs.de 2 points 3 months ago

Have you tried the new gemma4 models? The e4b fits in the 12gb memory and is pretty good. Or you can use 31b too, if you're okay with offloading to CPU.

[–] sompreno@lemmy.zip 2 points 3 months ago (1 children)

What are your computer specs?

[–] tanka@lemmy.ml 2 points 3 months ago (1 children)

I did just update my post with the specs. Maybe it takes a while to federate?

[–] sompreno@lemmy.zip 1 points 3 months ago

I must not have refreshed, ignore my comment.

[–] Kirk@startrek.website 2 points 3 months ago (3 children)

Just curious, what does "some automation" entail? I thought LLMs could only work with text, like summarize documents and that sort of thing.

[–] Jozzo@lemmy.world 6 points 3 months ago (1 children)

It's done by software using an LLM, not just a raw LLM. They do only work with text, but you can get the model to output the text "get_weather(mylocation)", and instead of just showing that directly to the user, the software running on top of the LLM runs a "get_weather" function that calls some weather API. The result of that function is then output to the user.

Any time you see an "AI" taking "actions", this is what happens in the background for every action.
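A minimal sketch of that loop in Python, in case it helps. Everything here is made up for illustration: `get_weather` just returns canned text instead of hitting a real API, and the `name(argument)` text format is a toy. Real frontends usually have the model emit a JSON tool call instead, but the idea is the same.

```python
import re

def get_weather(location: str) -> str:
    # Hypothetical stand-in for a real weather API call; returns canned text.
    return f"Sunny, 21 C in {location}"

# Registry of functions the wrapper software is willing to run.
TOOLS = {"get_weather": get_weather}

# Matches text of the form name(argument), e.g. get_weather(mylocation)
CALL_PATTERN = re.compile(r"^(\w+)\((.*)\)$")

def handle_model_output(text: str) -> str:
    """Run a tool if the model's text looks like a known call, else pass the text through."""
    match = CALL_PATTERN.match(text.strip())
    if match and match.group(1) in TOOLS:
        name, arg = match.groups()
        return TOOLS[name](arg.strip().strip("\"'"))
    return text

print(handle_model_output("get_weather(mylocation)"))
print(handle_model_output("Hello!"))
```

The model never executes anything itself. The wrapper decides whether the text is a call, and only names in `TOOLS` are ever run.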

[–] SuspciousCarrot78@lemmy.world 2 points 3 months ago* (last edited 3 months ago)

^ exactly that.

Also, I suspect that's the reason for Claude famously telling everyone to "go to bed" all the time. That bastich cannot reliably check the time and date in the background...it wings it based on the start of the conversation. Bitch I type a lot and fast....stop telling me to go to bed at 9pm.

I expect it will get patched soon.

An endearing quirk....but it exposes the wiring if you know. Still, doesn't make the trick any less impressive when it hits.

[–] SuspciousCarrot78@lemmy.world 5 points 3 months ago* (last edited 3 months ago) (1 children)

Some examples

Tell Home Assistant to adjust lights/thermostat/locks in plain English based on certain conditions being met
Ask Jellyfin/Plex to play something based on a vague description like "something like Interstellar but lighter"
Morning briefing that pulls calendar, weather, emails and traffic into a 60-second summary automatically. Or get it to read it to you out loud while you shave.
Schedule the robot mower or vacuum based on weather forecast via API
Fetch information for you off the net at set intervals and update you (email, SMS etc)
CCTV uses (classification etc)
Batch rename files, sort downloads, resize images - stuff you'd normally write a one-off script for
Parse a booking reply email, confirm the time, add it to your calendar, set reminders
Tag and name your own pictures based on metadata

That's probably just the basics. People have some clever uses for these things. It's not just "summarize this document".

[–] Kirk@startrek.website 1 points 3 months ago* (last edited 3 months ago) (1 children)

That's cool, it just... does those things? How does it connect to those apps? I can't even get Gemini to set a reminder and that's on a Google device.

[–] SuspciousCarrot78@lemmy.world 3 points 3 months ago* (last edited 3 months ago) (1 children)

Good question. Short answer: not quite.

The LLM is the reasoning layer. It reads your input, figures out intent, and outputs structured instructions. There's a standard way of doing that (MCP).

Something else (Home Assistant, n8n, a Python script, whatever you've set up) actually executes the actions. The LLM interacts with those things.

So for the calendar example: your email client triggers on a booking reply, passes the text to the LLM, the LLM extracts the date/time/location and outputs something structured, and then your automation tool creates the calendar event and sets the reminder. Once it's set up, it looks and feels like one thing, because you interact with it via the LLM (or even better - you vocally tell the LLM. Yes, JARVIS).

So the LLM never "talks to" Google Calendar directly, it just does the bit that's hard to do with traditional code, which is reading messy natural language and making sense of it.

Same for Home Assistant. The LLM parses "turn the lights down a bit, it's movie time, play something sci-fi" into a device + action + value, and HA does the actual switching.
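Rough sketch of that split in Python, if it helps to see it. This is not Home Assistant's actual API: the entity names, the action names and the JSON shape are all assumptions I made up, and `llm_output` is hard-coded where a real model reply would go.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    device: str
    action: str
    value: Optional[int] = None

def parse_commands(llm_output: str) -> list:
    """Turn the model's JSON text into Command objects; raises if the JSON is malformed."""
    return [Command(**item) for item in json.loads(llm_output)]

def execute(cmd: Command) -> str:
    # A real setup would call the automation tool here; this only describes the action.
    suffix = "" if cmd.value is None else f" -> {cmd.value}"
    return f"{cmd.device}: {cmd.action}{suffix}"

# Hypothetical model reply for "turn the lights down a bit, it's movie time"
llm_output = """[
  {"device": "light.living_room", "action": "set_brightness", "value": 30},
  {"device": "media_player.tv", "action": "play_genre"}
]"""

for cmd in parse_commands(llm_output):
    print(execute(cmd))
```

The messy-language part is the model's job; everything after `json.loads` is plain old code you can test and lock down.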

The secret sauce that makes this work is MCP (Model Context Protocol) - basically a standardised way for LLMs to talk to tools and services.

Instead of custom glue code for every integration, you wire up an MCP server once and the model knows how to use it.

Growing library of them now: filesystems, calendars, browsers, databases, smart home etc.

Anthropic open-sourced the spec, and most major local LLM frontends support it.
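For the curious, here is roughly what goes over the wire. MCP messages are JSON-RPC 2.0, and `tools/list` and `tools/call` are the methods for discovering and invoking tools. The `create_event` tool and its arguments below are invented for the example; a real server advertises its own tool names and schemas, and a real client also does an initialization handshake first, which this sketch skips.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so replies can be matched up

def jsonrpc_request(method: str, params: dict = None) -> dict:
    """Build a JSON-RPC 2.0 request message as a plain dict."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg

# Ask the server which tools it offers.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name. "create_event" and its arguments are hypothetical.
call = jsonrpc_request(
    "tools/call",
    {
        "name": "create_event",
        "arguments": {"title": "Dentist", "start": "2026-04-20T09:30"},
    },
)

print(json.dumps(list_tools))
print(json.dumps(call, indent=2))
```

The frontend builds these messages from whatever tool call the model emits, so the model itself only ever sees tool names and argument schemas.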

Think of it like hiring a translator who can manage your crew, rather than hiring someone who speaks every language and also has keys to every building and is also a plumber/electrician/contractor/interior designer, if that makes sense.

TL;DR: once you set up the stack, then the cool automation stuff can happen. Not a big ask, just a bit fiddly, like learning to program your VCR.

Super surprised Google's AI doesn't have the stack / harness inbuilt tho. They could afford to do a lot of the heavy lifting invisibly. I bet they actually do and it's just ... shit. Or a paid extra lol.

[–] Kirk@startrek.website 2 points 3 months ago (1 children)

That was actually super helpful, thank you.

[–] SuspciousCarrot78@lemmy.world 1 points 3 months ago* (last edited 3 months ago)

You're welcome :) This video discusses / shows some automation ideas, using a prebuilt kit you can grab

https://www.youtube.com/watch?v=WrreIi8LCiw

I have no affiliation with them (other than I thought of the exact same idea he did, about 12 months after him, lol). You can see it in action from about 3 mins in.

It's not the most impressive stuff possible, but the side-by-side screen gives you an idea of what's under the hood.

[–] a1studmuffin@aussie.zone 3 points 3 months ago (1 children)

These days they can also chain together tools, keep a working memory etc. Look at Claude Code if you're curious. It's come very far very quickly in the last 12 months.

[–] Kirk@startrek.website 1 points 3 months ago

OP said coding AND "some automation", what is being automated?

[–] Evotech@lemmy.world 2 points 3 months ago* (last edited 3 months ago)

I’d use some Chinese model. Qwen3.5 Claude 4.6 distilled abliterated is what I use.

[–] theunknownmuncher@lemmy.world 1 points 3 months ago

How much VRAM?