robber

joined 3 years ago
[–] robber@lemmy.ml 5 points 1 month ago

Global sustainability rules???

[–] robber@lemmy.ml 0 points 2 months ago* (last edited 2 months ago) (1 children)

I don't follow the discussions on this topic very closely, but as I understand it, there are different ways to achieve the goal, all of which impact quality to some extent. Heretic is discussed as one of the SOTA methods. The README posted above states the following, so it seems that Heretic is some sort of next-gen abliteration.

It combines an advanced implementation of directional ablation, also known as "abliteration" (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.
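
To illustrate the optimizer part: this is not Heretic's actual code, just a minimal sketch of what a TPE-based parameter search with Optuna looks like in general (the objective and parameter names here are made up):

```python
import optuna

# Hypothetical objective: Heretic's real search scores refusals vs. damage
# to the model; here we just minimize a dummy function of two fake parameters.
def objective(trial):
    weight = trial.suggest_float("ablation_weight", 0.0, 2.0)
    layer_frac = trial.suggest_float("layer_fraction", 0.1, 1.0)
    return (weight - 1.0) ** 2 + (layer_frac - 0.5) ** 2

# TPE (Tree-structured Parzen Estimator) is Optuna's default sampler.
study = optuna.create_study(sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
print(study.best_params)
```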

 

The wait is over, most GGUFs are already up. Nice to see there are models for many different hardware configurations.

[–] robber@lemmy.ml 0 points 2 months ago* (last edited 2 months ago)

Yeah I enjoy it as well. Just in case you missed it - a fix was merged into llama.cpp two days ago which is said to improve quality.

Edit: I stand corrected - the fix for the issue you're experiencing has not yet been merged.

 

It's out!

[–] robber@lemmy.ml 4 points 5 months ago

Exactly this. Since it does not seem to be federated, you're still forced to give your data to a third party you can't choose. And this makes the open source aspect a rather marginal benefit, at least for the privacy-concerned end user. Still, I appreciate the effort.

[–] robber@lemmy.ml 16 points 5 months ago (7 children)

I haven't tried it, but there is one: https://github.com/Alovoa/alovoa

[–] robber@lemmy.ml 10 points 6 months ago (2 children)

Given that Google generated more than 250 billion U.S. dollars in ad revenue in 2024, I'd say they must be pretty effective.

Source

[–] robber@lemmy.ml 0 points 6 months ago (1 children)

I see. When I run the inference engine containerized, will the container be able to run its own version of CUDA or use the host's version?
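
For anyone else wondering, this is roughly how I'd check from inside a container; it assumes a PyTorch image and the NVIDIA Container Toolkit on the host:

```python
import subprocess
import torch

# The CUDA runtime/toolkit version is whatever the container image ships.
print("container CUDA runtime:", torch.version.cuda)

# The kernel driver is always the host's, passed through by the toolkit.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("host driver:", driver)
```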

[–] robber@lemmy.ml 0 points 6 months ago

Thank you for taking the time to respond.

I've used vLLM for hosting a smaller model which could fit in two of my GPUs; it was very performant, especially for multiple requests at the same time. The major drawback for my setup was that it only supports tensor parallelism for 2, 4, 8, etc. GPUs, and data parallelism slowed inference down considerably, at least for my cards. exllamav3 is the only engine I'm aware of which supports 3-way TP.
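
For reference, a minimal sketch of what that looked like (the model name is just an example; vLLM's Python API takes the TP degree directly):

```python
from vllm import LLM, SamplingParams

# The model's attention head count must be divisible by tensor_parallel_size,
# which in practice limits you to 2, 4, 8, ... GPUs -- 3-way TP is rejected.
llm = LLM(model="Qwen/Qwen3-32B-AWQ", tensor_parallel_size=2)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize tensor parallelism in one line."], params)
print(outputs[0].outputs[0].text)
```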

But I'm fully with you in that vLLM seems to be the most recommended and battle-tested solution.

I might take a look at how I can safely upgrade the driver, and once I can afford a fourth card, switch back to vLLM.

[–] robber@lemmy.ml 0 points 6 months ago (2 children)

I use the proprietary ones from Nvidia; they're at 535 on oldstable IIRC, but there are a lot newer ones.

I use 3x RTX 2000E Ada. It's a rather new, quite power-efficient GPU manufactured by PNY.

As inference engine I use exllamav3 with tabbyAPI. I like it very much because it supports 3-way tensor parallelism, making it a lot faster for me than llama.cpp.

 

Hey everyone! I was just skimming through some inference benchmarks from other people and noticed the driver version is usually mentioned. It made me wonder how relevant it is. My prod server runs Debian 12, so the packaged Nvidia drivers are rather old, but I'd prefer not to mess with the drivers if it won't bring a benefit. Do any of you have experience with this, or did you do some testing?

[–] robber@lemmy.ml 6 points 6 months ago

That brian typo really gave me a chuckle. Hope you found the movie you were looking for.

[–] robber@lemmy.ml 2 points 6 months ago (2 children)

Wikipedia states the UI layer is proprietary, is that true?

1
submitted 7 months ago* (last edited 7 months ago) by robber@lemmy.ml to c/localllama@sh.itjust.works
 

Title says it; it's been 10 days already, but I didn't catch the release. This might be huge for those of us running on multiple GPUs. At least for Gemma3, I was able to double inference speed by using vLLM with tensor parallelism vs. ollama's homegrown parallelism. Support in ExLlamaV3 could additionally allow pairing TP with lower-bit quants. Haven't tested this yet, but I'm looking forward to it very much.

 

Tencent recently released a new MoE model with ~80b parameters, 13b of which are active at inference. Seems very promising for people with access to 64 gigs of VRAM.
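
Rough napkin math on why that could fit (assuming a ~4-bit quant; that's my assumption, not Tencent's spec):

```python
# Back-of-the-envelope VRAM estimate for the weights alone.
total_params = 80e9
bits_per_weight = 4.5  # typical 4-bit quant incl. scales/zero-points
weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB for weights")  # ~45 GB, leaving headroom for KV cache in 64 GB
```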

 

Hey fellow llama enthusiasts! Great to see that not all of lemmy is AI sceptical.

I'm in the process of upgrading my server with a bunch of GPUs. I'm really excited about the new Mistral / Magistral Small 3.2 models and would love to serve them for me and a couple of friends. My research led me to vLLM, with which I was able to double inference speed compared to ollama, at least for qwen3-32b-awq.

Now sadly, the most common quantization methods (GGUF, EXL, BNB) are either not fully supported in vLLM (GGUF), not supported at all (EXL), or don't support multi-GPU inference through tensor parallelism (BNB). And especially for new models, it's hard to find pre-quantized models in different, more broadly supported formats (AWQ, GPTQ).

Do any of you face a similar problem? Do you quantize models yourself? Are there any up-to-date guides you would recommend? Or did I completely overlook another, obvious solution?
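
For context, this is roughly what self-quantizing would look like with AutoAWQ; the model path and quant config are placeholders, not a tested recipe:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-3.2-24B-Instruct-2506"  # placeholder
quant_path = "mistral-small-3.2-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights with group size 128 -- the common AWQ defaults.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```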

It feels like whatever I researched yesterday is already outdated today, since the landscape is evolving so rapidly.

Anyways, thank you for reading and sharing your thoughts or experience if you feel like it.

 

Text: Allows you to determine whether to limit CPUID maximum value. Set this to enabled for legacy operating systems such as Linux or Unix.

Found this in the BIOS of a Gigabyte Z97X-UD3H mobo.

 

Hi fellow homelabbers! I hope your day / night is going great.

Just stumbled across this self-hosted Cloudflare Tunnel alternative called Pangolin.

  • Does anyone use it for exposing their homelab? It looks awesome, but I've never heard of it before.

  • Should I be reluctant since it's developed by a US-based company? I mean security-wise. (I'll remove this question if it's too political.)

  • Does anyone know of alternative tools, stacks, or pieces of software that achieve the same without relying on Cloudflare?

Your insights are highly appreciated!

 

I've been looking into self-hosting LLMs or Stable Diffusion models using something like LocalAI and/or Ollama and LibreChat.

Some questions to get a nice discussion going:

  • Any of you have experience with this?
  • What are your motivations?
  • What are you using in terms of hardware?
  • Considerations regarding energy efficiency and associated costs?
  • What about renting a GPU? Privacy implications?
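
To kick things off with something concrete, here's a minimal sketch of querying a locally running Ollama instance (assuming the default port and a model you've already pulled):

```python
import requests

# Single non-streaming completion against Ollama's local HTTP API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why self-host LLMs?", "stream": False},
)
print(resp.json()["response"])
```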