LocalLLaMA

Anyone's using Intel Arc B70 Pro? (lemmy.dbzer0.com)

submitted 1 month ago by pound_heap@lemmy.dbzer0.com to c/localllama@sh.itjust.works

32 GB VRAM for less than $1k sounds like a steal these days, and I'm sure it's not getting cheaper any time soon.

Does anyone here use this GPU? Or any recent Arc Pros? I basically want someone to talk me out of driving to the nearest place that has it in stock and getting $1k poorer.

[–] afk_strats@lemmy.world 0 points 1 month ago* (last edited 1 month ago) (2 children)

This is something I learned the hard way.

Consumer hardware is limited by multiple factors when it comes to PCIe connectivity.

Physical layout: how many slots you have to plug into, their size, and their configuration.
Supported lanes from the CPU.
Chipset (motherboard) limitations.

Your graphics card might be a 16-lane card (referred to as "x16"), but sometimes not all of those lanes are used. The aforementioned 5060 Ti, I believe, only uses x8. Some devices, like graphics cards, can use a physically smaller slot with an adapter at a loss in performance (a few frames in gaming).
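
To check what a card actually negotiated, Linux exposes the link width in sysfs. Here is a minimal sketch, assuming a Linux host; `read_link_info` and `downgraded` are names made up for this example. Note that `max_link_width` reports what the device itself supports, not what the slot is wired for.

```python
from pathlib import Path


def read_link_info(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: (current_width, max_width)} for devices exposing both."""
    results = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return results  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        try:
            current = int((dev / "current_link_width").read_text().strip())
            maximum = int((dev / "max_link_width").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable for this device
        results[dev.name] = (current, maximum)
    return results


def downgraded(links):
    """Keep only devices whose negotiated width is below the device's maximum."""
    return {addr: w for addr, w in links.items() if 0 < w[0] < w[1]}


if __name__ == "__main__":
    for addr, (cur, mx) in read_link_info().items():
        note = "  <-- below the device's max width" if 0 < cur < mx else ""
        print(f"{addr}: x{cur} of x{mx}{note}")
```

An x16 card reporting x4 here usually means the slot, the adapter, or the bifurcation setting is the limit, not the card.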

Similarly, your motherboard might have an x16 slot at the top and another x16 at the bottom. That second slot might only function as x8 or even x4. Does this matter? Sort of. Inter-card communication, aka peer-to-peer communication, can affect performance, and that can compound with multiple cards.
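
To put rough numbers on what x16 versus x8 versus x4 means, here is a small sketch that computes theoretical one-direction bandwidth from the per-lane transfer rate and the 128b/130b line encoding used by PCIe 3.0 through 5.0. The function name is made up for this example, and real-world throughput will be lower because of protocol overhead.

```python
# Per-lane raw transfer rate (GT/s) and line-encoding efficiency by PCIe generation.
PCIE_GENERATIONS = {
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def link_bandwidth_gb_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return gt_per_s * efficiency / 8 * lanes  # divide by 8 bits per byte


if __name__ == "__main__":
    for gen in (3, 4, 5):
        for lanes in (4, 8, 16):
            bw = link_bandwidth_gb_s(gen, lanes)
            print(f"PCIe {gen}.0 x{lanes}: {bw:.1f} GB/s")
```

Each generation doubles the per-lane rate, so a PCIe 5.0 x4 link has the same theoretical bandwidth as a PCIe 3.0 x16 link. The slot's generation matters as much as its lane count.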

Even worse, some motherboards have all sorts of connectivity but come with limitations, such as only 2 of the bottom 4 slots (PCIe and M.2) working at a time. ASK ME HOW I KNOW.

Your CPU controls PCIe. It has a hard cap on how many PCIe devices it can handle and at what speed. AMD tends to be better here.

Enterprise gear suffers from none of this bs. Enterprise CPUs have a ton of PCIe lanes and enterprise motherboards usually match the physical size of their PCIe slots to their capacity and support full bifurcation*

PCIe lanes are also used by, and accessible through, M.2, MCIO, and OCuLink, to name a few. That means you can connect a graphics card to any one of those if you can figure out the wires and power**

** Bonus: bifurcation and how my $200 consumer motherboard runs 6 graphics cards.

Bifurcation is a motherboard feature that lets you split PCIe capacity, so an x16 slot can support two x8 devices. My motherboard lets me do this on just the main slot and in a strange x8x4x4 configuration. I have an MCIO adapter (google it) which plugs into that PCIe slot and gives me 3 PCIe slots at those corresponding speeds.

It also has 2 M.2 slots which connect to the CPU. One of them I use for an NVMe SSD like a normal person. The other holds an M.2 to PCIe adapter which gives me an x4 PCIe slot. For those keeping track, that's 24 PCIe lanes so far. That's the maximum my processor, an Intel 265K, can handle.

But wait! The motherboard also has a kind of PCIe router (the chipset), and that thing can handle 8 more lanes! So I use the bottom 2 PCIe slots on my motherboard for 2 cards at x4 each. The thing that kills me is that there are more M.2 ports, but the mobo will not be able to use more than 2 of those devices at once. AND even though that bottom PCIe slot is sized at x16, electrically it's x4.
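
For anyone who wants to tally the lanes described above, here is a small sketch of the same bookkeeping. The widths come straight from this comment; the labels and helper names are just illustrative.

```python
# (description, lane width, is_gpu) for each link described in the comment.
CPU_LINKS = [
    ("main slot via MCIO adapter", 8, True),
    ("main slot via MCIO adapter", 4, True),
    ("main slot via MCIO adapter", 4, True),
    ("M.2 slot with NVMe SSD", 4, False),
    ("M.2 slot with M.2-to-PCIe adapter", 4, True),
]
CHIPSET_LINKS = [
    ("bottom PCIe slot", 4, True),
    ("bottom PCIe slot", 4, True),
]


def total_lanes(links):
    """Sum the lane widths of all links in the list."""
    return sum(width for _, width, _ in links)


def gpu_count(links):
    """Count how many links have a graphics card attached."""
    return sum(1 for _, _, is_gpu in links if is_gpu)


if __name__ == "__main__":
    print("CPU lanes used:    ", total_lanes(CPU_LINKS))
    print("Chipset lanes used:", total_lanes(CHIPSET_LINKS))
    print("Graphics cards:    ", gpu_count(CPU_LINKS) + gpu_count(CHIPSET_LINKS))
```

That works out to 24 CPU lanes, 8 chipset lanes, and 6 graphics cards, with only one card running wider than x4.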

Do your research (level1techs is great) and read the manuals to really understand this stuff before you buy.

My mobo, for reference: ASUS TUF GAMING Z890-PRO WIFI

[–] lavember@programming.dev 0 points 1 month ago (1 children)

How reliable is this setup for local inference? For instance how many tokens/sec?

I'm asking cause I'd guess sharing bandwidth like that would have some cost in speed

[–] afk_strats@lemmy.world 0 points 1 month ago (1 children)

I find llama.cpp with Vulkan EXTREMELY reliable. I can have it running for days at a time without a problem. As for tokens/sec, that's a complicated question because it depends on model, quant, speculative decoding, KV quant, context length, and card distribution. Generally:

Typical model speeds at deep context for agentic use. Simple chats will be faster.

Model	Quant	Prompt Processing (tok/s)	Token Generation (tok/s)	Hardware	Quality
Qwen 3.5 397B	Q2_K_M	100-120	18-22	2 x 7900 + 4 x Mi50	★★★★★
Gemma4 31B or Qwen3.5 27B	Q8_0	400-800	20-25	2 x 7900xtx	★★★★
Qwen 3.6 35B	Q5_K_M	1000-2500	60-100	2 x 7900xtx	★★★★
Qwen 3.5 122B	Q4_0	200-300	30-35	4 x MI50	★★★★
gpt-oss 120b	mxfp4 (native)	500-800	50-60	3 x Mi50	★★
Nemotron 3 Nano 30B	IQ3_K_XXS	2500-3000	150-180	1 x 7900xtx	★
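
To turn those two columns into a wait time, a rough estimate is prompt tokens divided by prompt-processing speed plus output tokens divided by generation speed. The sketch below uses the Qwen 3.6 35B row; the 30,000-token prompt and 500-token reply are made-up numbers, and it ignores prompt caching, which would shorten repeat requests.

```python
def request_seconds(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    """Rough wall-clock estimate: time to read the prompt plus time to write the reply."""
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


if __name__ == "__main__":
    # Hypothetical agentic request: large context in, short answer out.
    prompt_tokens, output_tokens = 30_000, 500

    # Low and high ends of the Qwen 3.6 35B row in the table above.
    slow = request_seconds(prompt_tokens, output_tokens, pp_tok_s=1000, tg_tok_s=60)
    fast = request_seconds(prompt_tokens, output_tokens, pp_tok_s=2500, tg_tok_s=100)
    print(f"Estimated time per request: {fast:.0f} to {slow:.0f} seconds")
```

With those assumed sizes, the estimate lands at roughly 17 to 38 seconds per request, and at deep context most of that is prompt processing.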

[–] lavember@programming.dev 0 points 1 month ago

that's sick, thanks for sharing

[–] pound_heap@lemmy.dbzer0.com 0 points 1 month ago (1 children)

Wow, I didn't think you were running 176GB worth of GPUs on a consumer board! I don't have an extra board, and my gaming PC with a 9070 XT is not a good basis for a multi-GPU build: it has a cheap mATX motherboard with too few slots and lanes. So it's going to be a new build. Used EPYC boards look interesting for that.

[–] afk_strats@lemmy.world 0 points 1 month ago

I wish I had bought an EPYC board last year instead of my rig. It would have been far fewer headaches and, with the price of RAM, it would have quintupled in value by now!