KingRandomGuy

joined 3 years ago
[–] KingRandomGuy@lemmy.world 3 points 1 month ago

Yeah I can believe their interconnect is better, given their extensive history in networking.

W.r.t. TFLOPs, let me clarify what I meant. Even on traditionally compute-bound workloads (attention, etc.), on an H200 it's surprisingly difficult to make full use of the card's throughput before hitting VRAM bandwidth limits. Tensor core throughput has grown a lot faster than memory bandwidth has.

I've never written a kernel for Huawei chips so I have no idea if they have the same problem. But this problem is there on many datacenter-class NVIDIA chips, which is why they keep introducing features (TMA, TMEM, etc.) to try and lower the time wasted waiting for memory.
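
To put rough numbers on the "throughput grew faster than bandwidth" point, here's a minimal sketch that computes the roofline "ridge point" (the FLOPs a kernel has to do per byte of memory traffic before it stops being bandwidth-bound) for a few NVIDIA generations. The spec figures are approximate public datasheet values (dense FP16 tensor throughput, HBM bandwidth) that I'm plugging in as assumptions, so treat the output as ballpark only.

```python
# Approximate datasheet figures: (dense FP16 tensor FLOP/s, memory bytes/s).
# These are illustrative assumptions, not measured values.
GPUS = {
    "V100 SXM2": (125e12, 0.9e12),
    "A100 80GB SXM": (312e12, 2.0e12),
    "H100 SXM": (989e12, 3.35e12),
}


def ridge_point(peak_flops, bandwidth):
    """FLOPs a kernel must perform per byte moved to become compute-bound."""
    return peak_flops / bandwidth


ridges = {name: ridge_point(flops, bw) for name, (flops, bw) in GPUS.items()}

for name, ridge in ridges.items():
    print(f"{name}: ~{ridge:.0f} FLOP/byte needed to saturate the tensor cores")
```

The required arithmetic intensity roughly doubles from V100 to H100 under these assumptions, which is the gap that features like TMA are meant to help close.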

[–] KingRandomGuy@lemmy.world 2 points 1 month ago

You can actually get kind of acceptable performance on CPU alone, but you need rather specific CPUs, like SPR (Sapphire Rapids) or newer Intel Xeons. These support AMX, which is almost like a mini tensor core, so you can get decent throughput in TFLOPs out of GNR (Granite Rapids) Xeons. Memory bandwidth with all channels populated is also acceptable, roughly 800 GB/s per socket with maxed-out MRDIMMs, which is not too far behind consumer GPUs like the 3090 and 4090.

Not anywhere near the performance of real GPUs of course, and not something acceptable for scale or production workloads, but good enough for local inference.
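
For anyone wondering where the ~800 GB/s figure comes from, here's a back-of-the-envelope sketch. It assumes 12 memory channels per socket running MRDIMMs at 8800 MT/s with a 64-bit data bus per channel (my assumptions for a maxed-out GNR config), and the model size in the second half is purely hypothetical.

```python
def dram_bandwidth(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in bytes/s (64-bit data bus per channel)."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes


def decode_tokens_per_s(bandwidth, active_params, bytes_per_param):
    """Upper bound on batch-1 decode speed if every active weight is read once per token."""
    return bandwidth / (active_params * bytes_per_param)


# Assumed config: 12 channels of MRDIMM-8800 on one socket.
bw = dram_bandwidth(channels=12, mega_transfers_per_s=8800)
print(f"Peak bandwidth: {bw / 1e9:.1f} GB/s")  # 844.8 GB/s

# Hypothetical MoE model with 37B active parameters stored in FP8 (1 byte each).
tps = decode_tokens_per_s(bw, active_params=37e9, bytes_per_param=1.0)
print(f"Decode ceiling: ~{tps:.0f} tokens/s")
```

Real numbers will land well under this ceiling (NUMA effects, KV cache reads, imperfect bandwidth utilization), but it shows why bandwidth is the figure to watch for local inference.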

[–] KingRandomGuy@lemmy.world 1 points 1 month ago (1 children)

Makes sense, even Flash is fairly sizable! KTransformers also has a "llamafile" backend which uses GGUFs, but ik_llama will almost certainly perform better if you're not on a NUMA setup. In my case, I'm using a dual-socket motherboard, so KTransformers performs quite a bit better (I think ik_llama hasn't implemented extensive NUMA optimizations quite yet, but it sounds like that's coming). That said, I normally use KTransformers for native FP8 weights.

[–] KingRandomGuy@lemmy.world 3 points 1 month ago (3 children)

Yeah, I'd expect KTransformers to add support eventually, especially considering their existing support for previous DeepSeek models. One of the tricky parts is that backends need both FP8 and MXFP4 support. As far as I'm aware, no inference engine supports both on CPU at the moment (llama.cpp added FP4 support recently but doesn't have FP8, while kt-kernel doesn't support FP4 yet).

[–] KingRandomGuy@lemmy.world 8 points 1 month ago (2 children)

To be fair, the raw FLOPs count doesn't tell the whole story. On a lot of workloads (including token generation during LLM inference), you're bound by memory bandwidth rather than compute throughput. On H100/H200, keeping the tensor cores fully occupied is surprisingly difficult, and that's with 3+ TB/s of memory bandwidth. And I believe those cards have much higher throughput than Ascend, at least at FP8 (Ascend wins at FP4 since H100/H200 don't support it).

The Ascend 950PR units have far lower memory bandwidth, reportedly 1.4 TB/s. Compare that to Blackwell, which has something like 8 TB/s of bandwidth. I believe they're manufacturing their own kind of HBM, so that's still really impressive considering this is a fairly recent push into manufacturing accelerators. But I'm a bit skeptical it actually outperforms NVIDIA at scale.
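
A simple roofline sketch shows why bandwidth dominates token generation. The two bandwidth figures are the ones quoted above; the peak FLOP/s value is a made-up placeholder that I set identically for both chips, purely to isolate the effect of bandwidth. The intensity formula is the usual approximation that each weight contributes one multiply-add (2 FLOPs) per sequence in the batch.

```python
def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline model: throughput is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, bandwidth * intensity)


def decode_intensity(batch_size, bytes_per_weight):
    """Approximate FLOPs per weight byte during decode (one multiply-add per sequence)."""
    return 2.0 * batch_size / bytes_per_weight


PEAK = 1e15  # hypothetical peak FLOP/s, same for both chips (placeholder)
BANDWIDTH = {"1.4 TB/s chip": 1.4e12, "8 TB/s chip": 8e12}

# Batch-1 decode with FP8 weights (1 byte each).
batch1 = {
    name: attainable_flops(PEAK, bw, decode_intensity(1, 1.0))
    for name, bw in BANDWIDTH.items()
}

for name, flops in batch1.items():
    saturating_batch = PEAK / (BANDWIDTH[name] * decode_intensity(1, 1.0))
    print(f"{name}: {flops / 1e12:.1f} TFLOP/s attainable at batch 1, "
          f"batch of ~{saturating_batch:.0f} needed to reach peak")
```

At small batch sizes the attainable throughput scales directly with bandwidth and the peak FLOPs figure never enters the picture, which is why the headline FLOPs comparison can be misleading.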

[–] KingRandomGuy@lemmy.world 6 points 5 months ago* (last edited 5 months ago)

What info have you heard about Fenghua 3? I'd last read that it's not strictly an AI accelerator but can actually do graphics tasks, which is neat. Would make it more of a competitor to a professional workstation card like an RTX PRO 6000.

I'm most curious about their CUDA compatibility claim. I would expect that to cause a pretty significant performance hit since when writing high-performance CUDA kernels, you generally need to specialize the kernel to the individual GPU (an H100 kernel will look quite different compared to a 4090 kernel, for example). But if in spite of that it can achieve H100 performance, that'd be cool.

[–] KingRandomGuy@lemmy.world 3 points 5 months ago

Like others have mentioned, the spider (the wires) and the secondary do shadow some light that would otherwise reach the primary. They also introduce some artifacts due to diffraction; the view ends up convolved with the telescope's point spread function, which is the squared magnitude of the Fourier transform of the aperture. This is why you see cross-shaped stars in Hubble images, as that's the shape of the diffraction pattern of its 4-strut spider.
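
If you want to see the spike effect numerically, here's a rough pure-Python sketch. It builds a circular aperture with and without a 4-vane cross, evaluates the far-field intensity (squared magnitude of the aperture's Fourier transform) by direct summation, and compares the energy along an axis to the energy along the 45-degree diagonal at the same radii. Grid size, aperture radius, and vane width are arbitrary assumptions, and I've left out the central obstruction to keep it short.

```python
import cmath
import math


def make_aperture(n=64, radius=24, vane_width=2, with_spider=True):
    """Binary aperture mask: open disc, optionally blocked by a 4-vane cross."""
    c = (n - 1) / 2.0  # geometric centre of the grid
    mask = []
    for y in range(n):
        row = []
        for x in range(n):
            dx, dy = x - c, y - c
            is_open = dx * dx + dy * dy <= radius * radius
            if with_spider and (abs(dx) < vane_width / 2 or abs(dy) < vane_width / 2):
                is_open = False  # vanes run along the x and y axes
            row.append(1.0 if is_open else 0.0)
        mask.append(row)
    return mask


def far_field_intensity(mask, u, v):
    """Squared magnitude of the aperture's Fourier transform at frequency (u, v)."""
    n = len(mask)
    total = 0j
    for y, row in enumerate(mask):
        for x, a in enumerate(row):
            if a:
                total += cmath.exp(-2j * math.pi * (u * x + v * y) / n)
    return abs(total) ** 2


def spike_contrast(mask, ks=range(8, 21)):
    """Energy along the u axis divided by energy along the diagonal, at equal radii."""
    d = 1 / math.sqrt(2)
    on_axis = sum(far_field_intensity(mask, k, 0.0) for k in ks)
    diagonal = sum(far_field_intensity(mask, k * d, k * d) for k in ks)
    return on_axis / diagonal


contrast_plain = spike_contrast(make_aperture(with_spider=False))
contrast_spider = spike_contrast(make_aperture(with_spider=True))
print(f"axis/diagonal energy, no spider:   {contrast_plain:.2f}")
print(f"axis/diagonal energy, with spider: {contrast_spider:.2f}")
```

Without the vanes the pattern is close to radially symmetric, so the ratio stays near 1. With the vanes, energy piles up along the axes perpendicular to each vane, which is what shows up as spikes on bright stars.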

[–] KingRandomGuy@lemmy.world 26 points 5 months ago* (last edited 5 months ago)

Every time I see a headline like this I’m reminded of the time I heard someone describe the modern state of AI research as equivalent to the practice of alchemy.

Not sure if you're referencing the same thing, but this actually came from a presentation at NeurIPS 2017 (the largest and most prestigious machine learning/AI conference) for the "Test of Time Award." The presentation is available here for anyone interested. It's a good watch. The presenter/awardee, Ali Rahimi, talks about how over time, rigor and fundamental knowledge in the field of machine learning have taken a backseat to empirical work that we continue to build upon, yet don't fully understand.

Some of that sentiment is definitely still true today, and unfortunately, understanding the fundamentals is only going to get harder as empirical methods get more complex. It's much easier to iterate on empirical things by just throwing more compute at a problem than it is to analyze something mathematically.

[–] KingRandomGuy@lemmy.world 2 points 7 months ago (1 children)

I do research in 3D computer vision and in general, depth from cameras (even multi-view) tends to be much noisier than LiDAR. LiDAR has the advantage of giving explicit depth, whereas with multi-view cameras you need to compute it, which has a fair number of failure modes. I think that's what the above user is getting at when they said Waymo actually has depth sensing.

This isn't to say that Tesla's approach can't work at all, but just that Waymo's is more grounded. There are reasons to avoid LiDAR (cost primarily, a good LiDAR sensor is very expensive), but if you can fit LiDAR into your stack it'll likely help a bit with reliability.
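
To illustrate why computed depth gets noisy, here's a minimal sketch using the textbook rectified-stereo relation Z = f * B / d (depth from focal length, baseline, and disparity). The camera parameters and the quarter-pixel matching error are hypothetical values I picked for illustration, not numbers from any real vehicle.

```python
import math


def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point from its disparity in a rectified stereo pair."""
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px, baseline_m, depth_m, disparity_err_px):
    """First-order depth error caused by a disparity matching error."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


# Hypothetical rig: 1000 px focal length, 30 cm baseline, 0.25 px matching error.
FOCAL_PX, BASELINE_M, MATCH_ERR_PX = 1000.0, 0.3, 0.25

for depth in (10.0, 25.0, 50.0):
    err = depth_error(FOCAL_PX, BASELINE_M, depth, MATCH_ERR_PX)
    print(f"{depth:5.1f} m away -> roughly +/- {err:.2f} m of depth error")
```

The error grows with the square of the distance, and that's before counting outright matching failures (textureless surfaces, repeated patterns, glare). A LiDAR return doesn't have that quadratic falloff since it measures range directly.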

[–] KingRandomGuy@lemmy.world 2 points 8 months ago

Yeah I agree on these fronts. The hardware might be good but software frameworks need to support it, which historically has been very hit or miss.

[–] KingRandomGuy@lemmy.world 3 points 8 months ago (2 children)

Depends strongly on what ops the NPU supports IMO. I don't do any local gen AI stuff but I do use ML tools for image processing in photography (e.g. Lightroom's denoise feature, GraXpert denoise and gradient extraction for astrophotography). These tools are horribly slow on CPU. If the NPU supports the right software frameworks and data types then it might be nice here.

[–] KingRandomGuy@lemmy.world 1 points 8 months ago

You're correct about all of this, but it's way easier to press print than to machine a part from stock. I do some machining as well (I don't own the machines, but I'm trained on the mill, lathe, and waterjet in our shop). So most of the time, if I can get away with a 3D-printed part, it's worth it for the time savings alone. Plus, sometimes the easiest or optimal geometry to design is not something that can be machined, but can be printed.

It's specific circumstances where the basic filaments fall short, like creep and heat resistance, irrespective of print parameters. ASA and PET-CF work well in most of these spots, so I don't do anything more exotic.
