AI - Artificial intelligence

AI related news and articles.

founded 1 year ago

Unlike many startups producing nothing but vaporware, Jim Keller's Tenstorrent has actually delivered impressive-looking RISC-V AI accelerators, but there could be some trouble brewing. Starting with firmware version 19.5.0, the firm has cut the tensor core count on Blackhole p150 cards from 140 to 120, affecting both new cards and existing units already in customers' hands.

The news was apparently communicated to customers via email, with the same wording present on the firmware update's GitHub page. Tenstorrent isn't elaborating on why the change was made, leaving existing and potential buyers scratching their heads.

  • Clone any voice with just a 3-second audio sample
  • Fine-tune parameters (temperature, top-k, top-p) with quality presets
  • Generate complete podcasts from just a topic – AI writes the script, assigns voices, and synthesizes everything
  • 10 languages supported (Korean, English, Chinese, Japanese, etc.)

Currently uses GPT-5.2 for script generation, but the architecture is modular – you can swap in any local LLM (Qwen, Llama, etc.) if you want a fully local setup.

submitted 5 months ago* (last edited 5 months ago) by cm0002@lemmings.world to c/Aii@programming.dev
 
 

If you’ve ever trained or fine-tuned an LLM, you’ve likely hit a wall at the very last step: the Cross-Entropy Loss.

The culprit is the logit bottleneck. To predict the next token, we project a hidden state into a massive vocabulary space. For Llama 3 (128,256 tokens), the weight matrix alone is over 525 million parameters. While that’s only ~1GB in bfloat16, the intermediate logit tensor is the real issue. For large batches, it can easily exceed 80GB of VRAM just to compute a single scalar loss.
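The numbers above can be checked with some quick arithmetic. This is only a back-of-the-envelope sketch: the hidden size of 4,096 is an assumption (it matches the 8B model and reproduces the ~525M figure), and the batch size, sequence length, and float32 logits are hypothetical values picked for illustration, not taken from the article.

```python
VOCAB_SIZE = 128_256   # Llama 3 vocabulary size, from the text
HIDDEN_SIZE = 4_096    # assumption: hidden size of the 8B model
BYTES_BF16 = 2
BYTES_FP32 = 4


def lm_head_params(vocab=VOCAB_SIZE, hidden=HIDDEN_SIZE):
    """Parameter count of the final projection (hidden -> vocabulary)."""
    return vocab * hidden


def logit_bytes(batch, seq_len, vocab=VOCAB_SIZE, bytes_per_elem=BYTES_FP32):
    """Size of the full logit tensor of shape (batch, seq_len, vocab)."""
    return batch * seq_len * vocab * bytes_per_elem


params = lm_head_params()
weight_gb = params * BYTES_BF16 / 1e9

# Hypothetical training shape: 32 sequences of 8,192 tokens, float32 logits.
logits_gb = logit_bytes(batch=32, seq_len=8_192) / 1e9

print(f"LM head parameters: {params:,}")
print(f"LM head weights in bfloat16: {weight_gb:.2f} GB")
print(f"Full logit tensor in float32: {logits_gb:.1f} GB")
```

The weight matrix is a fixed cost, while the logit tensor grows linearly with the number of tokens in the batch, which is why it dominates at large batch sizes.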

Optimising this layer is how libraries like Unsloth and Liger-Kernel achieve such massive memory reductions. In this article, we’ll build a fused Linear + Cross Entropy kernel from scratch in Triton. We will derive the math and implement a tiled forward and backward pass that slashes peak memory usage by 84%.
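To give a feel for the tiling idea, here is a minimal pure-Python sketch, not the article's Triton kernel. It computes the loss for one token by walking over the vocabulary in tiles and keeping a running log-sum-exp, so the full logit vector is never held at once. The function names, the tile size, and the toy dimensions are all made up for illustration, and only the forward pass is shown.

```python
import math
import random


def full_cross_entropy(hidden, weight, target):
    """Reference: materialise every logit, then take log-softmax at target."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in weight]
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


def tiled_cross_entropy(hidden, weight, target, tile=4):
    """Same loss, but only one tile of logits exists at any time."""
    running_max = -math.inf
    running_sum = 0.0
    target_logit = None
    for start in range(0, len(weight), tile):
        rows = weight[start:start + tile]
        tile_logits = [sum(h * w for h, w in zip(hidden, row)) for row in rows]
        new_max = max(running_max, max(tile_logits))
        # Rescale the old sum to the new maximum before adding this tile.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(z - new_max) for z in tile_logits
        )
        running_max = new_max
        if start <= target < start + len(rows):
            target_logit = tile_logits[target - start]
    return running_max + math.log(running_sum) - target_logit


# Toy problem: hidden size 8, vocabulary of 50 tokens (hypothetical sizes).
rng = random.Random(0)
hidden = [rng.uniform(-1, 1) for _ in range(8)]
weight = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(50)]
target = 17

loss_full = full_cross_entropy(hidden, weight, target)
loss_tiled = tiled_cross_entropy(hidden, weight, target, tile=4)
print(loss_full, loss_tiled)
```

Both functions should agree up to floating-point error. A real fused kernel does the same bookkeeping per tile on the GPU and also computes gradients without storing the logits.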


The paper argues that we have been wasting a lot of expensive GPU cycles by forcing transformers to relearn static things like names or common phrases through deep computation. Standard models do not have a way to just look something up, so they end up simulating memory by passing tokens through layer after layer of feed-forward networks. DeepSeek introduced a module called Engram, which adds a dedicated lookup step for local N-gram patterns. It acts as a new axis for scaling a model, separate from the usual compute-heavy Mixture of Experts approach.

The architecture uses multi-head hashing to grab static embeddings for specific token sequences, which are then filtered through a context-aware gate to make sure they actually fit the current situation. They found a U-shaped scaling law where the best performance happens when you split your parameter budget between neural computation and this static memory. By letting the memory handle the simple local associations, the model can effectively act like it is deeper, because the early layers are not bogged down with basic reconstruction.
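Here is a rough sketch of what a multi-head hashed N-gram lookup with a gate might look like. It is not the paper's implementation: the hash function, table size, averaging of heads, and the sigmoid-of-dot-product gate are all assumptions chosen to keep the example small, and every name in it is hypothetical.

```python
import hashlib
import math
import random

NUM_HEADS = 4      # hypothetical number of hash heads
TABLE_SIZE = 1024  # hypothetical rows per embedding table
EMBED_DIM = 8      # hypothetical embedding width


def head_index(ngram, head, table_size=TABLE_SIZE):
    """Hash an N-gram of token ids to a table row; each head salts the key."""
    key = f"{head}:" + ",".join(str(t) for t in ngram)
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size


def make_tables(num_heads=NUM_HEADS, seed=0):
    """Random static embedding tables standing in for learned ones."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0, 0.1) for _ in range(EMBED_DIM)] for _ in range(TABLE_SIZE)]
        for _ in range(num_heads)
    ]


def lookup(tables, ngram):
    """Fetch one row per head and average them (assumed combination rule)."""
    rows = [tables[h][head_index(ngram, h)] for h in range(len(tables))]
    return [sum(col) / len(rows) for col in zip(*rows)]


def gated_memory(hidden, memory):
    """Assumed gate: sigmoid of the dot product between context and memory."""
    score = sum(h * m for h, m in zip(hidden, memory))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [gate * m for m in memory], gate


tables = make_tables()
ngram = (1012, 77, 4096)  # hypothetical token ids for a 3-gram
memory = lookup(tables, ngram)
hidden = [0.5] * EMBED_DIM  # stand-in for the current hidden state
output, gate = gated_memory(hidden, memory)
print(gate, output)
```

Using several heads means two N-grams that collide in one table are unlikely to collide in all of them. Note also that the row indices depend only on the token ids, not on any hidden state.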

One of the best bits is how they handle hardware constraints by offloading the massive lookup tables to host RAM. Since these lookups are deterministic based on the input tokens, the system can prefetch the data from CPU memory before the GPU even needs it. This means you can scale to tens of billions of extra parameters with almost zero impact on speed, since the retrieval happens while the previous layers are still calculating.

The benchmarks show that this pays off across the board, especially in long-context tasks where the model needs its attention focused on global details rather than local phrases. It turns out that even in math and coding the model gets a boost, because it is no longer wasting its internal reasoning depth on things that should just be in a lookup table. Moving forward, this kind of conditional memory could become a standard part of sparse models, because it sidesteps the GPU memory limits of current hardware.
