AI - Artificial intelligence


AI related news and articles.

founded 1 year ago

A recent paper compares two ways of getting AI models to "think" through hard problems. One is the classic chain-of-thought approach, where the model writes out its reasoning steps in words. The other is latent thought, where the model does the extra thinking internally, in its hidden states, without emitting tokens. The authors pair a theoretical analysis with some experiments, and there are a couple of interesting takeaways.

If a problem can be split into independent pieces that get combined later, like evaluating a big math expression, checking whether two nodes in a graph are connected, or computing edit distance, latent thought can process all the pieces at the same depth in one shot. Chain-of-thought, on the other hand, has to go through every single operation one at a time, which takes a lot more steps.

The paper proves this by connecting these reasoning styles to circuit complexity classes. Essentially, latent thought with a small number of loops can simulate deep parallel circuits, while chain-of-thought with the same small number of steps can't. So for highly parallel tasks, you'd rather have the model think silently in its embeddings than write a long chain of words.
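To make the depth argument concrete, here is a toy sketch (not the paper's construction): summing a list one addition at a time versus pairing up neighbours so that every addition in a round is independent. The function names are made up for illustration, and "rounds" only loosely stands in for latent loops.

```python
def sequential_steps(values):
    """Sum left to right: one addition per step, like a step-by-step chain."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def parallel_rounds(values):
    """Sum by pairing neighbours; all pairs within a round are independent."""
    level, rounds = list(values), 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd element carries over to the next round
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


values = list(range(1, 1025))        # 1024 numbers
print(sequential_steps(values))      # (524800, 1023)
print(parallel_rounds(values))       # (524800, 10)
```

Same answer either way, but the sequential version needs a number of steps linear in the input while the paired version needs only a logarithmic number of rounds.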

The flip side is that chain-of-thought can do something latent thought can't: it can use stochastic decoding. Every time the model writes a new token, it samples from a probability distribution. This lets chain-of-thought run randomized algorithms that estimate hard counting problems, like approximating how many assignments satisfy a DNF logic formula, or sampling random graph colorings. Latent thought's internal steps are deterministic, so it can't inject that kind of randomness. The paper proves that, under standard assumptions, there are approximate counting and sampling tasks where chain-of-thought has a provable advantage.
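For a feel of what "randomized approximate counting" means here, below is a small sketch of the classic Karp-Luby estimator for counting satisfying assignments of a DNF formula. This is the textbook algorithm, not something taken from the paper, and the clause encoding (a dict mapping variable index to required value) is an assumption made for the example.

```python
import random
from itertools import product


def satisfies(assignment, clause):
    """A clause is a conjunction: every listed variable must match."""
    return all(assignment[v] == val for v, val in clause.items())


def exact_count(clauses, n):
    """Brute force over all 2**n assignments (only feasible for tiny n)."""
    return sum(
        any(satisfies(a, c) for c in clauses)
        for a in product([False, True], repeat=n)
    )


def karp_luby(clauses, n, samples, rng):
    """Estimate the number of satisfying assignments by random sampling."""
    # Clause i alone is satisfied by 2**(n - len(clause)) assignments.
    weights = [2 ** (n - len(c)) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick a clause proportionally to its weight, then a uniform
        # assignment that satisfies it.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        a = [rng.random() < 0.5 for _ in range(n)]
        for v, val in clauses[i].items():
            a[v] = val
        # Count the sample only if i is the first clause it satisfies,
        # so each satisfying assignment is counted exactly once.
        first = next(j for j, c in enumerate(clauses) if satisfies(a, c))
        hits += first == i
    return total * hits / samples


clauses = [{0: True, 1: True}, {1: True, 2: False}, {3: True, 4: True, 5: True}]
estimate = karp_luby(clauses, n=6, samples=20000, rng=random.Random(0))
print(exact_count(clauses, 6), round(estimate, 2))
```

The random draws are the whole point: a deterministic procedure has no direct way to do the sampling step.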

They tested on algorithmic tasks: word problems in group theory, graph connectivity, arithmetic expression evaluation, and edit distance. Latent thought reached high accuracy with far fewer iterations than chain-of-thought on all of them. For example, on connectivity, a looped transformer with 2 loops got 80% while CoT needed way more steps to catch up. However, on approximate counting and sampling tasks, chain-of-thought could estimate values and generate samples close to the true distribution, while latent thought couldn't match that because it lacks the stochastic component.

So the core takeaway is that there's no universal best approach. If your problem is parallelizable, latent thought is dramatically faster in terms of reasoning iterations. If your problem needs randomized approximation, chain-of-thought is the way to go.


The hardware efficiency gains are honestly the most interesting part of the paper. The main reason DeepSeek-V4 is so cheap to run comes down to how they completely bypassed the quadratic cost of standard attention for massive context windows.

They built a hybrid attention architecture that interleaves Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Standard models keep every single token in the KV cache, which absolutely kills memory. CSA fixes this by compressing the KV cache of multiple tokens into a single entry and then using a sparse routing mechanism to compute attention over only the top-k most relevant compressed blocks. HCA takes it a step further by compressing an even larger number of tokens into one entry, but computes dense attention over them. As a result, the 1.6T-parameter Pro model uses only a third of the compute FLOPs and 10% of the KV cache memory compared to DeepSeek-V3.2 at a one-million-token context.
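A rough sketch of the "compress, then attend to the top-k blocks" idea, in plain Python. Everything specific here is an assumption for illustration: mean pooling stands in for whatever learned compressor the model actually uses, and the function names are made up.

```python
import math


def mean_pool(vectors):
    """Collapse a block of vectors into one entry (assumed compressor)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_cache(keys, values, block):
    """Turn every `block` consecutive tokens into a single KV entry."""
    ck = [mean_pool(keys[i:i + block]) for i in range(0, len(keys), block)]
    cv = [mean_pool(values[i:i + block]) for i in range(0, len(values), block)]
    return ck, cv


def sparse_attention(query, ck, cv, top_k):
    """Softmax attention restricted to the top_k best-scoring entries."""
    scores = [dot(query, k) / math.sqrt(len(query)) for k in ck]
    keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    m = max(scores[i] for i in keep)              # subtract max for stability
    w = [math.exp(scores[i] - m) for i in keep]
    z = sum(w)
    out = [sum(wi / z * cv[i][d] for wi, i in zip(w, keep))
           for d in range(len(cv[0]))]
    return out, sorted(keep)


keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4      # 8 tokens, 2 dims
values = [[10.0, 0.0]] * 4 + [[0.0, 10.0]] * 4
ck, cv = compress_cache(keys, values, block=4)   # 8 tokens -> 2 entries
print(sparse_attention([0.0, 5.0], ck, cv, top_k=1))
```

The cache shrinks by the block size, and the attention cost depends on `top_k` rather than on the full context length. The dense variant described for HCA would be the same thing with a larger block and `top_k` equal to the number of entries.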

They also aggressively pushed low-precision formats, applying FP4 quantization-aware training to the Mixture-of-Experts weights and the attention Query-Key paths. MoE models are notoriously memory bound because you have to constantly shuttle massive expert weights into the GPU cores. Dropping these to FP4 eases the memory bandwidth bottleneck and lets the model run way faster during inference without ruining accuracy, since the model already learns to tolerate the quantization during training.
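As an illustration of what a 4-bit float grid does to weights, here is a minimal "fake quantization" sketch. It assumes the common E2M1 layout for FP4 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a single scale per block; the post does not say which FP4 variant or scaling scheme DeepSeek uses, so treat both as assumptions.

```python
import math

# Representable magnitudes of an E2M1 4-bit float (assumed format).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_fp4(block):
    """Round each value to the nearest FP4 grid point under one shared scale.

    Returns the dequantized values (what the forward pass would see in
    quantization-aware training) and the scale that was used.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block), 1.0
    scale = amax / FP4_E2M1[-1]          # map the largest magnitude to 6.0
    out = []
    for x in block:
        mag = min(FP4_E2M1, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out, scale


weights = [0.1, -0.26, 0.6, -1.2]
quantized, scale = fake_quantize_fp4(weights)
print([round(q, 3) for q in quantized], round(scale, 3))
```

Each weight then needs only 4 bits plus a shared scale, which is where the bandwidth saving comes from. In quantization-aware training the forward pass typically uses the rounded values while gradients flow to the full-precision copies.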

On the infrastructure side, they wrote a custom fused kernel using TileLang that overlaps communication and computation. When running expert parallelism across multiple GPUs, you usually hit a wall waiting for the network. DeepSeek slices the experts into micro-waves so the GPU is crunching matrix math on the first wave while the network is simultaneously pulling the data for the second wave. They basically hid the network latency behind the compute time, which means you do not need super expensive interconnects to get peak hardware utilization out of the cluster.
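A back-of-the-envelope model of why the overlap helps, assuming every wave has the same transfer and compute time (the numbers are invented, and this is a two-stage pipeline model, not the actual kernel):

```python
def serial_time(waves, comm, compute):
    """Fetch a wave, compute it, then fetch the next one."""
    return waves * (comm + compute)


def overlapped_time(waves, comm, compute):
    """Fetch wave i+1 while the GPU computes wave i."""
    net_free = gpu_free = 0.0
    for _ in range(waves):
        net_free += comm                               # transfers run back to back
        gpu_free = max(gpu_free, net_free) + compute   # wait for data, then compute
    return gpu_free


print(serial_time(4, comm=2, compute=3))      # 20
print(overlapped_time(4, comm=2, compute=3))  # 14.0
```

As long as compute per wave is at least as long as the transfer, only the very first transfer is exposed, and the rest of the network time disappears behind the matrix math.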


Paywall Bypass Link https://archive.is/HikCi


The Qwen-Scope paper is an interesting shift in how we handle mechanistic interpretability. The core idea is moving sparse autoencoders (SAEs) from being just a post-hoc inspection tool to an actual interface for building and fixing language models. The team open-sourced 14 groups of SAEs for the Qwen3 and Qwen3.5 architectures and demonstrated four practical ways to use them directly in the development pipeline.

First up is inference steering. Instead of just looking at which features activate when a model messes up, you can actively suppress or amplify those latent features to fix the output on the fly without updating any model weights. They showed an example where suppressing a specific Chinese language feature stopped the model from randomly mixing languages in response to an English prompt. They also showed you can trigger a classical literary style transfer just by turning on the right feature direction.
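Mechanically, steering of this kind usually comes down to editing a hidden-state vector along a feature's direction. A minimal sketch on plain lists, assuming the direction is something like the SAE decoder vector for the feature (the function names and the toy vectors are made up):

```python
import math


def _unit(direction):
    norm = math.sqrt(sum(d * d for d in direction))
    return [d / norm for d in direction]


def suppress(hidden, direction):
    """Remove the component of `hidden` that lies along the feature direction."""
    u = _unit(direction)
    proj = sum(h * x for h, x in zip(hidden, u))
    return [h - proj * x for h, x in zip(hidden, u)]


def amplify(hidden, direction, strength):
    """Push `hidden` along the feature direction by a fixed amount."""
    u = _unit(direction)
    return [h + strength * x for h, x in zip(hidden, u)]


hidden = [3.0, 4.0]
feature_direction = [0.0, 2.0]      # hypothetical "language" feature
print(suppress(hidden, feature_direction))        # [3.0, 0.0]
print(amplify(hidden, feature_direction, 2.0))    # [3.0, 6.0]
```

The model weights stay untouched; only the activations passing through one layer are edited at inference time.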

The evaluation finding is probably the most immediately useful for saving compute. They found that tracking the footprint of SAE features activated by a benchmark gives you a highly accurate proxy for dataset redundancy. If a bunch of reasoning problems activate the same micro-capability features, you can sample a tiny subset of the benchmark and still get the same model ranking. Measuring feature overlap is also a reliable way to figure out whether two different benchmarks are really testing the same capabilities before you waste time running full evaluations.
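One way to picture the redundancy check: treat each problem (or benchmark) as the set of feature ids it activates, measure overlap with Jaccard similarity, and greedily keep problems until every feature is covered. The greedy cover is an assumption about how one might exploit the finding, not the paper's stated procedure, and the ids are invented.

```python
def jaccard(a, b):
    """Overlap between two feature footprints, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def greedy_subset(problem_features):
    """Pick problems until every feature seen in the benchmark is covered."""
    uncovered = set().union(*problem_features.values())
    chosen = []
    while uncovered:
        best = max(problem_features,
                   key=lambda p: len(problem_features[p] & uncovered))
        chosen.append(best)
        uncovered -= problem_features[best]
    return chosen


footprints = {
    "p1": {1, 2},
    "p2": {1, 2},      # redundant with p1
    "p3": {3},
    "p4": {2, 3},
}
print(greedy_subset(footprints))            # ['p1', 'p3']
print(jaccard({1, 2, 3}, {2, 3, 4}))        # 0.5
```

A Jaccard score near 1.0 between two benchmarks would suggest they exercise the same features, so running both tells you little extra.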

On the data curation side, they showed you do not even need to train a classification head for things like toxicity. A simple logical rule over a few toxic-biased SAE features acts as a classifier and achieves an F1 score above 0.90. These toxic features discovered in English actually transfer quite well to other European languages. They also used this representation-level view for synthetic data generation by identifying safety features that were missing from the training distribution and prompting the model to generate examples that specifically trigger those missing internal directions.
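A sketch of what "a logical rule over a few features" could look like, along with the F1 computation. The feature ids, threshold, and toy data are all invented; the post does not give the actual rule.

```python
def rule_classifier(activations, toxic_features, threshold=0.5, min_hits=1):
    """Flag a sample if enough toxic-biased features fire above a threshold."""
    hits = sum(activations.get(f, 0.0) > threshold for f in toxic_features)
    return hits >= min_hits


def f1_score(preds, labels):
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


TOXIC_FEATURES = [101, 202]                  # hypothetical SAE feature ids
samples = [
    {101: 0.9},                              # feature id -> activation
    {7: 0.3},
    {202: 0.8, 7: 0.1},
    {101: 0.2},
]
labels = [True, False, True, True]

preds = [rule_classifier(s, TOXIC_FEATURES) for s in samples]
print(preds, f1_score(preds, labels))        # [True, False, True, False] 0.8
```

No training is involved: the only knobs are which features to watch and how strongly they need to fire.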

Finally, they integrated these latent features directly into supervised fine-tuning and reinforcement learning. In the fine-tuning stage, they added an auxiliary loss to suppress language-specific features, which heavily reduced unexpected code-switching. For reinforcement learning, they intentionally amplified repetition features to force the policy model to generate endless loops. This gives the RL pipeline rare negative samples that are otherwise incredibly hard to encounter naturally and provides an explicit training signal against repetitive loops.
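The fine-tuning part can be pictured as an ordinary task loss plus a penalty on selected feature activations. The exact form of the auxiliary loss is not given in the post, so the mean-of-positive-activations penalty and the weight `lam` below are assumptions.

```python
def auxiliary_penalty(feature_activations, suppressed_features):
    """Average positive activation over the features we want to keep quiet."""
    acts = [max(feature_activations.get(f, 0.0), 0.0) for f in suppressed_features]
    return sum(acts) / len(acts) if acts else 0.0


def total_loss(task_loss, feature_activations, suppressed_features, lam=0.1):
    """Task loss plus a weighted penalty on the suppressed features."""
    return task_loss + lam * auxiliary_penalty(feature_activations, suppressed_features)


activations = {5: 2.0, 9: 0.0, 11: 4.0}     # hypothetical feature id -> activation
language_features = [5, 9]                   # hypothetical language-specific ids
print(total_loss(1.0, activations, language_features))   # 1.1
```

Feature 11 is ignored because it is not in the suppressed set, so the penalty only pushes down the activations you explicitly name.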


Apr 29, 2026

Nearly 120 civil society groups on Wednesday urged US lawmakers to reject Republican-led efforts to fast-track approval of artificial intelligence and conventional data centers, including by slipping provisions for these facilities into permitting reform legislation or “must-pass” bills.

Fossil fuel companies “are pushing to fast-track data center build-outs while ignoring the impacts on communities and the environment,” the groups said in a letter to congressional leaders. “Proposals disguised as ‘commonsense’ reforms would weaken the National Environmental Policy Act (NEPA), the Clean Water Act, the Clean Air Act, and the Endangered Species Act, while also stripping residents of their right to participate in decisions affecting their health, water, and air.”


A GGUF port of DFlash speculative decoding. Standalone C++/CUDA stack on top of ggml, runs on a single 24 GB RTX 3090, hosts the new Qwen3.6-27B.

~1.98x mean speedup over autoregressive decoding on Qwen3.6 across HumanEval / GSM8K / Math500, with zero retraining.

If you have CUDA 12+ and an NVIDIA GPU like an RTX 3090 / 4090 / 5090, then all you need to do is:

clone the repo

cd lucebox-hub/dflash
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

fetch target (~16 GB)

hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/

matched 3.6 draft is gated: accept terms + set HF_TOKEN first

hf download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/

run

DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/run.py --prompt "def fibonacci(n):"

That's it. No Python runtime in the engine, no llama.cpp install, no vLLM, no SGLang.

Luce DFlash will:

  1. Load Qwen3.6-27B Q4_K_M target weights (~16 GB) plus the matched DFlash bf16 draft (~3.46 GB) and run DDTree tree-verify speculative decoding (block size 16, default budget 22, greedy verify).
  2. Compress the KV cache to TQ3_0 (3.5 bpv, ~9.7x vs F16) and roll a 4096-slot target_feat ring so 256K context fits in 24 GB. Q4_0 is the legacy path and tops out near 128K.
  3. Auto-bump the prefill ubatch from 16 to 192 for prompts past 2048 tokens (~913 tok/s prefill on 13K prompts).
  4. Apply sliding-window flash attention at decode (default 2048-token window, 100% speculative acceptance retained) so 60K context still decodes at 89.7 tok/s instead of 25.8 tok/s.
  5. Serve over an OpenAI-compatible HTTP endpoint or a local chat REPL.
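For readers new to speculative decoding, the "greedy verify" part of step 1 can be sketched as follows. This is a simplified single-chain version in Python, not the DDTree tree verification the engine uses, and `target_next` is a stand-in callable for the target model's greedy next-token choice.

```python
def speculative_step(prefix, draft_tokens, target_next):
    """Accept the longest draft prefix the target agrees with, plus one token.

    A real engine scores every draft position in one target forward pass;
    calling target_next per position here just keeps the logic readable.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    # Whole draft matched: the target still contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(tokens):
    """Toy 'model' that always continues with last token + 1, modulo 10."""
    return (tokens[-1] + 1) % 10


print(speculative_step([1], [2, 3, 9, 5], toy_target))   # [2, 3, 4]
print(speculative_step([1], [2, 3], toy_target))         # [2, 3, 4]
```

Because every accepted token is exactly what the target would have produced greedily, the output matches plain autoregressive decoding; the speedup comes from emitting several tokens per target pass, which is what the AL (acceptance length) column in the benchmark measures.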

Running on RTX 3090, Qwen3.6-27B UD-Q4_K_XL (unsloth Dynamic 2.0) target, 10 prompts/dataset, n_gen=256:

| Bench | AR tok/s | DFlash tok/s | AL | Speedup |
|---|---|---|---|---|
| HumanEval | 34.90 | 78.16 | 5.94 | 2.24x |
| Math500 | 35.13 | 69.77 | 5.15 | 1.99x |
| GSM8K | 34.89 | 59.65 | 4.43 | 1.71x |
| Mean | 34.97 | 69.19 | 5.17 | 1.98x |
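As a quick sanity check, the Speedup and Mean figures can be recomputed from the per-benchmark numbers in the post (speedup taken as DFlash tok/s divided by AR tok/s, which is an assumption about how the column was derived):

```python
rows = {
    # bench: (AR tok/s, DFlash tok/s, AL)
    "HumanEval": (34.90, 78.16, 5.94),
    "Math500": (35.13, 69.77, 5.15),
    "GSM8K": (34.89, 59.65, 4.43),
}

speedups = {name: dflash / ar for name, (ar, dflash, _) in rows.items()}
n = len(rows)
mean_ar = sum(ar for ar, _, _ in rows.values()) / n
mean_dflash = sum(d for _, d, _ in rows.values()) / n
mean_al = sum(al for _, _, al in rows.values()) / n
mean_speedup = sum(speedups.values()) / n

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
print(f"Mean: {mean_ar:.2f} {mean_dflash:.2f} {mean_al:.2f} {mean_speedup:.2f}x")
```

The recomputed values round to the same 2.24x / 1.99x / 1.71x per-benchmark speedups and the same 1.98x mean as reported.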

deepseek v4 (api-docs.deepseek.com)

[...]

That marketing may have outstripped reality. Early reports from Mythos preview users including AWS and Mozilla indicate that while the model is very good and very fast at finding vulnerabilities, and requires less hands-on guidance from security engineers - making it a welcome time-saver for the human teams - it has yet to eclipse human security researchers.

"So far we've found no category or complexity of vulnerability that humans can find that this model can't," Mozilla CTO Bobby Holley said, after revealing that Mythos found 271 vulnerabilities in Firefox 150. Then he added: "We also haven't seen any bugs that couldn't have been found by an elite human researcher." In other words, it's like adding an automated security researcher to y
