AI - Artificial intelligence

1

GLM-5.2 is the step change for open agents (www.interconnects.ai)

submitted 1 day ago by cm0002@toast.ooo to c/Aii@programming.dev

0 comments fedilink

2

1

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (arxiv.org)

submitted 2 days ago by cm0002@toast.ooo to c/Aii@programming.dev

0 comments fedilink

DualPath is a system developed by DeepSeek to address the storage I/O bottleneck that slows down agentic LLM inference. When LLMs run as agents, they repeatedly interact with their environments over many turns, which builds up a massive context history stored as a KV-Cache. Most current systems split the workload into prefill engines that process new prompt tokens and decode engines that generate the actual responses. The fundamental issue is that prefill engines have to load the KV-Cache directly from external persistent storage, which maxes out network bandwidth on the prefill side while the storage network connections on the decode engines sit idle.

DualPath creates a second route for the data, which allows the system to load KV-Cache from storage into the idle decode engines first. Once the data hits the decode engines, it gets forwarded to the prefill engines over the fast compute network connecting the GPUs. It's basically a routing strategy that aggregates the storage bandwidth across all the machines and stops the prefill nodes from becoming a choke point.
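To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. The node counts and NIC speed are made up for illustration and are not from the paper, and the function name is hypothetical. It also assumes every node has one storage NIC of the same speed and that the compute network never becomes the limit.

```python
def aggregate_storage_bandwidth(prefill_nodes, decode_nodes, nic_gbps, dual_path):
    """Storage bandwidth (Gbps) usable for loading KV-Cache.

    Assumes one storage NIC per node, all the same speed, and a compute
    network that is never the limiting factor (both are simplifications).
    """
    # Without the second path only the prefill nodes' storage NICs are used.
    usable_nodes = prefill_nodes + decode_nodes if dual_path else prefill_nodes
    return usable_nodes * nic_gbps


# Hypothetical cluster: 4 prefill nodes, 4 decode nodes, 100 Gbps storage NICs.
baseline = aggregate_storage_bandwidth(4, 4, 100, dual_path=False)
dualpath = aggregate_storage_bandwidth(4, 4, 100, dual_path=True)
print(baseline, dualpath)  # 400 800
```

The real gain depends on the prefill to decode ratio and on how much spare capacity the compute network has, so treat this as an upper bound on a toy cluster.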

A traffic manager places the KV-Cache transfers onto a lower priority virtual lane so that the actual inference communication gets the majority of the bandwidth while data shuffling happens in the background without causing latency spikes. A dynamic scheduler then constantly monitors token counts and queue lengths to distribute the read tasks evenly across all available hardware. In tests, DualPath improved system throughput by nearly two times compared to a standard setup. Turns out that properly balancing traffic across network bandwidth that was already available in the cluster makes multi-turn agent workloads dramatically faster.
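The scheduling idea can be sketched as a greedy least-loaded assignment. This is only an illustration of balancing reads by queued token count, not the scheduler from the paper. The function name, engine names, and request sizes are all hypothetical, and the priority lanes are not modeled.

```python
import heapq


def assign_reads(requests, engines):
    """Greedily send each KV-Cache read to the engine with the fewest queued tokens.

    requests: list of (request_id, token_count)
    engines:  list of engine names (prefill and decode engines alike)
    Returns a dict mapping engine name -> list of request ids.
    """
    heap = [(0, name) for name in engines]
    heapq.heapify(heap)
    plan = {name: [] for name in engines}
    # Place the largest reads first so big transfers don't pile up on one engine.
    for request_id, tokens in sorted(requests, key=lambda r: -r[1]):
        queued, name = heapq.heappop(heap)
        plan[name].append(request_id)
        heapq.heappush(heap, (queued + tokens, name))
    return plan


requests = [("a", 8000), ("b", 6000), ("c", 5000), ("d", 3000)]
plan = assign_reads(requests, ["prefill-0", "decode-0"])
print(plan)  # {'prefill-0': ['b', 'c'], 'decode-0': ['a', 'd']}
```

In this toy run both engines end up with 11000 queued tokens, which is the kind of even split the dynamic scheduler is aiming for.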

3

1

Oracle’s 21,000 layoffs help drive its debt-fueled AI investments (arstechnica.com)

submitted 2 days ago by cm0002@toast.ooo to c/Aii@programming.dev

0 comments fedilink

4

1

Meta's Program That Spies on Every Employee's Computer Just Blew Up in Its Face in Spectacular Fashion (futurism.com)

submitted 2 days ago by cm0002@toast.ooo to c/Aii@programming.dev

0 comments fedilink

5

1

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models (arxiv.org)

submitted 3 days ago by cm0002@toast.ooo to c/Aii@programming.dev

0 comments fedilink

What we have here is a massive reality check for the current obsession with blindly scaling up parameters to get better performance. It shows that you can squeeze frontier level logical reasoning into a tiny 3B parameter model. It managed to hit a score of 94.3 on the extremely difficult AIME26 math benchmark and got an 80.2 on LiveCodeBench v6, putting this incredibly small model in the same weight class as massive flagship models like Gemini 3 Pro.

They pulled it off using an optimized post-training pipeline based on their Spectrum to Signal paradigm. It starts with curriculum based supervised fine tuning to teach the model broad concepts before forcing it to focus on extremely hard and long reasoning problems. After that, they ran multi domain reinforcement learning with a huge 64K context window to make sure the model could actually finish its long thoughts without getting artificially truncated. Another trick was a Long2Short reinforcement learning stage designed to force the model to be more token efficient in its math reasoning without losing accuracy. They tied it all together with offline self distillation to bake advanced reasoning skills into the base model.
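For intuition on the Long2Short stage, here is one way a length-aware reward could look. This is my own guess at the general shape, not the formula from the paper. The function name, the 0.2 weight, and the linear penalty are assumptions; only the 64K token budget comes from the description above.

```python
def long2short_reward(is_correct, num_tokens, max_tokens=65536, length_weight=0.2):
    """Reward correct answers, with a higher reward for using fewer tokens.

    Wrong answers get 0 regardless of length, so the model cannot trade
    accuracy for brevity. The weight and the linear penalty are illustrative
    assumptions, not values from the paper.
    """
    if not is_correct:
        return 0.0
    # Fraction of the token budget that was used, capped at 1.
    used = min(num_tokens, max_tokens) / max_tokens
    return 1.0 - length_weight * used


short_answer = long2short_reward(True, 4096)
long_answer = long2short_reward(True, 60000)
wrong_answer = long2short_reward(False, 1000)
print(short_answer > long_answer > wrong_answer)  # True
```

The point of the design is the ordering: a short correct answer beats a long correct answer, and any correct answer beats a wrong one.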

The authors argue that the industry has been conflating two different types of artificial intelligence capabilities. Memorizing world knowledge and random facts naturally requires a large number of parameters. However, pure verifiable reasoning like math and code is actually parameter dense because it is mostly just search, constraint satisfaction, and error correction. So you can tightly compress a world class reasoning engine into a tiny model without needing hundreds of billions of parameters to store random trivia. A big takeaway here is that small models aren't just cheap fallbacks for when you cannot afford massive compute. They can legitimately be used for building top tier reasoning systems.

https://huggingface.co/WeiboAI/VibeThinker-3B

6

1

There has been a situation in AI (GLM5.2) - Sentdex (www.youtube.com)

submitted 5 days ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

7

1

Agent memory on Elasticsearch: hybrid retrieval and DLS (www.elastic.co)

submitted 1 week ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

8

1

Local Qwen isn't a worse Opus, it's a different tool (blog.alexellis.io)

submitted 1 week ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

9

1

GLM-5.2 is the new leading open weights model on Artificial Analysis (artificialanalysis.ai)

submitted 1 week ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

10

1

New #1 open-source AI model is here! (www.youtube.com)

submitted 1 week ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

11

1

DeepSeek V4 Pro at 5% the cost of Claude — what it takes to close the gap (howardchen.substack.com)

submitted 1 week ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

12

1

Qwen-Robot Suite: A Foundation Model Suite for Physical World Intelligence (qwen.ai)

submitted 1 week ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

13

1

How Developers React to AI-Scented Blog Posts (writethatblog.substack.com)

submitted 1 week ago by codeinabox@programming.dev to c/Aii@programming.dev

0 comments fedilink

14

1

Running local models is good now (vickiboykis.com)

submitted 1 week ago by cm0002@mander.xyz to c/Aii@programming.dev

0 comments fedilink

15

1

OpenRouter shows that multiple smaller models working together surpass frontier performance (openrouter.ai)

submitted 1 week ago by cm0002@europe.pub to c/Aii@programming.dev

0 comments fedilink

16

1

AI GPUs probably live longer than three years (www.seangoedecke.com)

submitted 1 week ago by codeinabox@programming.dev to c/Aii@programming.dev

0 comments fedilink

17

1

RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8 (imil.net)

submitted 1 week ago by cm0002@europe.pub to c/Aii@programming.dev

0 comments fedilink

18

1

Jeff Bezos's Prometheus raises $12B to build an 'artificial general engineer' for the physical world (techcrunch.com)

submitted 2 weeks ago by cm0002@europe.pub to c/Aii@programming.dev

0 comments fedilink

19

1

Stack Overflow is launching a version of itself for AI agents (www.neowin.net)

submitted 2 weeks ago by nemeski@mander.xyz to c/Aii@programming.dev

0 comments fedilink

20

1

Attention Residuals (arxiv.org)

submitted 2 weeks ago* (last edited 2 weeks ago) by cm0002@literature.cafe to c/Aii@programming.dev

0 comments fedilink

A new paper from Moonshot AI tackles a key bottleneck in how language models handle depth. Standard residual connections just add up the outputs of all previous layers using fixed uniform weights, and uniform addition creates a problem where hidden states grow uncontrollably as the network gets deeper. As a result, the contributions of early layers end up getting completely buried and diluted by the time the data reaches the end of the model.

This happens to be the exact same issue older recurrent neural networks faced over time before attention mechanisms came along. Naturally, they tackle the problem in a similar way: attention residuals replace the fixed accumulation with a softmax attention mechanism over the outputs of preceding layers. Now, every single layer gets a learned pseudo query vector that lets it selectively pick and choose which earlier representations it actually needs to look at. This allows the network to naturally retrieve information from anywhere in its depth depending on the specific input.
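A minimal pure Python sketch of the mechanism, assuming the keys are just the raw layer outputs and there is no normalization or projection (the paper may well do more here). The function names are mine, and the pseudo query is passed in directly rather than learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_residual(query, layer_outputs):
    """Mix earlier layer outputs with softmax weights instead of a plain sum.

    query:         the current layer's pseudo query vector (learned in practice)
    layer_outputs: one hidden vector per preceding layer
    Returns the mixed vector and the weight given to each preceding layer.
    """
    # One dot-product score per preceding layer.
    scores = [sum(q * h for q, h in zip(query, out)) for out in layer_outputs]
    weights = softmax(scores)
    mixed = [
        sum(w * out[i] for w, out in zip(weights, layer_outputs))
        for i in range(len(query))
    ]
    return mixed, weights


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = attention_residual([5.0, 0.0], outputs)
# The query points along the first dimension, so layers 0 and 2 get
# most of the weight and layer 1 is mostly ignored.
print(weights[0] > weights[1], weights[2] > weights[1])  # True True
```

With an all-zero query every layer gets the same weight, which is a uniform average of the layer outputs. The learned query is what lets a layer deviate from that and pull from specific depths.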

However, applying this over every individual layer, which they call Full AttnRes, comes with a massive catch. Saving all those individual layer outputs creates memory and communication bottlenecks during large scale distributed training because the overhead scales linearly with the number of layers. So, in order to make the architecture actually usable, they grouped the layers into blocks and summed up the outputs inside each block. The cross layer attention is then only applied over these compressed block level summaries rather than every single layer, drastically reducing the memory and communication footprint.
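The block trick is easy to sketch. Below, layer outputs are summed inside fixed-size blocks so that cross layer attention only has to look at one vector per block. The block size, the toy vectors, and the function name are arbitrary choices for illustration, not details from the paper.

```python
def block_summaries(layer_outputs, block_size):
    """Sum the layer outputs inside each block of `block_size` consecutive layers."""
    summaries = []
    for start in range(0, len(layer_outputs), block_size):
        block = layer_outputs[start:start + block_size]
        # Element-wise sum across the layers in this block.
        summaries.append([sum(values) for values in zip(*block)])
    return summaries


layer_outputs = [[float(i), 1.0] for i in range(8)]  # 8 toy layers, hidden size 2
summaries = block_summaries(layer_outputs, block_size=4)
print(len(layer_outputs), len(summaries))  # 8 2
print(summaries)  # [[6.0, 4.0], [22.0, 4.0]]
```

Here the attention would run over 2 stored vectors instead of 8, so the saved state shrinks by roughly the block size.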

By combining the block structure with a smart cross stage caching system and a two phase computation strategy, the setup becomes a practical drop-in replacement with almost zero training overhead. Their experimental results show that the performance boost holds up consistently across different model sizes.

21

1

AI CEOs from OpenAI, Anthropic, and Microsoft set aside their rivalry to warn Congress AI is making it too easy to design and create bioweapons (fortune.com)

submitted 2 weeks ago by monica_b1998@lemmy.world to c/Aii@programming.dev

0 comments fedilink

22

1

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models (arxiv.org)

submitted 2 weeks ago by cm0002@suppo.fi to c/Aii@programming.dev

0 comments fedilink

23

1