AI - Artificial intelligence


AI related news and articles.

founded 11 months ago
 
 

Our latest model, Claude Opus 4.7, is now generally available.

Opus 4.7 is a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks. Users report being able to hand off their hardest coding work—the kind that previously needed close supervision—to Opus 4.7 with confidence. Opus 4.7 handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back.

The model also has substantially better vision: it can see images in greater resolution. It’s more tasteful and creative when completing professional tasks, producing higher-quality interfaces, slides, and docs. And—although it is less broadly capable than our most powerful model, Claude Mythos Preview—it shows better results than Opus 4.6 across a range of benchmarks:

 
 

I think one of the biggest mistakes we have made as an industry is conflating the words "AI" and "LLMs." The irony is right there on the surface. Naming is one of the hardest things to do in software, and we've done it poorly for the primary tool of software.

 
 

Most Americans using AI tools for health purposes say they want immediate answers. In some cases, it helps them evaluate what kind of medical attention they need.

“It’ll let me know if something’s serious or not,” Davis said of ChatGPT, which she typically consults before scheduling medical appointments.

The Gallup survey found about 7 in 10 U.S. adults who have used AI for health research in the past 30 days say they wanted quick answers, additional information or were simply curious. Majorities used it for research before seeing a doctor or after an appointment.

 
 

AI models from Google, OpenAI, and Anthropic lost money betting on soccer matches over a Premier League season, in a new study suggesting even the most advanced systems struggle to analyze the real world over long periods.

 
 

"the company admitted it likely won’t be able to keep up with competing models."

"As such, the announcement is a bit of an enigma: if it can’t keep up with the competition, why release it at all? There’s a good chance Meta is just trying to get its foot in the door — or a “seat at the big kid’s table,” as Wired put it. The company has struggled to stay relevant in a rapidly changing landscape"

"Meta’s preceding Llama open source models largely failed to catch on, with a major controversy last year finding that Meta may have faked benchmark results to make its Llama 4 model seem more capable than it actually was."

 
 

Chinese AI company Z.ai has launched GLM-5.1, an open-source coding model it says is built for agentic software engineering. The release comes as AI vendors move beyond autocomplete-style coding tools toward systems that can handle software tasks over longer periods with less human input.

Z.ai said GLM-5.1 can sustain performance over hundreds of iterations, an ability it argues sets it apart from models that lose effectiveness in longer sessions.

As one example, the company said GLM-5.1 improved a vector database optimization task over more than 600 iterations and 6,000 tool calls, reaching 21,500 queries per second, about six times the best result achieved in a single 50-turn session.
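For scale, the comparison above implies a rough baseline for the single 50-turn session. The snippet below is only back-of-the-envelope arithmetic on the two quoted figures; the baseline is an inferred estimate, not a number Z.ai reported, and "about six times" makes it approximate (roughly 3,600 queries per second).

```python
# Figures quoted from Z.ai's claim; the baseline below is inferred, not reported.
reported_qps = 21_500   # queries per second after 600+ iterations
reported_ratio = 6      # "about six times", so treat as approximate

# Implied best result for a single 50-turn session.
implied_baseline_qps = reported_qps / reported_ratio
print(f"implied single-session best: ~{implied_baseline_qps:,.0f} QPS")
```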

In a research note, Z.ai said GLM-5.1 outperformed its predecessor, GLM-5, on several software engineering benchmarks and showed particular strength in repo generation, terminal-based problem solving, and repeated code optimization. The company said the model scored 58.4 on SWE-Bench Pro, compared with 55.1 for GLM-5, and above the scores it listed for OpenAI’s GPT-5.4, Anthropic’s Opus 4.6, and Google’s Gemini 3.1 Pro on that benchmark.

 
 

As businesses drink the agentic AI Kool-Aid and go looking for productivity enhancements, IT professionals can deliver by rebranding their existing automations as “zero-token architecture,” according to Kelsey Hightower, a former Google distinguished engineer and a notable early promoter of Kubernetes.

 
 

Dafny is a good intermediate step for LLM-generated code.

This is the abstract of the paper:

Using large language models (LLMs) to generate source code from natural language prompts is a popular and promising idea with a wide range of applications. One of its limitations is that the generated code can be faulty at times, often in a subtle way, despite being presented to the user as correct. In this paper, we explore ways in which formal methods can assist with increasing the quality of code generated by an LLM. Instead of emitting code in a target language directly, we propose that the user guides the LLM to first generate an opaque intermediate representation, in the verification-aware language Dafny, that can be automatically validated for correctness against agreed on specifications. The correct Dafny program is then compiled to the target language and returned to the user. All user-system interactions throughout the procedure occur via natural language; Dafny code is never exposed. We describe our current prototype and report on its performance on the HumanEval Python code generation benchmarks.
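To make the workflow in the abstract concrete, here is a minimal Python sketch of a generate, verify, then compile loop. It is not the paper's prototype: the function names (`generate_verified`, `generate_dafny`, `verify`, `compile_to_target`), the `VerifyResult` type, and the retry-with-feedback behaviour are all assumptions made for illustration. The three callables stand in for an LLM call and for wrappers around the Dafny verifier and compiler, none of which are implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerifyResult:
    ok: bool
    feedback: str = ""  # verifier messages handed back to the model on failure


def generate_verified(
    prompt: str,
    generate_dafny: Callable[[str, str], str],  # hypothetical: (prompt, feedback) -> Dafny source
    verify: Callable[[str], VerifyResult],      # hypothetical wrapper around the Dafny verifier
    compile_to_target: Callable[[str], str],    # hypothetical wrapper around the Dafny compiler
    max_attempts: int = 3,                      # assumed retry budget, not from the paper
) -> Optional[str]:
    """Return target-language code only if the intermediate Dafny program verifies."""
    feedback = ""
    for _ in range(max_attempts):
        dafny_src = generate_dafny(prompt, feedback)
        result = verify(dafny_src)
        if result.ok:
            # Only verified Dafny is compiled; the Dafny source itself is never returned.
            return compile_to_target(dafny_src)
        feedback = result.feedback  # let the next attempt see why verification failed
    return None  # no verified program within the attempt budget
```

Passing the three steps in as callables keeps the sketch runnable with stubs and leaves open how the verifier and compiler are actually invoked. Returning `None` rather than unverified code mirrors the abstract's point that only a program checked against its specification reaches the user.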
