Beyond the benchmarks: Understanding the coding personalities of different LLMs

By admin | September 4, 2025


Most reports comparing AI models are based on performance benchmarks, but a recent research report from Sonar takes a different approach: grouping models by their coding personalities and examining the downsides of each when it comes to code quality.

The researchers studied five different LLMs using the SonarQube Enterprise static analysis engine on over 4,000 Java assignments. The LLMs reviewed were Claude Sonnet 4, OpenCoder-8B, Llama 3.2 90B, GPT-4o, and Claude Sonnet 3.7.

They found that the models had distinct traits, such as Claude Sonnet 4 being very verbose in its outputs, producing over 3x as many lines of code as OpenCoder-8B for the same problem.

Based on these traits, the researchers divided the five models into coding archetypes. Claude Sonnet 4 was the “senior architect,” writing sophisticated, complex code, but introducing high-severity bugs. “Because of the level of technical difficulty attempted, there were more of those issues,” said Donald Fischer, a VP at Sonar.

OpenCoder-8B was the “rapid prototyper” because it was the fastest and most concise while also potentially creating technical debt, making it ideal for proof-of-concepts. It produced the highest issue density of all the models, at 32.45 issues per thousand lines of code.
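The "issues per thousand lines of code" metric is simple arithmetic over a static analyzer's output. A minimal sketch (the totals below are hypothetical, chosen only to reproduce the density reported for OpenCoder-8B; they are not Sonar's raw counts):

```python
def issue_density(total_issues: int, total_lines: int) -> float:
    """Return issue density in issues per 1,000 lines of code (KLOC)."""
    return total_issues / (total_lines / 1000)

# Illustrative totals that would yield the reported 32.45 issues/KLOC:
print(round(issue_density(3245, 100_000), 2))  # 32.45
```

Normalizing by KLOC is what makes the density comparable across models that emit very different amounts of code for the same problem.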

Llama 3.2 90B was the “unfulfilled promise,” as its scale and backing imply it should be a top-tier model, but it only had a pass rate of 61.47%. Additionally, 70.73% of the vulnerabilities it created were “BLOCKER” severity, the most severe type of bug, which prevents testing from continuing.

GPT-4o was the “efficient generalist,” a jack-of-all-trades that is a common choice for general-purpose coding assistance. Its code wasn’t as verbose as the senior architect’s or as concise as the rapid prototyper’s, but somewhere in the middle. It also avoided producing severe bugs for the most part, but 48.15% of its bugs were control-flow errors.

“This paints a picture of a coder who correctly grasps the main objective but often fumbles the details required to make the code robust. The code is likely to function for the intended scenario but will be plagued by persistent issues that compromise quality and reliability over time,” the report states.

Finally, Claude 3.7 Sonnet was the “balanced predecessor.” The researchers found that it was a capable developer that produced well-documented code, but it still introduced a number of severe vulnerabilities.

Though the models had these distinct personalities, they also shared similar strengths and weaknesses. The common strengths were that they quickly produced syntactically correct code, had solid algorithmic and data-structure fundamentals, and efficiently translated code between languages. The common weaknesses were that they all produced a high percentage of high-severity vulnerabilities, introduced severe bugs like resource leaks or API contract violations, and had an inherent bias towards messy code.
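A resource leak of the kind the report flags can be shown in miniature. The study analyzed Java, but the same pattern exists in any language with managed handles; here is an illustrative Python version of the bug and the fix static analyzers typically suggest:

```python
def read_config_leaky(path):
    # Resource leak: the file handle is never explicitly closed.
    # If an exception occurs, or the runtime delays garbage collection,
    # the OS-level handle stays open.
    f = open(path)
    return f.read()

def read_config_safe(path):
    # A context manager guarantees the handle is released,
    # even if read() raises.
    with open(path) as f:
        return f.read()

import os, tempfile
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".cfg", delete=False)
tmp.write("debug=true")
tmp.close()
print(read_config_safe(tmp.name))  # debug=true
os.unlink(tmp.name)
```

The leaky version often works fine in a demo, which is exactly why such issues survive until a static analyzer (or production load) surfaces them.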

“Like humans, they become susceptible to subtle issues in the code they generate, and so there’s this correlation between capability and risk introduction, which I think is amazingly human,” said Fischer.

Another interesting finding of the report is that newer models may be more technically capable, but they are also more likely to generate risky code. For example, Claude Sonnet 4 showed a 6.3% improvement over Claude 3.7 Sonnet on benchmark pass rate, but the issues it generated were 93% more likely to be “BLOCKER” severity.

“If you think the newer model is superior, think about it one more time, because newer is not actually superior; it’s injecting more and more issues,” said Prasenjit Sarkar, solutions marketing manager at Sonar.

How reasoning modes impact GPT-5

The researchers followed up their report this week with new data on GPT-5 and how its four available reasoning modes (minimal, low, medium, and high) impact performance, security, and code quality.

They found that increasing reasoning yields diminishing returns on functional performance. Moving from minimal to low raises the model’s pass rate from 75% to 80%, but medium and high only reached pass rates of 81.96% and 81.68%, respectively.
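The diminishing returns are easy to see by computing the marginal gain of each step up in reasoning mode, using the pass rates quoted above:

```python
# Pass rates for GPT-5's four reasoning modes, as reported in the article.
pass_rates = {"minimal": 75.0, "low": 80.0, "medium": 81.96, "high": 81.68}

# Marginal gain (in percentage points) from each step up in reasoning.
modes = list(pass_rates)
for prev, cur in zip(modes, modes[1:]):
    delta = pass_rates[cur] - pass_rates[prev]
    print(f"{prev} -> {cur}: {delta:+.2f} points")
```

The first step buys five points; the next buys under two; the last step is slightly negative, which is the diminishing-returns pattern the researchers describe.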

In terms of security, the high and low reasoning modes eliminate common attacks like path traversal and injection, but replace them with harder-to-detect flaws, like inadequate I/O error handling. The low reasoning mode had the highest percentage of that issue at 51%, followed by high (44%), medium (36%), and minimal (30%).

“We have seen the path traversal and injection become zero percent,” said Sarkar. “We can see that they’re trying to solve one sector, and what’s happening is that while they’re trying to solve code quality, they’re somewhere making this trade-off. Inadequate I/O error handling is another problem that has skyrocketed. If you look at 4o, it has gone 15-20% higher in the newer model.”

There was a similar pattern with bugs: control-flow errors decreased beyond minimal reasoning, but advanced bugs like concurrency and threading issues increased alongside the reasoning effort.

“The trade-offs are the key thing here,” said Fischer. “It’s not as simple as asking, which is the best model? The way this has been viewed in the horse race between different models is which ones complete the most features on the SWE-bench benchmark. As we’ve demonstrated, the models that can do more, that push the boundaries, also introduce more security vulnerabilities, and they introduce more maintainability issues.”


