Google ranks the best AI for building Android apps, and the winner is not Gemini

Google wants software developers to use the best possible AI models when building Android applications; consequently, the company debuted its Android Bench benchmarking portal in March. The service is intended to provide a continuously updated leaderboard that acts as a reference point for developers and model creators.

The leaderboard was updated last week to include open-weight models and add new columns for latency, tokens, and cost.

Model students

Matthew McCullough, Google VP of product for the Android Developer division, writes in a March blog post that Google actively benchmarks top AI LLMs against tests designed to assess how well these tools can build Android apps.

“Our goal is to provide model creators with a benchmark to evaluate LLM capabilities for Android development,” explains McCullough. “By establishing a clear, reliable baseline for what high-quality Android development looks like, we’re helping model creators identify gaps and accelerate improvements, which empowers developers to work more efficiently with a wider range of helpful models to choose for AI assistance, which ultimately will lead to higher-quality apps across the Android ecosystem.”

GPT 5.5 is currently the best AI model for Android

This new service does not appear to provide a historical record of where models have risen and fallen over time, but 9to5Google reports that the last Android Bench ranked Gemini 3.1 Pro alongside OpenAI’s GPT 5.4 as joint leaders.

As of the May 18 update, GPT 5.5 is the best AI model for Android app development.

Google provides an openly accessible explanation of its methodology for Android Bench, which states: “The service evaluates the ability of LLMs to generate code that resolves the issue by presenting them with real-world issues and pull requests from open-source software projects. This approach aims to ensure that the tasks are representative of the challenges developers face daily.”
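To make that evaluation loop concrete, here is a minimal Python sketch of the general idea: hand a model an issue, take the code it returns, and count how often a check passes. It is an illustration under stated assumptions, not Google's harness. The `Task` record, the `resolve_rate` function, the string-matching checks, and the toy model are all hypothetical stand-ins; a real harness would apply a patch to a repository and run the project's tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One benchmark task (hypothetical layout, not Android Bench's schema)."""
    issue_text: str                # the issue description shown to the model
    check: Callable[[str], bool]   # stand-in for running the project's tests


def resolve_rate(tasks: list[Task], generate_patch: Callable[[str], str]) -> float:
    """Percentage of tasks whose check passes on the model's generated code."""
    if not tasks:
        raise ValueError("at least one task is required")
    solved = sum(1 for t in tasks if t.check(generate_patch(t.issue_text)))
    return 100.0 * solved / len(tasks)


# Toy data: the checks only look for a substring, purely for illustration.
tasks = [
    Task("Crash on rotation", lambda patch: "onSaveInstanceState" in patch),
    Task("Deprecated API call", lambda patch: "registerForActivityResult" in patch),
]


def toy_model(issue_text: str) -> str:
    """A fake model that always returns the same snippet."""
    return "override fun onSaveInstanceState(outState: Bundle) { }"


rate = resolve_rate(tasks, toy_model)
print(f"resolved {rate:.1f}% of tasks")
```

Swapping `toy_model` for a call to a real model, and each `check` for a real test run, is where most of the engineering effort in an actual benchmark would go.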

Why did Google build Android Bench?

Google has said it built Android Bench because AI-assisted software engineering “has seen the emergence of several benchmarks” for measuring LLM capabilities. The company has further stated that Android developers “face specific challenges that aren’t covered by existing benchmarks”, so it created a ranking service to focus on a comprehensive assessment of high-quality Android development.

“We created a model-agnostic benchmark to accurately evaluate LLM performance on a variety of Android development tasks,” stated Google. The company further outlined the goals of Android Bench as encouraging LLM improvements for Android development; empowering Android developers to be more productive with a range of “helpful models” for AI assistance; and leading to higher-quality applications across the Android ecosystem.

Do software development benchmarks work?

Developers and model creators will naturally question whether Google’s move to set up this benchmarking is useful. Naysayers might point to Goodhart’s Law, which states that, “When a measure becomes a target, it ceases to be a good measure.” Certainly, any reward system can attract actors who optimize their actions to achieve standardized targets.

Google may have anticipated this pitfall by basing Android Bench on real-world public code repositories.

“We created the benchmark by curating a task set against a range of common Android development areas. It’s composed of real challenges of varying difficulty, sourced from public GitHub Android repositories,” writes Google’s McCullough.

This means the scenarios tested include resolving “breaking changes” across Android releases (when code that previously worked fine breaks because Google has updated Android to a new version), domain-specific tasks such as networking for wearable devices (where high latency and frequent disconnections are a constant threat), migrating to the latest version of Jetpack Compose (Android’s own declarative UI toolkit that uses Kotlin language capabilities), and more.

What other Android benchmarks exist?

Other Android benchmarks include Jetpack Microbenchmark, a library that allows developers to benchmark their Android native code, whether written in Kotlin or Java, from within Android Studio. The sister Jetpack Macrobenchmark is provided to test large-scale user interactions, such as cold app startup time or the fluidity of user interface animations.

Also available in the Android benchmarking space is Firebase Performance Monitoring, a production-level field benchmark that monitors an app’s network requests and screen rendering times; this is more of an application performance monitoring tool.

Within the Android developer community, Android Vitals already provides a dashboard to track app quality metrics such as stability, performance, battery usage, and permission issues. Apptim is a generative AI mobile app profiling and testing tool, so again, performance benchmarking, but not quite the same as Android Bench. We could also mention Google’s own Android Performance Analyzer (APA), which only arrived on 19 May this year and serves as a profiling and performance analysis tool with workflow simplification support.

Andrew Filev, CEO and founder of code orchestration company Zencoder, tells The New Stack that he’s a fan of these systems, with caveats.

“Open benchmarks like Android Bench are great, and we wish there were more of them,” Filev enthuses. “In general terms, software development is too diverse for a single headline score to be universally meaningful: a Python benchmark tells you little about how a model handles Rust, embedded systems, or a mobile app. There’s also a real gap between building an open web app, an internal tool used by a few hundred people, and a multi-tenant product at a global scale, and models don’t perform identically across these domains.”

Consequently, he says, domain-specific benchmarks push model builders to pay attention to the environments their users actually work in, so he thinks that “Google deserves credit here” and hopes other platforms follow.

“The caveat is data contamination. Public repositories leak into training, and we have seen models that cluster within a few points on public evals spread dramatically on private benchmarks built to mimic the same workload,” Filev says. “In our own research, a small change in how we framed test cases shifted the model spread from six percentage points to 26 and completely reordered the rankings. So public benchmarks help improve LLM performance across domains, and private evals help assess real-world performance on your workload.”
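The two quantities Filev describes, the spread between models and the order they finish in, are easy to compute once you have per-model scores. The short Python sketch below shows one way to do it. The model names and scores are invented for illustration; they were chosen only so that the spread moves from 6 to 26 percentage points, and they are not Zencoder's data.

```python
def spread_pp(scores: dict[str, float]) -> float:
    """Gap between the best and worst model, in percentage points."""
    return max(scores.values()) - min(scores.values())


def ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Made-up pass rates (%) for three hypothetical models on two evals.
public_eval = {"model_a": 72.0, "model_b": 70.0, "model_c": 66.0}
private_eval = {"model_a": 41.0, "model_b": 67.0, "model_c": 55.0}

for label, scores in (("public", public_eval), ("private", private_eval)):
    print(label, f"spread={spread_pp(scores):.0f}pp", "ranking:", ranking(scores))
```

In this toy data the models sit within a few points of each other on the public eval, while the private eval both widens the gap and changes which model comes first.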

How Android Bench scores are constructed

Each model’s overall Android Bench score is based on a Google-developed calculation comprising four core values.

The confidence interval (CI) range (%) is a measure of the expected performance range, reflecting the results’ statistical reliability (p-value, 0.05); the average latency score is the time taken to solve 100 tasks across 10 runs; the average total tokens score is a measure of token consumption for a full benchmark run across 10 runs; and the average cost is the cost per benchmark run at the time of testing, in US dollars.
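The article does not spell out Google's exact calculation, so the following Python sketch shows only how values of this kind could be derived from repeated runs. The `RunResult` record, the field names, and the sample numbers are hypothetical, and the interval uses a simple normal approximation (z = 1.96 for a 95% interval) that is an assumption rather than Android Bench's documented method.

```python
import math
import statistics
from dataclasses import dataclass

TASKS_PER_RUN = 100  # the article cites 100 tasks per run


@dataclass
class RunResult:
    """One full benchmark run (hypothetical record layout)."""
    tasks_solved: int   # tasks resolved out of TASKS_PER_RUN
    latency_s: float    # wall-clock time for the whole run, in seconds
    total_tokens: int   # tokens consumed during the run
    cost_usd: float     # cost of the run in US dollars


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate runs into a mean score, a 95% CI half-width, and averages."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate an interval")
    scores = [100.0 * r.tasks_solved / TASKS_PER_RUN for r in runs]
    # Normal-approximation interval on the mean score (an assumption).
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return {
        "score_pct": statistics.mean(scores),
        "ci_half_width_pct": 1.96 * std_err,
        "avg_latency_s": statistics.mean([r.latency_s for r in runs]),
        "avg_total_tokens": statistics.mean([r.total_tokens for r in runs]),
        "avg_cost_usd": statistics.mean([r.cost_usd for r in runs]),
    }


# Ten made-up runs; only the solved-task counts vary.
solved_counts = [70, 72, 68, 71, 69, 70, 73, 67, 70, 70]
runs = [RunResult(n, 3600.0, 2_000_000, 25.0) for n in solved_counts]
summary = summarize(runs)
print(summary)
```

Reporting the interval alongside the score matters when comparing models: if two models' intervals overlap, the difference between their headline scores may not be meaningful.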

The test harness for Android Bench is publicly available on GitHub.

Adrian Bridgwater is a technology journalist with three decades of press experience. He has an extensive background in communications, starting in print media, newspapers and also television. Primarily working as an analysis writer dedicated to a software application development ‘beat’,…
