Image created with Flux Pro v1.1 Ultra. Image prompt: Giant “100” as pure white negative‑space cutout dominating the frame; minimalist poster style; tidy leaderboard bars rising cleanly behind the cutout with a trophy silhouette; steel‑blue backdrop; high contrast, crisp edges, soft studio light, no other text, no logos

“Put this shirt on him” Gemini 2.5 Flash Image Previously nano-banana https://x.com/skirano/status/1960343968320737397

🍌 nano banana is here → gemini-2.5-flash-image-preview – SOTA image generation and editing – incredible character consistency – lightning fast available in preview in AI Studio and the Gemini API https://x.com/googleaistudio/status/1960344388560904213

🚨🍌Big Reveal: who was “Nano Banana”? The anonymous model, “nano-banana,” that caught the world’s attention with its ability to follow complex instructions, preserve character identity, and maintain contextual details was: Gemini-2.5-Flash-Image-Preview by @GoogleDeepMind 🍌✨ https://x.com/lmarena_ai/status/1960342813599760516

🚨🍌Breaking News: Gemini-2.5-Flash-Image-Preview (“nano-banana”) by @GoogleDeepMind now ranks #1 in Image Edit Arena. In just two weeks: 🟡“nano-banana” has driven over 5 million community votes in the Arena 🟡Record-breaking 2.5M+ votes cast for this model alone 🟡It has https://x.com/lmarena_ai/status/1960343469370884462

A conversation with some of the research folks behind nano-banana 🍌 (aka Gemini 2.5 Flash Image) on how we got here, what it took to build this model, and where we go next! So much fun to hang with: @19kaushiks @robertriachi @m__dehghani @nbrichtova https://x.com/OfficialLoganK/status/1960725463694753930

An example of the new Google image generator. I gave it a random picture I took: “make this a napoleon crochet book instead” (note it made changes in consistent style) “there should be a tiny sheep hidden among the blue yarn on the shelf to the right” “you misspelled Napoleon” https://x.com/emollick/status/1960368483754992051

Edit your photos to match your imagination with a new image editing model, now available in the @GeminiApp 🖼️ Learn more ⬇️ https://x.com/Google/status/1960342356881723469

First time I’ve seen Google’s blog unable to handle the traffic. Anyway, proud to be a launch partner with @GoogleDeepMind for Gemini 2.5 Flash Image, the first image-gen model on @OpenRouterAI! https://x.com/xanderatallah/status/1960358164693438934

Gemini / Nano Banana shows **remarkable** spatial understanding of images. I recursively asked it to “make an image of the guy taking the photo” (of the previous photo). Each time it adds a guy, it **gets their POV correct**, where they would actually be to capture it. https://x.com/BenjaminDEKR/status/1960566924884029539

Gemini 2.5 Flash creates an actually mildly amusing New Yorker cartoon. (As far as I can tell, this is the first time that this joke has been used) https://x.com/emollick/status/1960574571255304334

Google’s Gemini 2.5 Flash Image (Nano-Banana) takes the crown as the leading image editing model, beating GPT-4o and Qwen-Image-Edit in the Artificial Analysis Image Editing Arena! We were given early access and have been testing it in our arena under the pseudonym ‘rex’ for the https://x.com/ArtificialAnlys/status/1960388401401880898

i put nano banana into a browser extension so you can remix and edit any image on the web by just going right-click + prompt link below https://x.com/fabianstelzer/status/1960649240100647278

Image generation with Gemini just got a bananas upgrade: it’s the new state-of-the-art image generation and editing model. 🤯 From photorealistic masterpieces to mind-bending fantasy worlds, you can now natively produce, edit and refine visuals with new levels of reasoning, https://x.com/GoogleDeepMind/status/1960341906790957283

Introducing Gemini 2.5 Flash Image (aka nano-banana), our SOTA image generation and editing model 🍌 As you might have already seen, this model excels at character consistency, creative edits, and has Gemini’s world knowledge! https://x.com/OfficialLoganK/status/1960343135436906754

Introducing Gemini 2.5 Flash Image Preview, our best image generation and editing model! With conversational editing, multi-image composition. Test it now in @googleaistudio! 🍌 🎉 – Maintain character consistency across multiple prompts and images. – Targeted edits/replacing https://x.com/_philschmid/status/1960344024151199765

Introducing Gemini 2.5 Flash Image, our state-of-the-art image model – Google Developers Blog https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/

love these “what does the red arrow see” google maps transforms with nano-banana https://x.com/tokumin/status/1960583251460022626

Nano Banana Hackathon 🍌 2 Days API Free Tier Prizes & Credits Other cool Gemini stuff : ) See you next week, it’s time to build!!! https://x.com/OfficialLoganK/status/1961127857192673540

Nano banana is a genuinely impressive jump forward in AI image generation, a field I have followed closely. Whenever it is officially released, by whichever firm created it, I think it will have a significant impact on the applicability of AI image generation for real-world tasks https://x.com/emollick/status/1959727818255765933

Nano banana turns out to be Gemini Flash 2.5 Image Generation (not quite as catchy a name). I had a bit of early access, first through the same LMArena link everyone had, then privately. It is impressive, crossing a threshold that goes beyond toy (though is a pretty fun toy too) https://x.com/emollick/status/1960344601023168529

Nano Banana: Image editing in Google Gemini gets a major upgrade https://blog.google/products/gemini/updated-image-editing-model/

Not sure what happened at the end😱 Gemini 2.5 Flash gives us some interesting angles from a single image, while Kling 2.1’s first and last frames deliver those smooth transitions. https://x.com/heyglif/status/1960760956692136425

Ruining art with Gemini 2.5 Flash. (These are all the prompts, in their entirety) “make this painting less gloomy” “it is still pretty disturbing, make it less gloomy emotionally” “even less gloomy” https://x.com/emollick/status/1960717000092549349

Ruining even more art with Gemini 2.5 Flash. https://x.com/emollick/status/1960752417424990661

Since nano banana has gemini’s world knowledge, you can just upload screenshots of the real world and ask it to annotate stuff for you. “you are a location-based AR experience generator. highlight [point of interest] in this image and annotate relevant information about it.” https://x.com/bilawalsidhu/status/1960529167742853378

The new Gemini 2.5 image model🍌is by far the best out there with a whopping +180 ELO point lead in image editing & it really excels at character consistency. Available for free in the @GeminiApp right now. Try uploading an image & playing around with it, it’s pretty amazing! https://x.com/demishassabis/status/1960355658059891018
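An Elo gap maps directly onto an expected head-to-head win rate, which makes the +180 figure above concrete. A quick sketch using the standard logistic Elo expectation (the 180-point gap is from the post; the formula is the textbook one, not anything specific to the arena's implementation):

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected win probability for the higher-rated model,
    given its Elo advantage, under the standard logistic model."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))

# A +180 Elo lead implies roughly a 74% expected win rate head-to-head.
print(round(elo_expected_score(180), 3))
```

In other words, if the gap is real, voters would be expected to prefer the leading model in roughly three out of four pairwise comparisons.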

Two things I love coming together: Liverpool and Theme Park. From map to graphic – Nano-Banana is 🔥. Going to have to make a new isometric game, can’t resist… https://x.com/demishassabis/status/1961077016830083103

We’ve created a new prompting guide for Gemini 2.5 Flash Image to help you build solutions that use the model’s key capabilities like creative composition, consistent character design, targeted transformations, and more. Read the guide: https://x.com/googleaidevs/status/1960765662202061223

We’ve just upgraded Gemini 2.5 Flash image generation & editing! 🍌🍌🍌 Besides topping leaderboards, it topped my model usage this month. It keeps subjects consistent, you can make precise edits & combine creative elements. Have fun with it @GeminiApp @GoogleAIStudio https://x.com/OriolVinyalsML/status/1960343791283433842

We’ve tested hundreds of models at lmarena… never before has model hype brought in 2 MILLION new chats in one day. 🍌 nano banana by @GoogleDeepMind is here. https://x.com/cdngdev/status/1960355432037560697

Yes gemini 2.5 flash is bananas. Had early access for a bit and the spatial consistency + character coherence is impressive. The bigger deal here is that entire Photoshop & ComfyUI workflows are being collapsed down to a prompt. Closest thing we have to an image editor as an https://x.com/bilawalsidhu/status/1960377889112862766

Grok 2 from @xai has just been released on @huggingface: https://x.com/ClementDelangue/status/1959356467959439464

Grok-2 has been “open sourced” but has one of the worst licenses of any recent major open weights release. Given that it’s already quite outdated by the time they’ve got around to releasing it, combined with the license, this will see little use. It’s dead on arrival. https://x.com/xlr8harder/status/1959490601264533539

Pretty cool that they open sourced the actual full-sized production model. Here’s the Grok 2.5 architecture overview next to a roughly similarly sized Qwen3 model. The MoE residual is quite interesting. Kind of like a shared expert. I don’t think I’ve seen this setup before. https://x.com/rasbt/status/1959643038268920231

The @xAI Grok 2.5 model, which was our best model last year, is now open source. Grok 3 will be made open source in about 6 months. https://x.com/elonmusk/status/1959379349322313920

xAI just released Grok 2 on Hugging Face. This massive 500GB model, a core part of xAI’s 2024 work, is now openly available to push the boundaries of AI research. https://x.com/HuggingPapers/status/1959345658361475564

xai-org/grok-2 · Hugging Face https://huggingface.co/xai-org/grok-2

Grok now has a model card – which is a big step forward! But it is light on details, with unexplained results. Some examples: if the MASK measurement is the same as in the source paper, .43 would be a fairly high level of deception, also the sycophancy score is hard to interpret https://x.com/emollick/status/1959116132096336066

OpenAI just released HealthBench on Hugging Face. This new dataset is designed for rigorously evaluating large language models’ capabilities in improving human health. A vital step for AI in medicine! https://x.com/HuggingPapers/status/1960749923218895332

“Hey AI, give me a clever, moving one paragraph story about a paradox, in any genre you desire. make it good” These are the first attempts. A bit of the obvious time travel tales from Gemini and Grok. Claude loves to pull on your emotions. GPT-5 Pro goes in a stranger direction. https://x.com/emollick/status/1959817825729781837

One of the most pressing questions in our AI Evals course is: “Why can’t I just have an LLM write my LLM pipeline?” The nuanced answer is that you can use LLMs to assist, but not for the whole pipeline. Knowing where to put the LLM in the loop is the hard part. To unpack this, https://x.com/sh_reya/status/1961110090314125524

GLM-4.5 is now leading the Berkeley Function-Calling Leaderboard V4. Full results: https://x.com/Zai_org/status/1961149535754858586

The wild swings on X between “insane hype” and “it’s over” with each new AI release obscure a pretty clear situation: over the past year there seems to be continuing progress on meaningful benchmarks at a fairly stable, exponential pace, paired with significant cost reductions https://x.com/emollick/status/1958365687543484445

Wow glad to see vLLM powers @jiawzhao ‘s DeepConf work, impressive results on AIME 2025! Do you think this sampling control makes sense? Have a try and leave a comment in that PR https://x.com/vllm_project/status/1959277423729500565

Wow, a 40% increase with dspy.GEPA in just 500 metric calls! The optimized prompt is a 100-line illustrated process. https://x.com/DSPyOSS/status/1960000178179527110

Our early studies (and many others) found 20-30% productivity gains in controlled experiments in fields ranging from consulting to coding But translating gains to the organizational level takes time, and leadership. I wrote about many of the reasons here. https://x.com/emollick/status/1958350546831630810

weekly tokens processed went from ~111B to 3.21T in 1 year *on OpenRouter https://x.com/scaling01/status/1960113882607067569
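For scale, that jump is roughly a 29x year-over-year increase. A back-of-the-envelope check, using only the two numbers quoted in the post above:

```python
# Weekly token volumes on OpenRouter, as reported in the post above.
weekly_tokens_last_year = 111e9   # ~111B tokens/week
weekly_tokens_now = 3.21e12       # ~3.21T tokens/week

growth = weekly_tokens_now / weekly_tokens_last_year
print(f"{growth:.1f}x year-over-year growth")  # ~28.9x
```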

We have a new land-speed record (hardware subtleties in build quality are fun) ~13GB/s read time off Thunderbolt4 RAID0 cluster cc @Prince_Canuma @ivanfioravanti https://x.com/TheZachMueller/status/1959730569195016589

vibe coding live with @kat_kampf, @ammaar, and nano-banana at 10 am pt, see you there!! https://x.com/OfficialLoganK/status/1960365102940500100

open source nano banana? bytedance just dropped USO, an open source editing model that… just works https://x.com/multimodalart/status/1961147988258295893

i don’t want much. i just want nano banana for video. https://x.com/bilawalsidhu/status/1958666988043346175

Fourth model launch of the day 🔥 – introducing Hermes 4, from @NousResearch Hermes 4 is trained for steerability and lower refusal rates, topping RefusalBench and beating Grok 4 https://x.com/OpenRouterAI/status/1960436262923592065

🚨 New benchmark release 🚨 We’re introducing Research-Eval: a diverse, high-quality benchmark for evaluating search-augmented LLMs 👉 Blogpost: https://x.com/RekaAILabs/status/1961192688029765936

Results: Jet-Nemotron-2B outperforms or matches small full-attention models on MMLU, MMLU-Pro, BBH, math, commonsense, retrieval, coding, and long-context tasks. All this while delivering up to 47x decoding throughput at 64K and as high as 53.6x decoding and 6.14x prefilling https://x.com/omarsar0/status/1960724855709688053

Breaking: GPT-5 ranked 🥇 on Humanity’s Last Exam and 🥈 on MultiChallenge SEAL Leaderboards. https://x.com/scale_AI/status/1953591873031090505

AI progress/plateau discussions continue to be so weird because on every well-designed quantitative benchmark AI progress remains very much on the exponential track it has been on for the past year or more. But both AI hype & slowdown narratives tend to focus on vibes, instead. https://x.com/emollick/status/1959700023781777912

Stripping code formatting cuts LLM token cost without hurting accuracy. Average input tokens drop by 24.5%, with output quality basically unchanged. The core issue is simple: indentation, spaces, and newlines help humans read, but they inflate the tokens that models pay to process. https://x.com/rohanpaul_ai/status/1959634301932523958
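A minimal sketch of the idea, not the paper's actual pipeline: collapse indentation and drop blank lines before sending code to a model. Character count is used here as a crude proxy for token count; a real pipeline would measure with the model's tokenizer and use a language-aware minifier.

```python
def strip_formatting(code: str) -> str:
    """Collapse leading indentation and drop blank lines.
    A crude stand-in for the formatting-removal idea."""
    lines = [line.strip() for line in code.splitlines()]
    return "\n".join(line for line in lines if line)

snippet = """
def add(a, b):
    # add two numbers
    result = a + b

    return result
"""

stripped = strip_formatting(snippet)
# Fewer characters generally means fewer tokens under BPE tokenizers.
print(len(snippet), "->", len(stripped))
```

The logic survives intact (the statements and their order are unchanged); only the human-facing layout is gone, which is exactly what the quoted result says models can tolerate.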

It appears that the marginal energy used by a standard prompt from a modern LLM is relatively established at this point, roughly 0.0003 kWh (8-10 seconds of streaming Netflix). Water is more complicated (0.25 mL to 5 mL+), depending on definitions. Training resources are less clear. https://x.com/emollick/status/1959989512228208785
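Taking the post's ~0.0003 kWh/prompt figure at face value, the arithmetic is easy to sanity-check. The streaming rate below is back-derived from the "8-10 seconds of Netflix" comparison in the post, not an independent measurement:

```python
kwh_per_prompt = 0.0003        # marginal energy per prompt (figure from the post)
netflix_seconds_equiv = 9      # midpoint of the 8-10 second comparison

netflix_kwh_per_second = kwh_per_prompt / netflix_seconds_equiv
prompts_per_kwh = 1 / kwh_per_prompt

print(f"{prompts_per_kwh:.0f} prompts per kWh")  # ~3333
print(f"{netflix_kwh_per_second * 3600:.2f} kWh per hour of streaming")
```

At that rate a single kilowatt-hour covers a few thousand prompts, which is why the per-prompt marginal cost debate keeps coming back to training and embodied costs instead.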


Discover more from Ethan B. Holland
