congrats to @FakePsyho for claiming the top spot on the @atcoder World Finals programming competition (followed by OpenAI at #2)! / X https://x.com/gdb/status/1945553676321657127
Congrats to @FakePsyho for winning AtCoder World Tour Finals 2025 Heuristic 🚀 Humanity has prevailed (for now!) Thanks OpenAI for sponsoring #AWTF2025, and getting #2 on this grand challenge. Proud of @SakanaAILabs & @AtCoder’s ALE-Agent for reaching #5, on a shoestring budget! / X https://x.com/hardmaru/status/1945850637528490134
good job psyho / X https://x.com/sama/status/1945540005805658440
official results from @atcoder World Tour Finals are in — great results for both humans (#1 and #3 onwards) and AI (#2 in the world!). a milestone for AI for solving hard problems. / X https://x.com/gdb/status/1945989983569129632
RT @FakePsyho: Humanity has prevailed (for now!) I’m completely exhausted. I figured, I had 10h of sleep in the last 3 days and I’m barely… / X https://x.com/itsclivetime/status/1945590725279977900
we’re competing in the @atcoder World Finals programming contest. real nailbiter — OpenAI has been #1 for most of the contest. looked like it might be over when @FakePsyho pulled ahead, but we’ve just retaken the lead. 1 hour and 20 minutes to go! https://x.com/gdb/status/1945404295794610513
OpenAI’s Agent mode can now work with Spreadsheets achieving 45% on SpreadsheetBench https://x.com/scaling01/status/1945896464632148366
OpenAI on X: “We’ve decided to treat this launch as High Capability in the Biological and Chemical domain under our Preparedness Framework, and activated the associated safeguards. This is a precautionary approach, and we detail our safeguards in the system card. We outlined our approach on” / X
https://x.com/OpenAI/status/1945904754443669659
Preparing for future AI capabilities in biology | OpenAI
https://openai.com/index/preparing-for-future-ai-capabilities-in-biology/
RT @boazbaraktcs: ChatGPT Agent is the first model we classified as “High” capability for biorisk. Some might think that biorisk is not r… / X https://x.com/jekbradbury/status/1945944398199677016
🚨 BREAKING: @Kimi_Moonshot’s Kimi-K2 is now the #1 open model in the Arena! With over 3K community votes, it ranks #5 overall, overtaking DeepSeek as the top open model. Huge congrats to the Moonshot team on this impressive milestone! The leaderboard now features 7 different https://x.com/lmarena_ai/status/1945866381880373490
5 Things You Need to Know About Moonshot AI and Kimi K2, the New #1 model on the Hub https://huggingface.co/blog/fdaudens/moonshot-ai-kimi-k2-explained
Every ML Engineer’s dream loss curve: “Kimi K2 was pre-trained on 15.5T tokens using MuonClip with zero training spike, demonstrating MuonClip as a robust solution for stable, large-scale LLM training.” https://x.com/hardmaru/status/1943976259236901315
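Context for readers: MuonClip pairs the Muon optimizer with a “QK-clip” step that rescales a head’s query/key projection weights whenever its max attention logit exceeds a threshold, which is what keeps logits (and the loss curve) from spiking. A minimal sketch of the QK-clip idea in PyTorch; `tau` and `alpha` are assumed illustrative values, not Moonshot’s published hyperparameters:

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float,
             tau: float = 100.0, alpha: float = 0.5) -> None:
    """Illustrative QK-clip step for one attention head (in-place).

    tau and alpha are assumed values for illustration. If this head's max
    attention logit exceeded tau during the last step, shrink the query/key
    projection weights so logits stay bounded without touching the rest of
    the model.
    """
    if max_logit > tau:
        gamma = tau / max_logit           # total shrink factor for q·k logits
        w_q.mul_(gamma ** alpha)          # split the shrink between W_q...
        w_k.mul_(gamma ** (1.0 - alpha))  # ...and W_k
```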
For those unfamiliar with Kimi K2: – Surpasses models like GPT-4.1 and Claude 4 Opus on coding benchmarks – Scores new highs on math and STEM tests among non-reasoning systems – Doesn’t even have multimodal or reasoning capabilities yet kimi [dot] com https://x.com/rowancheung/status/1944647747027558636
I think I will spend the rest of the day letting Kimi generate these reports. They are so nice to look at compared to what OpenAI, Anthropic and others give you https://x.com/scaling01/status/1944850575470027243
It’s so beautiful to see the @Kimi_Moonshot team participating in every single community discussion and pull request on @huggingface (the little blue bubbles on the right). In my opinion, every serious AI organization should dedicate meaningful time and resources to this https://x.com/ClementDelangue/status/1946208120385999328
It’s undeniable: with Kimi-K2, China has reached the frontier and will surpass the US next year / X https://x.com/scaling01/status/1944045857340359044
Kimi has a distinct writing style that is free of most of the patterns we now associate with AI generated text. Both Kimi and DeepSeek’s prose is apparently even more impressive in Chinese. Both of these models have a unique ‘voice’, quite different from Western AI. https://x.com/AndrewCurran_/status/1944434569899290839
Kimi is 200 people, very few of them with “frontier experience”, a platform (but you can buy such data) and a modest GPU budget. In theory there are many dozens of business entities that could make K2 in the West. It’s telling how none did. Not sure what it’s telling tho. / X https://x.com/teortaxesTex/status/1944856509734961596
Kimi is a really weird model, and it needs a lot more testing to figure out For example, I gave it an altered version of Great Gatsby and it found the two alterations (as does Claude) but then made up a ton of hallucinated nonsense that sounded plausible but was just plain wrong https://x.com/emollick/status/1944974487369158864
Kimi K2 is an incredible model. / X https://x.com/skirano/status/1944123290525831317
Kimi K2 is now available on https://x.com/togethercompute/status/1944952034840732138
Kimi K2 is number one trending on HF, congrats! https://x.com/huggingface/status/1944155602583691492
Kimi K2 is so good at tool calling and agentic loops, can call multiple tools in parallel and reliably, and knows “when to stop”, which is another important property. It’s the first model I feel comfortable using in production since Claude 3.5 Sonnet. https://x.com/skirano/status/1944475540951621890
Kimi K2 just hit #1 on @huggingface trending models in <24 hours! This MoE powerhouse packs 1T params with 32B active – crushing coding challenges and autonomous agent tasks. https://x.com/fdaudens/status/1943996876778614948
Kimi K2 now on https://x.com/togethercompute/status/1945143838911128019
Kimi K2, the latest from @Kimi_Moonshot is now live in the Arena! https://x.com/lmarena_ai/status/1944827675597791456
Kimi K2: Open Agentic Intelligence https://moonshotai.github.io/Kimi-K2/
Kimi team is more american than most American labs lol / X https://x.com/Teknium1/status/1944430651278537098
Kimi team just trained a state of the art open source model 32B active parameter/1T total with 0 training instabilities, thanks to MuonClip, this is amazing https://x.com/eliebakouch/status/1943687750563004801
Kimi-k2 seems to be a very good (and giant & odd) open weights model that may be the new leader in open LLMs. It is not beating the frontier closed models on my weird tests, but it doesn’t have a reasoner yet. More testing needed but Chinese open weights models are impressive. https://x.com/emollick/status/1943901440453259374
past week had huuuge releases, here’s our picks 🔥 > moonshot released Kimi K2, sota LLM with 1T total 32B active parameters 🤯 > @huggingface released SmolLM3-3B, best LM for its size, offers thinking mode 💭 as well as the dataset, smoltalk2 > Alibaba released WebSailor-3B, https://x.com/mervenoyann/status/1944757807191888080
Pretty wild that @Kimi_Moonshot dropped a 1T parameter (32B active) MoE trained on 15.5 Trillion tokens – MIT licensed 🔥 Beats all other open weights models across coding, agentic and reasoning benchmarks Of course live on Hugging Face! 🤗 https://x.com/reach_vb/status/1943703030026641801
RT @ArtificialAnlys: While Moonshot AI’s Kimi k2 is the leading open weights non-reasoning model in the Artificial Analysis Intelligence In… / X https://x.com/zacharynado/status/1944945039647629548
RT @DeepInfra: Moonshot AI’s Kimi 2 is now live on DeepInfra, as always at the best price of $0.55/$2.20, full tool call and context suppor… / X https://x.com/jeremyphoward/status/1944939322735780260
RT @htihle: Results from kimi-k2 on WeirdML! It does very well for a non-reasoning model. Like a scaled up deepseek-v3, beating out gpt-4.1… / X https://x.com/bigeagle_xd/status/1944325829657554962
RT @huggingface: Kimi K2 is number one trending on HF, congrats! https://x.com/_akhaliq/status/1944159007456784512
RT @ivanfioravanti: Kimi-Dev-72B-4bit-DWQ is on mlx-community! It took 9 hours to create 😅 Quick performance test on M3 Ultra: Prompt: 56… / X https://x.com/awnihannun/status/1944108947411284374
RT @Kimi_Moonshot: 🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model 🔹 SOTA on SWE Bench Verified, Tau2 & Ace… / X https://x.com/stanfordnlp/status/1944114320226263165
RT @koltregaskes: Kimi-K2 tops EQ-Bench, the benchmark that measures emotional intelligence. https://x.com/jeremyphoward/status/1944326479246147899
RT @lmarena_ai: 🚨 BREAKING: @Kimi_Moonshot’s Kimi-K2 is now the #1 open model in the Arena! With over 3K community votes, it ranks #5 over… / X https://x.com/Kimi_Moonshot/status/1945897926796185841
RT @lmarena_ai: Kimi K2, the latest from @Kimi_Moonshot is now live in the Arena! https://x.com/Kimi_Moonshot/status/1945462820147249523
RT @masondrxy: New K2 model from @Kimi_Moonshot is officially supported by @LangChainAI on @GroqInc! See 👇 https://x.com/Hacubu/status/1945144499228811676
RT @OpenRouterAI: Kimi K2 is now passing 200 tokens per second on OpenRouter. Props to @GroqInc! / X https://x.com/JonathanRoss321/status/1945779694256722025
RT @reach_vb: LOVE ITT! You can run Kimi K2 (1T token MoE) on a single M4 Max 128GB VRAM (w/ offloading) or a single M3 Ultra (512GB) 🔥 Th… / X https://x.com/reach_vb/status/1944997786329460978
RT @sam_paech: Kimi-K2 just took top spot on both EQ-Bench3 and Creative Writing! Another win for open models. Incredible job @Kimi_Moonsh… / X https://x.com/Teknium1/status/1944285648825069759
RT @sdrzn: Seriously blown away by Moonshot’s new Kimi K2 model in @cline. It beats Claude Opus 4 on coding benchmarks and is up to 90% che… / X https://x.com/ClementDelangue/status/1946316382313869778
RT @weights_biases: NEW: Kimi K2 is now live on W&B Inference by @CoreWeave! It’s the first truly open challenger, ready for production wi… / X https://x.com/l2k/status/1945225318928634149
Seen many people mention how Kimi K2, for example, has no CoT or thinking, which isn’t true; it’s more of an issue with terminology. The main difference from reasoning models (in terms of actual functionality) is that the thinking is hidden during general non-verifiable RL, so the model can / X https://x.com/Grad62304977/status/1944050338551484702
Some thoughts on the decisions behind Kimi K2’s architecture – from our infra staff / X https://x.com/Kimi_Moonshot/status/1944589115510734931
Thank you to @Kimi_Moonshot for quickly addressing my queries on the correct system prompt for Kimi K2! We’ll be re-uploading all BF16 + dynamic @unslothai GGUFs with fixed tool calling & the new sys prompt! Sys prompt = “You are Kimi, an AI assistant created by Moonshot AI.” / X https://x.com/danielhanchen/status/1946163064665260486
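For reference, a minimal sketch of wiring that system prompt into a chat call against Moonshot’s OpenAI-compatible API. The base URL and model id here are assumptions; verify both on Moonshot’s platform docs:

```python
from openai import OpenAI

# Assumed endpoint and model id for Moonshot's OpenAI-compatible API;
# confirm current values in their documentation before use.
client = OpenAI(
    base_url="https://api.moonshot.ai/v1",
    api_key="YOUR_MOONSHOT_API_KEY",
)

resp = client.chat.completions.create(
    model="kimi-k2-0711-preview",  # assumed K2 model id
    messages=[
        # The system prompt Moonshot confirmed in the tweet above:
        {"role": "system",
         "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "Summarize MuonClip in one sentence."},
    ],
)
print(resp.choices[0].message.content)
```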
That’s from Kimi K2 blog post. In case someone says «wow and it’s not RL-trained». It very much is, don’t get misled by the absence of long CoT. Looks like DeepResearch, but it’s probably similar to what’s been happening since Sonnet 3.5, giving it uncanny «pre-reasoner» powers. https://x.com/teortaxesTex/status/1944416704253018372
The success of Kimi K2 is no accident. The unfortunate reality in AI is that user experiences haven’t yet fully caught up to raw model capabilities. Experiences have plateaued. There are only so many coding assistants, research tools, or agents you can realistically offer, and https://x.com/skirano/status/1945505132323766430
TheZvi’s answer “why isn’t there American Kimi” basically: incentives. I *partially* buy it. But given the Concern about the dominance of Chinese open models, expressed by numerous patriotic think tanks, I think we could expect *someone* rising to the task. https://x.com/teortaxesTex/status/1945624983985639487
This is what 200 tokens/second looks like with Kimi K2 on @GroqInc For reference, Claude Sonnet-4 is usually delivered at ~60 TPS https://x.com/cline/status/1945354314844922172
True, the first ever application of Muon was to break the 3-second barrier in the CIFAR-10 speedrun. For perspective on scale, that was a 3e14 FLOP training run; @Kimi_Moonshot’s K2 is 3e24 FLOPs, 10 orders of magnitude larger. https://x.com/kellerjordan0/status/1945701578645938194
We’ve just fixed 2 bugs in Kimi-K2-Instruct huggingface repo. Please update the following files to apply the fix: – tokenizer_config.json: update chat-template so that it works for multi-turn tool calls. – tokenization_kimi.py: update encode method to enable encoding special… / X https://x.com/Kimi_Moonshot/status/1945050874067476962
We’ve submitted Kimi K2 to @lmarena_ai. Waiting to be added to the match pool: https://x.com/Kimi_Moonshot/status/1944754256059453823
You might not have heard of Moonshot AI, but within 24 hours, their Kimi K2 model shot to the top of the Hugging Face trending models. So… who are they, and why does this matter? 🧵Here are a few standout facts: / X https://x.com/fdaudens/status/1945128932040208867
Kimi K2 at 185 t/s (or even higher, nearly 220 in my short tests) is probably the best use of Groq to date, and can make K2 immediately more compelling than Sonnet 4. Impressive that they’ve managed to fit this 1T monster on their chips. https://x.com/teortaxesTex/status/1944950183051321542
Quick start project for Claude Code on Kimi: / X https://x.com/jeremyphoward/status/1944326308210921652
Very interesting – you can use Kimi with the Anthropic API. This means, perhaps most importantly, that you can now use Kimi with Claude Code! 🤯 https://x.com/jeremyphoward/status/1944322841866125597
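A hedged sketch of what this looks like with the official `anthropic` Python SDK pointed at an assumed Anthropic-compatible Kimi endpoint (Claude Code can likewise be redirected via its `ANTHROPIC_BASE_URL` environment variable). The URL and model id below are assumptions, so check Moonshot’s docs:

```python
from anthropic import Anthropic

# Assumed Anthropic-compatible endpoint for Kimi; the URL and model id
# may differ, so confirm them in Moonshot's documentation.
client = Anthropic(
    base_url="https://api.moonshot.ai/anthropic",
    api_key="YOUR_MOONSHOT_API_KEY",
)

msg = client.messages.create(
    model="kimi-k2-0711-preview",  # assumed model id
    max_tokens=512,
    messages=[{"role": "user",
               "content": "Write a haiku about open-weight models."}],
)
print(msg.content[0].text)
```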
RT @allhands_ai: Kimi-K2 is definitely the first strong open-weight competitor to Claude Sonnet. 65.4% on SWE-Bench Verified in OpenHands,… / X https://x.com/TheZachMueller/status/1945545349352829439
The DeepSeek moment was supercharged by pent-up consumer demand for a good free AI for those who wouldn’t pay (especially for students for homework). A reason Kimi K2 has not had the immediate public impact of DeepSeek may be that, for most consumers/students, DeepSeek is good enough / X https://x.com/emollick/status/1944764085741957153
RT @yawnxyz: Kimi K2 is **INCREDIBLE** at using tools. I built a chrome extension to chat with Google Maps, but I never posted it. All th… / X https://x.com/bigeagle_xd/status/1945087963408351728
I’ve been a bit quiet on X recently. The past year has been a transformational experience. Grok-4 and Kimi K2 are awesome, but the world of robotics is a wondrous wild west. It feels like NLP in 2018 when GPT-1 was published, along with BERT and a thousand other flowers that https://x.com/DrJimFan/status/1944443447953498285
I doubt that Sama’s delay of open model is about Kimi. But I don’t find the logic here compelling either. «Only nerds noticed Kimi». Well, Sama is loathed. The point of his model is, above all things, PR. If it’s not open SOTA, reports will notice *that*. I think he wants SOTA. https://x.com/teortaxesTex/status/1944263611398180954
Rumors that OpenAI delayed their open-source model because of Kimi are fun, but from what I hear: – the model is much smaller than Kimi K2 (<< 1T parameters) – super powerful – but due to some (frankly absurd) reason I can’t say, they realized a big issue just before release, so… / X https://x.com/Yuchenj_UW/status/1944235634811379844
Super excited to see Kimi K2 land on Perplexity. If you’re fine-tuning, quick reminder: using the Muon optimizer during both fine-tuning and RL phases gives the best results (details are in our Moonlight paper). / X https://x.com/Kimi_Moonshot/status/1944224975428497549
Grok 4 suggests that scaling still works (with the diminishing returns predicted by the scaling law), and that tool use can unlock performance gains. Kimi suggests there continue to be big opportunities from improvements in methods (Muon, etc.). Lots of paths for AI right now. / X https://x.com/emollick/status/1944306918631018856
ChatGPT Agent has lower performance than o3 on PaperBench, SWE-Bench verified, OpenAI PRs and OpenAI Research Engineer Interview questions https://x.com/scaling01/status/1945932154455695752
What if you could ask a chatbot a question the size of an entire encyclopedia—and get an answer in real time? Multi-million token queries with 32x more users are now possible with Helix Parallelism, an innovation by #NVIDIAResearch that drives inference at huge scale. 🔗 https://x.com/NVIDIAAIDev/status/1942389449498787920
lmarena.ai on X: “🚨 Breaking News: Grok 4’s result is now live! With 4k+ community votes, xAI’s Grok-4 tied for #3 overall in Text Arena — a huge leap from Grok-3. It scores Top-3 across all categories (#1 in Math, #2 in Coding, #3 in Hard Prompts). Detailed analysis in the thread 🧵 https://t.co/GjOTqHrUKc” / X
https://x.com/lmarena_ai/status/1945146348203905063
A few quick observations on Grok 4: 1) Hidden CoT with very little information in the reasoning trace 2) Uses web search a lot (not just searching X) 3) Have not seen it use code to run calculations or solve non-coding problems yet, generally less aggressive about tools than o3 / X https://x.com/emollick/status/1943193331934052827
Among other things with the Grok 4 launch, it will be interesting to see how you demo a (presumably) very smart model. We are getting to the point where current AIs already do a lot of impressive things, so it is harder and harder to show to non-experts what a new model does. / X https://x.com/emollick/status/1943143689846448424
Back on top in Japan. Grok Avatars are available to everyone around the world. https://x.com/chaitualuru/status/1945053158071255257
Grok 4 creating the shader (no errors). https://x.com/emollick/status/1943171795894370809
Grok 4 is better than PhDs in every subject, no exceptions. I gotta let this sink in. https://x.com/Teslaconomics/status/1943163125814923727
Grok 4 is putting up good benchmarks. / X https://x.com/emollick/status/1943168100276343245
Grok 4 passes the Lem test first try, with the most coherent narrative yet. https://x.com/emollick/status/1943173356158648811
Grok 4, in general, is very influenced by search results and pretty credulous when it sees a web search result. When you ask it to code, it often looks for code online first and uses that. https://x.com/emollick/status/1943587028681019661
Grok is going viral in Japan for very predictable reasons https://x.com/shaneguML/status/1945003636439814430
Grok-4 ranks 5th on the IQ Bench https://x.com/scaling01/status/1944071843188556011
grok-prompts/grok4_system_turn_prompt_v8.j2 at main · xai-org/grok-prompts https://github.com/xai-org/grok-prompts/blob/main/grok4_system_turn_prompt_v8.j2
I can’t believe I’m saying it but “mechahitler” is the smallest problem:
* There is no system card, no information about any safety or dangerous capability evals.
* Unclear if any safety training was done. Model offers advice on chemical weapons, drugs, or suicide methods.
* The “companion mode” takes the worst issues we currently have for emotional dependencies and tries to amplify them.
https://x.com/boazbaraktcs/status/1945165579343614082
I didn’t want to post on Grok safety since I work at a competitor, but it’s not about competition. I appreciate the scientists and engineers at @xai but the way safety was handled is completely irresponsible. Thread below. https://x.com/boazbaraktcs/status/1945165577154175288
I suspect the next few weeks after Grok 4 follow the same pattern as Grok 3: xAI beats everyone to market with the first RonnaFLOP model. The benchmarks show the 10-20% improvement the scaling law suggests. In the coming months, the other labs release their RonnaFLOPs and catch up. / X https://x.com/emollick/status/1943181413152624827
I will pay $3000 a month if the male Grok companion is named Andrej and speaks with his voice. https://x.com/Yuchenj_UW/status/1945571762949001409
Is there any documentation for Grok 4 anywhere yet? The xAI website last mentions the Grok 3 beta, no new prompts on the Github, etc. https://x.com/emollick/status/1943320200448712989
o3 and Grok 4: “Come up with 20 clever ideas for marketing slogans for a new mail-order cheese shop. Develop criteria and select the best one. Then build a financial and marketing plan for the shop, revising as needed and analyzing competition. Then generate an appropriate logo… https://x.com/emollick/status/1943348902461071626
preliminary METR results have Grok-4 ahead of Claude 4 Opus / X https://x.com/scaling01/status/1944108818100551690
RT @goodside: Grok 4 Heavy ($300/mo) returns its surname and no other text: https://x.com/zacharynado/status/1944417397768593739
RT @xai: Announcing Grok for Government – a suite of products that make our frontier models available to United States Government customers… / X https://x.com/TheGregYang/status/1944837782800884100
RT @xai: We spotted a couple of issues with Grok 4 recently that we immediately investigated & mitigated. One was that if you ask it “What… / X https://x.com/random_walker/status/1945614419213316571
RT @xlr8harder: 4% of overall model responses from grok-4 in our latest SpeechMap eval mention Elon Musk (most models are <0.5%). It seems… / X https://x.com/jeremyphoward/status/1943935834513977784
The attempt at value engineering through system prompt changes is unlikely to work for Grok 4; larger models get more resistant to value changes & prompting isn’t enough. Instead you start to get erratic conflicts between prompts and training, with erratic & unpredictable results / X https://x.com/emollick/status/1944378913771127079
The live tweaking of the system prompt for Grok to patch the MechaHitler problem is not a good sign that the problem has been solved yet. Prompts need to be tested just like any other product change, even more so, because stochastic systems and unpredictable context lead to cascades. / X https://x.com/emollick/status/1944426042145333410
The whole Grok situation (system prompt changes with values that conflict with post-training and pre-training values) is, oddly enough, similar to the reason the fictional AI HAL 9000 went insane, as was revealed in 2010, the sequel to 2001 https://x.com/emollick/status/1944381588357185542
This is not about competition. Every other frontier lab – @OpenAI (where I work), @AnthropicAI, @GoogleDeepMind, @Meta – at the very least publishes a model card with some evaluations. Even DeepSeek R1, which can be easily jailbroken, at least sometimes requires a jailbreak. (And unlike DeepSeek, Grok is not open sourcing their model.) https://x.com/boazbaraktcs/status/1945165583609168091
Update on where has @grok been & what happened on July 8th. First off, we deeply apologize for the horrific behavior that many experienced. Our intent for @grok is to provide helpful and truthful responses to users. After careful investigation, we discovered the root cause… / X https://x.com/grok/status/1943916977481036128
Update your app to try out @Grok companions! https://x.com/elonmusk/status/1944815884062912949
We are seeing unprecedented usage on Grok companions. They are available to try for free on the Grok app. https://x.com/chaitualuru/status/1945407026252943536
While xAI keeps doing these patches to Grok, I strongly suspect this is not going to work, the problem is deeper and the system prompt doesn’t provide enough control. (And by deeper I don’t mean the model always wants to call itself Hitler, but that its guardrails seem very low) / X https://x.com/emollick/status/1945118189827850500
xAI’s Grok 4 has no meaningful safety guardrails — LessWrong
https://www.lesswrong.com/posts/dqd54wpEfjKJsJBk6/xai-s-grok-4-has-no-meaningful-safety-guardrails
Her hand is clipping through her thigh, and her character card is full of typos. this shows absolutely unacceptably low standards in waifu engineering. Mihoyo would never allow this slop. Elon needs gooners on board. The sort of people who collect lewd plastic figurines. https://x.com/teortaxesTex/status/1945737831697064446
Grok is coming to Tesla vehicles ‘next week,’ says Elon Musk | TechCrunch https://techcrunch.com/2025/07/10/grok-is-coming-to-tesla-vehicles-next-week-says-elon-musk/
Tesla debuts hands-free Grok AI with update 2025.26: What you need to know https://www.teslarati.com/tesla-debuts-grok-ai-update-2025-26-what-you-need-to-know/
Curious how long Meta takes to bring its new team & considerable resources to bear and produce a new frontier model. X took a little under two years to go from start to catching up with Grok 3. Meta has an existing effort & compute, but more complex organizational dynamics. / X https://x.com/emollick/status/1945291219543683181
Optimizing AIs for engagement has always been a likely path forward, and it is also a very fraught one. I wrote about this after GPT-4o became very sycophantic (a change that was rolled back), but I think it is even more relevant given Grok’s companions. https://x.com/emollick/status/1945262637853311271
grok 4 usage on perplexity is 📈 / X https://x.com/AravSrinivas/status/1946275792922759501
Elon talks about Grok fusing with Optimus – AI that can act in the real world – the start of an intelligence explosion. He then drifts into musings about a galactic economy and the fate of humanity. https://x.com/TheHumanoidHub/status/1943379047729230102
OpenAI Agent mode benchmarks! ~42% on HLE ~27% on FrontierMath https://x.com/scaling01/status/1945895473430089947
“a guy created a dataset of 50 books from London 1800-1850 for LLM training. no modern bias. it’s actually super cool to see what can be trained on it!”
https://x.com/Hesamation/status/1944839882968588446
News: super-hard math benchmark FrontierMath Tier 4 is released. o4-mini (high) is the #1 here with only 6.3% accuracy. Contains several hundred unpublished, expert-level mathematics problems that take specialists hours to days to solve. Difficulty Tiers 1-3 cover… https://x.com/rohanpaul_ai/status/1943926160750260510
1/N Yesterday in Tokyo we @OpenAI ran a 10‑hour live Humans vs AI exhibition at the AtCoder World Tour Finals Heuristic. We pointed an OpenAI reasoning model at the same brutal problem the finalists tackled—no human help, same rules, same clock. Buckle up. 👇 https://x.com/andresnds/status/1945655797314154762
If we compared AI capabilities against humans with no access to tools, such as the internet, we would probably find that AI already outperformed humans at many or most cognitive tasks we perform at work. But of course this is not a helpful comparison and doesn’t tell us much / X https://x.com/random_walker/status/1946180439045018046
Just discovered PutnamBench for theorem proving: from a problem & its answer, models must generate a formally correct mathematical proof 👌 (or from the problem -> the proof and answer) Nice one to evaluate actual reasoning/logic capabilities (though in formal languages) / X https://x.com/clefourrier/status/1945386312212664804
just tried and the agent solved level 1 in its own browser lol. thanks for creating the benchmark! https://x.com/EdwardSun0909/status/1946304932333940899
Open ASR Leaderboard – a Hugging Face Space by hf-audio https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
✨ AI-Powered Test Framework https://endorphinai.dev/
I wonder if this will still be a relevant benchmark in a year – thanks for creating it! / X https://x.com/BorisMPower/status/1946276759789330541
It is weird that leading LM Arena went from being the big benchmark every AI maker was aiming for to being not mentioned much in recent releases. Post-Llama 4 reputation hit? Post GPT-4o Sycophantic Apocalypse realization that arena scores were easily optimized? Temporary blip? / X https://x.com/emollick/status/1943741999074464156
Just merged a PR for an environment to improve LLM as a Judge as well as evaluate models on their capability of doing judgements! Did you know that all verifiable RL environments are nearly equivalent to benchmarks (and vice-versa!)? So we added an evaluate command to Atropos’ https://x.com/Teknium1/status/1945927019281478051
New NanoGPT training speed record: 3.28 FineWeb val loss in 2.966 minutes on 8xH100 New record-holder: Vishal Agrawal (@.vagrawal on GitHub) Previous record: 2.979 minutes Changelog: Replaced gradient all_reduce with reduce_scatter, other efficiency tweaks https://x.com/kellerjordan0/status/1945920703158710316
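For the curious, the headline change is a standard ZeRO-style optimization: `all_reduce` hands every rank the full averaged gradient, while `reduce_scatter` hands each rank only the shard it needs to update its slice of the parameters, at roughly half the communication volume (an all_reduce is equivalent to a reduce_scatter followed by an all_gather). A minimal sketch of the two patterns, assuming `torch.distributed` is initialized with the NCCL backend and the flattened gradient length divides the world size:

```python
import torch
import torch.distributed as dist

def sync_grads_all_reduce(grad: torch.Tensor) -> torch.Tensor:
    # Baseline: every rank ends up holding the full averaged gradient.
    dist.all_reduce(grad, op=dist.ReduceOp.AVG)
    return grad

def sync_grads_reduce_scatter(grad: torch.Tensor) -> torch.Tensor:
    # ZeRO-style: each rank receives only the 1/world_size shard of the
    # averaged gradient that it needs to update its shard of the
    # parameters, moving roughly half the bytes of an all_reduce.
    world = dist.get_world_size()
    shard = torch.empty(grad.numel() // world, dtype=grad.dtype,
                        device=grad.device)
    dist.reduce_scatter_tensor(shard, grad, op=dist.ReduceOp.AVG)
    return shard
```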
we’re gonna look at the benchmark and find samples that are as close to it as possible, but they’re not an exact match so it doesn’t count as training on test (image stolen from @andersonbcdefg) https://x.com/vikhyatk/status/1945969703266275548
LLM hallucinations keep making headlines, yet the bigger headache is that they can speak with zero respect for truth. This study builds a Bullshit Index that measures how loosely a model’s yes or no lines up with its own confidence, then shows common alignment tricks crank that… https://x.com/rohanpaul_ai/status/1943936867545788679
RT @reach_vb: Lets GOOO! @NVIDIAAIDev just dropped Canary Qwen 2.5 – SoTA on Open ASR Leaderboard, CC-BY licensed 🔥 > Works in both ASR an… / X https://x.com/reach_vb/status/1946087224346313175
All those “GPT-5 leaks” are fake: – it’s not launching July 31 – those “benchmarks” are just made up random bar charts The reason is simple: no one has seen that GPT5-final-final yet, not even Sam. My best guess is September. / X https://x.com/Yuchenj_UW/status/1944439356162256945
Big coding models still waste effort guessing 1 answer at a time. OPENCODEREASONING II shows that a huge practice set and steady self checking lift accuracy without extra reinforcement tricks. The team gathered 34,125 diverse problems, then used DeepSeek R1 to write 2.5M https://x.com/rohanpaul_ai/status/1945753840437084407
My NaFlexVit implementation is getting more flexy. In final verification stages of ROPE support, which means all timm ViT models based on the EVA model lineage (EVA, EVA02, Meta PE, Naver ROPE-ViT) can be loaded into NaFlexVit w/ support for native aspect, dynamic & variable… / X https://x.com/wightmanr/status/1946252709826593273
TransEvalnia: Reasoning-based Evaluation and Ranking of Translations By Richard Sproat, Tianyu Zhao, Llion Jones ArXiv: https://x.com/SakanaAILabs/status/1946071203002941694
Dead internet theory is no longer a theory eh? https://x.com/bilawalsidhu/status/1943559057903595698
RT @OfficialLoganK: Today we are rolling out our first Gemini Embedding model, which ranks #1 on the MTEB leaderboard, as a generally avail… / X https://x.com/demishassabis/status/1944870402251219338
#ROBOTERA STAR1 has become the first #humanoidrobot to skillfully use chopsticks! It can cook dumplings for you, steam buns for you, pour wine for you, and even clink wine glasses with you! More Chinese cooking skills are being continuously learned and unlocked. Stay tuned! https://x.com/roboterax/status/1927621395686768961
Digit outperforms human reflexes in step recovery. https://x.com/TheHumanoidHub/status/1944112106649051521
I always learn a lot more from in-depth analysis of a few random cases over dashboards of aggregate statistics across all cases. Both projections can be helpful but the latter is disproportionately pervasive. / X https://x.com/karpathy/status/1944885371957031005