Image created with OpenAI gpt-image-1. Image prompt: Single-panel cartoon with loose, hand‑inked lines, bean‑bodied figures, muted flat colors, minimal props, and deadpan humor: Silicon gym. Sweating computers bench‑press barbells labeled “latency” and “accuracy” while a clipboard‑wielding stopwatch shouts encouragement. Large bold title text centered at top: “BENCHMARKS” Muted colors, flat shading, black ink outlines. 16:9. Caption: “It’s leg‑day for the silicon set.”

“OpenAI’s o3 and o4-mini scores on the Extended NYT Connections benchmark. This benchmark evaluates large language models (LLMs) using 651 NYT Connections puzzles, with additional words included to increase difficulty. The standard NYT Connections benchmark is nearing saturation, https://x.com/rohanpaul_ai/status/1913927366717342166

“New Arena launch: Sentiment Control – decoupling the impact of tone and emotion from response quality in human evaluation💗 How much do emojis, enthusiasm, and positive sentiment affect human preference? How can we adjust the leaderboard to counteract the effect of https://x.com/lmarena_ai/status/1914737052144558512

Google for Startups AI Academy: American Infrastructure cohort applications open https://blog.google/feed/google-for-startups-ai-academy-america-infrastructure-apply/

“More evidence that o3 represents a big move forward, this time on ARC-AGI. https://x.com/emollick/status/1914798775840706690

“Our biggest update so far: We’re excited to announce Collaboration in @JuliusAI_ Julius is now your team’s AI Data Analyst with built-in realtime collaboration. Check it out below https://x.com/0interestrates/status/1912904618364874850

“TextArena is live on arXiv! We present a benchmark of 57+ competitive text-based games to evaluate and train LLMs on agentic behavior — including negotiation, deception, theory of mind and many more. Real-time TrueSkill. Multiplayer support. Human-vs-models. Model-vs-model. https://x.com/LeonGuertler/status/1912355535489495471

“TextArena went live on Hugging Face It’s an open-source collection of competitive text-based games for LLMs, spanning 57+ unique environments Tests for different agentic behaviors—negotiation, theory of mind, deception, via competitive play https://x.com/rowancheung/status/1914567435228795391

“Geobench – A benchmark to measure how well LLMs can pinpoint a location based on a Google Street View image. Basically it makes LLMs play the game GeoGuessr, and finds out how well each model performs on common metrics in the GeoGuessr community – if it guesses the correct https://x.com/rohanpaul_ai/status/1913350223247683980

“Why OpenAI doesn’t include other models in their own benchmarks: OpenAI-MRCR results with Gemini 2.5 Pro https://x.com/scaling01/status/1913955228442833032

“Have you tested out the new LMArena in Beta yet? ⏰In less than 24 hours, the community response has been incredible – we’ve already made tweaks based on your feedback! You’ll notice: 🌔 Dark/Light mode toggle in the top right ✂️ Copy/paste images directly into the prompt https://x.com/lmarena_ai/status/1913260116465656220

“Why do hardware companies struggle to build AI software that we can fall in love with? Dive in to learn more about the technology problems, incentives, and challenge of being an AI software team at a hardware company👇” https://x.com/clattner_llvm/status/1914814581266112858

“Very impressive. You can now use agents to do market research. Listen just raised $27M from Sequoia to replace surveys and focus groups with thousands of AI interviews. ▸ Interviews, analysis, insights in under 24h ▸ Auto-generates reports, themes, highlight reels ▸ Handles https://x.com/LiorOnAI/status/1915140553806946751

“SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM “In this work, we present two-Staged history-Resampling Policy Optimization (SRPO), which successfully surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks. https://x.com/iScienceLuvr/status/1914622980296192357

Analyzing o3 and o4-mini with ARC-AGI https://arcprize.org/blog/analyzing-o3-with-arc-agi

“Today we’re announcing Mechanize, a startup focused on developing virtual work environments, benchmarks, and training data that will enable the full automation of the economy. We will achieve this by creating simulated environments and evaluations that capture the full scope of” https://x.com/MechanizeWork/status/1912904151874625928

“I’m starting a new company: Mechanize. Mechanize will build virtual work environments, benchmarks, and training data to enable the full automation of all work. We’re hiring: hiring@mechanize.work.” https://x.com/tamaybes/status/1912905467376124240
