Ethan B. Holland

Over 51,300 manually organized AI links and counting

Benchmarks: AI News Week Ending 01/23/2026

January 23, 2026

Image created with gemini-2.5-flash-image with claude-sonnet-4-5. Image prompt: Top-down isometric pixel art view of a drag racing strip in PS1 GTA1 style with chunky sprite AI models lined up at starting positions, massive glowing scoreboard displaying rankings overhead, checkered flags and measurement markers on saturated neon-lit asphalt, tiny pixelated spectator sprites gathered at finish line, high contrast with hot pink and electric blue neon accents against dark track, chunky 32-bit aesthetic with visible pixels and dithered textures.

“AI agents have gotten good enough at long horizon tasks that it is an inflection point in the impact of AI at work. Agreement on this from METR, GDPval & now Anthropic. If you have a tool that saves 8 hours 65% of the time, that changes work, even counting potential error rates.” https://x.com/emollick/status/2012237630411292859

Gartner Says Worldwide AI Spending Will Total $2.5 Trillion in 2026 https://www.gartner.com/en/newsroom/press-releases/2026-1-15-gartner-says-worldwide-ai-spending-will-total-2-point-5-trillion-dollars-in-2026

“AI data centers can now use as much power as New York State uses on the hottest days of the year. We find that data centers currently have a total capacity of around 30 GW.” https://x.com/EpochAIResearch/status/2012303496465498490

Benchmarking AI Agent Memory: Is a Filesystem All You Need? | Letta https://www.letta.com/blog/benchmarking-ai-agent-memory

Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices at Leading AI Companies — AVERI https://www.averi.org/ourwork/frontier-ai-auditing

“What if your bricklayer could fly? Researchers at Imperial College London tested a new way to build using cooperating drones instead of cranes or scaffolding. Two drones work together. One lays down foam and lightweight cement. The other checks accuracy while humans supervise” https://x.com/IlirAliu_/status/2012963986069745862

“How well did forecasters predict 2025 AI progress? According to the @aidigest_’s survey, forecasters: – Mostly nailed benchmark scores – Underestimated risks from AI-enabled bioweapons – Underestimated revenue by almost 2× – Overestimated public concern about AI Details in 🧵” https://x.com/EpochAIResearch/status/2012264230028984493

Thoughts on Evals – Raindrop Blog https://www.raindrop.ai/blog/thoughts-on-evals

Without Benchmarking LLMs, You’re Likely Overpaying 5-10x | Karl Lorey https://karllorey.com/posts/without-benchmarking-llms-youre-overpaying

Community Benchmarks: Evaluating modern AI on Kaggle https://blog.google/innovation-and-ai/technology/developers-tools/kaggle-community-benchmarks/

“Since OpenAI didn’t update Figure 7 from GDPval given the success rate of GPT-5.2 on long-form tasks, I used GPT-5.2 Pro to do so. The chart assumes the process is: delegate long tasks to AI, evaluate the output for an hour, then decide to try again or give up & do it yourself.” https://x.com/emollick/status/2013243362229256550
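
For a rough sense of the arithmetic behind that delegate, review, retry-or-give-up process, here is a minimal Python sketch. It is not the model behind the GDPval chart. The function name, the cap on attempts, the assumption that attempts succeed independently, and the assumption that waiting on the AI costs no human time are all illustrative choices. The 8-hour task and 65% success rate are borrowed from the earlier quote in this roundup purely as example inputs, not as GPT-5.2 figures.

```python
def expected_human_hours(task_hours, success_rate, review_hours=1.0, max_attempts=3):
    """Expected human hours under a delegate -> review -> retry-or-give-up loop.

    Assumptions (illustrative, not from the source chart):
    - each attempt succeeds independently with probability `success_rate`
    - each attempt costs the human `review_hours` of evaluation time
    - after `max_attempts` failures the human does the task (`task_hours`)
    - time the AI spends working costs no human time
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    expected = 0.0
    p_reach = 1.0  # probability that this attempt happens at all
    for _ in range(max_attempts):
        expected += p_reach * review_hours   # pay for reviewing this attempt
        p_reach *= 1.0 - success_rate        # chance it failed and we continue
    expected += p_reach * task_hours         # every attempt failed: do it yourself
    return expected


if __name__ == "__main__":
    # One attempt, then give up: 1 + 0.35 * 8 hours
    print(round(expected_human_hours(8, 0.65, max_attempts=1), 2))  # 3.8
    # Up to three attempts before giving up
    print(round(expected_human_hours(8, 0.65, max_attempts=3), 2))  # 1.82
```

Under these assumptions the 8-hour task costs about 3.8 expected human hours with a single attempt and about 1.8 with up to three, which is why a tool that fails a third of the time can still change how the work gets done.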

“📈👨🏻‍💻” https://x.com/alexandr_wang/status/2013403027655532672

“Actually the majority of studies have so far found that AI reduces inequality and closes skill gaps. There are relatively few that found the opposite–that experts got even better. One was compelling enough to build a narrative that AI would increase skill gap. But it wasn’t real” https://x.com/alexolegimas/status/2012334799998861451

Korea Kicks Off AI Squid Game for Best Sovereign Foundation Models – Bloomberg https://www.bloomberg.com/news/features/2026-01-19/korea-kicks-off-ai-squid-game-for-best-sovereign-foundation-models