Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Flat cartoon illustration of a cute coral-red lobster mascot character centered on dark charcoal background, white speech bubble with ‘TECH’ text and circuit board pattern border, minimal floating tech icons in background, clean geometric shapes, kawaii style, high contrast, web interface aesthetic

OpenClaw – Amazing Hands for a Brain That Doesn’t Yet Exist https://bengoertzel.substack.com/p/openclaw-amazing-hands-for-a-brain

On Recursive Self-Improvement (Part I) – by Dean W. Ball https://www.hyperdimensional.co/p/on-recursive-self-improvement-part

Within just 10 months, performance on the ARC-AGI-2 benchmark surpassed 75%. Let that sink in. https://x.com/kimmonismus/status/2018800964891984181

Gemini now processes over 10 billion tokens per minute via direct API use by our customers and the Gemini App just crossed 750M monthly active users : ) https://x.com/OfficialLoganK/status/2019166152199459074

Google’s 52x AI Growth | Tomasz Tunguz https://tomtunguz.com/google-earnings-q4-2025/

Google’s Gemini app has surpassed 750M monthly active users | TechCrunch https://techcrunch.com/2026/02/04/googles-gemini-app-has-surpassed-750m-monthly-active-users/

Our Q4/FY’25 results are in. Thanks to our partners & employees, it was a tremendous quarter, exceeding $400B in annual revenue for the first time. Our full AI stack is fueling our progress, and Gemini 3 adoption has been faster than any other model in our history. We’re really… https://x.com/sundarpichai/status/2019155348264042934

We’ve started to measure time horizons for recent models using our updated methodology. On this expanded suite of software tasks, we estimate that Gemini 3 Pro has a 50%-time-horizon of around 4 hrs (95% CI of 2 hr 10 mins to 7 hrs 20 mins). https://x.com/METR_Evals/status/2018752230376210586

The Gemini app hit 750M+ monthly active users in Q4 2025. ChatGPT was reported to have 810M monthly active users by the end of 2025. The gap is shockingly small. Gemini has a real shot at passing ChatGPT. https://x.com/Yuchenj_UW/status/2019157674143936980

It’s so exponential, it literally looks like a wall. GPT-5.2 high sets new record in task duration. And it’s not even xhigh. https://x.com/kimmonismus/status/2019174066565849193?s=46

We estimate that GPT-5.2 with `high` (not `xhigh`) reasoning effort has a 50%-time-horizon of around 6.6 hrs (95% CI of 3 hr 20 min to 17 hr 30 min) on our expanded suite of software tasks. This is the highest estimate for a time horizon measurement we have reported to date. https://x.com/METR_Evals/status/2019169900317798857
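METR’s 50%-time-horizon is, at heart, a logistic fit of model success probability against the log of human task duration, with the horizon read off at the 50% point. A minimal sketch of that idea, with synthetic task data and a bare-bones gradient-descent fit standing in for METR’s actual estimator:

```python
import math

def fit_horizon(tasks, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a - b*log2(minutes)) by gradient descent
    and return the task length (in minutes) at which P(success) = 0.5.

    `tasks` is a list of (human_minutes, succeeded) pairs -- synthetic
    stand-ins here, not METR's actual task suite."""
    a, b = 0.0, 1.0
    for _ in range(steps):
        ga = gb = 0.0
        for minutes, ok in tasks:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            err = p - (1.0 if ok else 0.0)   # cross-entropy gradient term
            ga += err
            gb += -err * x
        a -= lr * ga / len(tasks)
        b -= lr * gb / len(tasks)
    # P = 0.5 exactly when a - b*log2(t) = 0, i.e. t = 2**(a/b)
    return 2 ** (a / b)

# Toy data: success on short tasks, failure on long ones.
tasks = [(5, True), (15, True), (30, True), (60, True),
         (120, False), (240, True), (480, False), (960, False)]
print(f"50%-time-horizon ≈ {fit_horizon(tasks):.0f} minutes")
```

With this toy data the fitted horizon lands in the low hundreds of minutes; the wide confidence intervals METR reports come from resampling steps this sketch omits.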

Next Gen News 2 (NGN2) – Future of News and Young Audiences https://www.next-gen-news.com/

Chat is Going to Eat the World – Dead Neurons https://deadneurons.substack.com/p/chat-is-going-to-eat-the-world

The Second Pre-training Paradigm https://x.com/DrJimFan/status/2018754323141054786

Carcinisation – Wikipedia https://en.wikipedia.org/wiki/Carcinisation

A pretty bold commentary in Nature written by linguists, computer scientists and philosophers declaring “by reasonable standards, including Turing’s own, we have artificial systems that are generally intelligent. The long-standing problem of creating AGI has been solved.” https://x.com/emollick/status/2018524111627325554

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+ https://huggingface.co/blog/huggingface/one-year-since-the-deepseek-moment-blog-3

Brain Dumps as a Literary Form – by Dave Griffith https://davegriffith.substack.com/p/brain-dumps-as-a-literary-form

🔎 Evaluating Deep Agents: Here’s What We Learned 🔎 Deep agents can’t be evaluated like simple LLM tasks. After building and testing 4 production agents over the past few months, we learned that evaluating deep agents requires: 1. Bespoke test logic for each datapoint — each… https://x.com/LangChain/status/2018769968515404212

We have been shipping 🛳️❤️ 📦 Community Evals & Benchmark Datasets: Benchmark datasets host benchmark leaderboards, you can now contribute eval results by opening a PR to model repositories, all PRs are fed to benchmark datasets 📦 Chat with datasets: agents live in Data… https://x.com/huggingface/status/2019754567685050384

🏆 Agent-Centric Benchmark Results 🟣 SWE-Bench Verified: Qwen3-Coder-Next >70% with the SWE-Agent scaffold 🟣 Efficient but strong: Despite a small active footprint, it matches or exceeds several much larger open-source models on a range of agent benchmarks. https://x.com/Alibaba_Qwen/status/2018719026558664987

🚀 What Benchmark Design Tells Us About the Result of Step 3.5 Flash? Here’s a detailed breakdown from model infra engineer & Zhihu contributor P2oileen, who worked directly on the benchmarking infrastructure. 💬 “If high scores can’t be reproduced, a tech report is just paper.” https://x.com/ZhihuFrontier/status/2019734062689304970

After my benchmark test: GLM OCR > paddleOCR 1.5 > deepseek OCR 2. GLM OCR can even capture some small handwritten characters and is currently the state-of-the-art OCR model. https://x.com/bdsqlsz/status/2018663915404841212

Artificial Analysis released version 4.0 of its Intelligence Index, replacing saturated benchmarks with new tests focused on economically useful work, factual reliability, and reasoning. The update aims to better capture how large language models perform in business contexts… https://x.com/DeepLearningAI/status/2019169092024848512

Eval scores in 2026 are broken. MMLU at 91%+, GSM8K at 94%+, yet models still can’t handle basic multi-step tasks. And reported scores don’t even agree across model cards, papers, and platforms. We just shipped Community Evals on @huggingface: – Benchmark datasets now host live… https://x.com/ben_burtenshaw/status/2019795723378942295

I don’t really think we’ll ever get to the point of ‘all benchmarks being saturated’, because we’ll always create harder ones, but we *definitely* aren’t there right now: SWE-bench Multilingual best score: <80%. SciCode [subquestion]: 56%. CritPt: 12%. VideoGameBench: 1%. https://x.com/OfirPress/status/2019755847149056456

I know it’s probably a great model, I just wish they didn’t cherry-pick their benchmarks so much. Like where is MMLU, HLE, ARC-AGI? Too bad @Huggingface shut down their leaderboard, and nobody else has stepped up. We also learned from @AIatMeta that we can’t just take the model… https://x.com/QuixiAI/status/2018251816647938051

New SOTA public submission to ARC-AGI: – V1: 94.5%, $11.4/task – V2: 72.9%, $38.9/task Based on GPT 5.2, this bespoke refinement submission by @LandJohan ensembles many approaches together. https://x.com/arcprize/status/2018746794310766668

Our DRACO Benchmark is fully open-source and we’re releasing the benchmark, rubrics, and methodology today. To learn more about methodology and detailed results, read the full paper: https://t.co/MDgnQ3E0kO The dataset is available on Hugging Face… https://x.com/perplexity_ai/status/2019126646054482294

Very impressive coding model with a nice tech report: it’s only a 3B-active MoE, with strong benchmark results and hybrid linear attention (Gated DeltaNet), so efficient long-context inference. https://x.com/eliebakouch/status/2018730622358073384

we released Community Evals to fix transparency in evals 🤝 → Benchmark Datasets host leaderboards → create PRs to add eval results to the leaderboard, link models 🔗 leaderboards GPQA, HLE and MMLU-Pro are live, check how SOTA models like Kimi 2.5 compare 🙌🏻 https://x.com/mervenoyann/status/2019784907178811644

We tested this with “Oracle” experiments across multiple benchmarks: – Index the corpus at several chunk sizes – Let an oracle pick the best size per query (with ground truth) Result: 20-40% better recall than ANY fixed chunk size. The optimal choice is query-dependent. >> https://x.com/YuvalinTheDeep/status/2018297202066481445
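The oracle protocol in this thread can be sketched end to end on a toy corpus. The word-overlap scorer, corpus, and queries below are invented stand-ins (real pipelines use embedding retrieval), but the structure is the same: index at several chunk sizes, then let an oracle with ground truth pick the best size per query, which by construction matches or beats any single fixed size:

```python
def chunk(text, size):
    """Split text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def recall_at_1(query, chunks, answer):
    """1 if the top word-overlap chunk contains the answer string, else 0.
    (A toy scorer standing in for embedding similarity.)"""
    q = set(query.lower().split())
    best = max(chunks, key=lambda c: len(q & set(c.lower().split())))
    return int(answer.lower() in best.lower())

corpus = ("the treaty was signed in 1648 ending the long war . "
          "trade resumed quickly afterwards across europe . "
          "the port city grew rich on grain and timber exports .")
queries = [("when was the treaty signed", "1648"),
           ("what did the port city export", "timber")]
sizes = [4, 8, 16]

# Fixed-size indexing: one chunk size for every query.
fixed = {s: sum(recall_at_1(q, chunk(corpus, s), a) for q, a in queries)
         for s in sizes}
# Oracle: pick the best size per query (requires ground truth).
oracle = sum(max(recall_at_1(q, chunk(corpus, s), a) for s in sizes)
             for q, a in queries)
print(fixed, oracle)  # oracle >= every fixed size by construction
```

On this tiny example the oracle recovers both answers while small fixed chunk sizes miss them; the thread’s 20-40% recall gap is the same effect measured on real benchmarks.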

@finbarrtimbers DO NOT use FireworksAI to benchmark Kimi – They have failed to make any of it work right, tool calls aren’t parsed, model is shot up somehow in other ways. https://x.com/Teknium/status/2018092504613285900

A week after PaddleOCR-VL-1.5 took the top spot on OmniDocBench, *another* 0.9B model dethrones it! GLM-OCR shows SOTA results on doc parsing benchmarks and it’s apparently 50-100% faster https://x.com/jerryjliu0/status/2018713059359899729

🚨 Top 10 Open Models in January: Text Arena Looking back at last month, here are the rankings by provider for January: 🥇 #1 Kimi-K2.5-Thinking by @Kimi_Moonshot (Modified MIT) 🥈 #2 GLM-4.7 by @Zai_org (MIT) 🥉 #3 Qwen3-235b-a22b-instruct-2507 by @Alibaba_Qwen (Apache 2.0) https://x.com/arena/status/2018727506850033854

Today, we’re rolling out an Advanced version of Perplexity Deep Research, achieving state-of-the-art performance on external and internal benchmarks, beating every other deep research tool on accuracy, usability, and reliability across all verticals. https://x.com/AravSrinivas/status/2019129261584752909

We’ve upgraded Deep Research in Perplexity. Perplexity Deep Research achieves state-of-the-art performance on leading external benchmarks, outperforming other deep research tools on accuracy and reliability. Available now for Max users. Rolling out to Pro in the coming days. https://x.com/perplexity_ai/status/2019126571521761450

Humanoid whole-body control ASI benchmark https://x.com/TheHumanoidHub/status/2017293983115092168

Introducing the Artificial Analysis Video with Audio Arena! Compare video models with native audio generation including Veo 3.1, Grok Imagine, Sora 2, and Kling 2.6 Pro Since Google’s Veo 3 launched last May as the first major video model with native audio generation, many… https://x.com/ArtificialAnlys/status/2019132516897288501

3/ These benchmarks aren’t just a leaderboard: They help us measure how AI models handle real-world skills like planning, communication and decision-making under uncertainty. https://x.com/Google/status/2019094601080992004

We’re advancing AI benchmarking systems by letting them play games 🕹️ Games like Poker & Werewolf in the @Kaggle Game Arena allow us to test AI capabilities and “soft skills” in controlled sandbox environments before they’re deployed 🧵 (1/4) ↓ https://x.com/Google/status/2019094596588839191

📉✂️ Image Arena Pareto Frontier: Image Edit. Now let’s take a look at image editing. Looking at Arena Score versus price per image lets us see which models sit on the Pareto frontier across both efficient and highly complex image editing. Top models on the Pareto frontier for… https://x.com/arena/status/2018792314878234704

📉🖼️ Image Arena Pareto Frontier. Image use cases vary widely. Sometimes you want the highest quality, and sometimes you need something efficient enough to run at scale. Looking at Arena Score versus price per image lets us see which models sit on the Pareto frontier. Top… https://x.com/arena/status/2018787949840896119

🚨BREAKING: Kimi K2.5 by @Kimi_Moonshot is now the #1 open model in Code Arena! In Code Arena’s agentic coding evaluations, Kimi K2.5 is now: – #1 open model, surpassing GLM-4.7 – #5 overall, on par with top proprietary models like Gemini-3-Flash – The only open model in the top… https://x.com/arena/status/2018355347485069800

We’re introducing WorldVQA, a new benchmark to measure atomic vision-centric world knowledge in Multimodal Large Language Models. Current evaluations often conflate visual knowledge retrieval with reasoning. In contrast, WorldVQA decouples these capabilities to strictly measure… https://x.com/Kimi_Moonshot/status/2018697552456257945

NVFP4-QAD-Report.pdf https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf

[2601.21571] Shaping capabilities with token-level data filtering https://arxiv.org/abs/2601.21571

[2601.22975] Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text https://arxiv.org/abs/2601.22975

[2602.02361] SWE-Universe: Scale Real-World Verifiable Environments to Millions https://arxiv.org/abs/2602.02361

[AINews] AI vs SaaS: The Unreasonable Effectiveness of Centralizing the AI Heartbeat https://www.latent.space/p/ainews-ai-vs-saas-the-unreasonable

@PetarV_93 @fedzbar @ccperivol @sindero “In this paper, we provide novel evidence that perplexity should not be blindly trusted as a model selection objective.” Like this? 🤔⬇️ https://x.com/DamienTeney/status/2018413621361967216

@Yuchenj_UW I tried to use it this way and basically failed, the models aren’t at the level where they can productively iterate on nanochat in an open-ended way. (Though one of the primary motivations for me writing nanochat is that I’d very much love for it to be used this way as a… https://x.com/karpathy/status/2019851952033771710

> people claim a study found no efficiency gains using LLM for coding > look inside > participants used 4o in chat sidebar every time https://x.com/papayathreesome/status/2018169992752083034

🎉🎉🎉 Congrats to @StepFun_ai on releasing Step 3.5 Flash, and day-0 support is ready in vLLM! A 196B MoE that activates only 11B params per token, giving you frontier reasoning with exceptional efficiency. Highlights: • 74.4% SWE-bench Verified, 51.0% Terminal-Bench 2.0 … https://x.com/vllm_project/status/2018374448357998874

🚀 New in LangSmith: Customize trace previews. Control which parts of a trace show up directly in the tracing table. Surface what matters — whether it’s the last user message, or a nested output value. Skip the noise, debug faster. Docs 👉 https://x.com/LangChain/status/2019848808310706367

256 Tb/s data rates over 200 km distance have been demonstrated on single mode fiber optic, which works out to 32 GB of data in flight, “stored” in the fiber, with 32 TB/s bandwidth. Neural network inference and training can have deterministic weight reference patterns, so it is… https://x.com/ID_AA_Carmack/status/2019839335382790342
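The 32 GB figure is just line rate multiplied by one-way propagation delay. A quick check, assuming a typical fiber group index of about 1.5 (light travels at roughly 2/3 of c in silica):

```python
C = 299_792_458          # speed of light in vacuum, m/s
N_FIBER = 1.5            # approximate group index of silica fiber
rate_bps = 256e12        # 256 Tb/s line rate
distance_m = 200e3       # 200 km span

delay_s = distance_m / (C / N_FIBER)   # one-way propagation delay ≈ 1 ms
bits_in_flight = rate_bps * delay_s    # bits "stored" in the glass
print(f"delay: {delay_s*1e3:.2f} ms, in flight: {bits_in_flight/8/1e9:.0f} GB")
```

200 km at two-thirds light speed is about a millisecond, and a millisecond of a 256 Tb/s stream is 256 Gb, i.e. roughly 32 GB in flight, matching the tweet’s arithmetic.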

AI’s trillion-dollar opportunity: Context graphs https://x.com/JayaGup10/status/2003525933534179480

An open source project developing a desktop “Pick-and-Place” machine. (📍All of the source on GitHub) The LumenPnP can be used for assembling electronic components onto circuit boards! The project is getting a huge update with a whole new control box, integrating the pneumatics… https://x.com/IlirAliu_/status/2018037462560313384

belated ‘aha’ moment: Context engineering is as important to inference as data engineering is to training https://x.com/swyx/status/2018533744442057115

For non-verifiable domains, the only way you can improve AI performance at this time is via curating more annotated training data, which is expensive and only yields logarithmic improvements. And here’s the thing: nearly all jobs have non-verifiable elements. There’s virtually… https://x.com/fchollet/status/2019610121371054455

Gen AI Chatbots: February 2026 Apptopia Data Brief – Apptopia https://apptopia.com/en/blog/gen-ai-chatbots-february-2026-apptopia-data-brief/

Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text: “constructing a multiple-choice question-answering version of the fill-in-the-middle task” “Given a source text, we prompt an LLM to identify and mask key reasoning steps, then…” https://x.com/iScienceLuvr/status/2018233829488484674
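The construction quoted above can be sketched without the LLM step. In this toy version a middle sentence is masked heuristically where the paper prompts an LLM to pick a key reasoning step, and the distractor pool is invented; the point is that the result is a verifiable task (a correct option index) built from otherwise unverifiable text:

```python
import random

def make_fim_mcq(text, distractor_pool, rng):
    """Turn free text into a verifiable multiple-choice
    fill-in-the-middle task. Golden Goose prompts an LLM to pick and
    mask a key reasoning step; masking the middle sentence is a
    heuristic stand-in for that step."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    masked_idx = len(sentences) // 2
    answer = sentences[masked_idx]
    prompt_sents = sentences[:]
    prompt_sents[masked_idx] = "[MASK]"
    options = [answer] + rng.sample(distractor_pool, 3)
    rng.shuffle(options)
    return {
        "prompt": ". ".join(prompt_sents) + ".",
        "options": options,
        "label": options.index(answer),   # verifiable reward signal
    }

rng = random.Random(0)
task = make_fim_mcq(
    "Water absorbs heat. Its temperature rises. Eventually it boils",
    ["The glacier retreats", "The circuit shorts", "Prices fall", "It rains"],
    rng,
)
print(task["prompt"])
print(task["options"], "-> correct index:", task["label"])
```

Because the label is just an index, an RLVR loop can score answers exactly, which is what lets the paper mint unlimited tasks from plain internet text.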

How does AI impact skill formation? https://www.seangoedecke.com/how-does-ai-impact-skill-formation/

HUSKY is a physics-aware framework for humanoid skateboarding, modeling the task as a hybrid dynamical system. – It derives a kinematic equality constraint between board tilt and truck steering to enable physics-informed policy learning. – Using Deep Reinforcement Learning… https://x.com/TheHumanoidHub/status/2018932338366026232

If the Superintelligence were near fallacy — LessWrong https://www.lesswrong.com/posts/tkA9J8RxoEckH7Pop/if-the-superintelligence-were-near-fallacy

An increasing problem with publishing work on AI is that the publication process is much slower than the working-paper process, so when papers finally get full peer reviews, authors are asked to account for newer papers that are built on the paper under review! No real norms around this. https://x.com/emollick/status/2018805872651276393

Introducing the new v0 – Vercel https://vercel.com/blog/introducing-the-new-v0

It took me weeks, but finally it’s there: an overlong blogpost on synthetic pretraining. https://x.com/Dorialexander/status/2018018715162288611

Model Vault: a private platform for secure model inference https://cohere.com/blog/model-vault

My PhD thesis is out 🥳🎓 How do LLMs, trained on trillions of tokens, reason? Can they generalise beyond their training data or are they constrained by what they’ve seen before? My takeaway: they can generalise beyond training in interesting ways, showing genuine reasoning… https://x.com/LauraRuis/status/2019085266124759509

Synthetic Pretraining | Vintage Data https://vintagedata.org/blog/posts/synthetic-pretraining

This Stanford paper compresses documents into much smaller representations to reduce memory footprint and speed up LLM generation. Background: many papers focus on context compression for LLMs. Gist tokens [1] compresses a prompt into a short KV cache (1/N) https://x.com/gabriberton/status/2018097161343553770

What are Context Graphs? The Elegant Idea Everyone’s Talking About. https://simple.ai/p/what-are-context-graphs

What can be a “LAMP stack” for AI? Raffi Krikorian, CTO of @mozilla, suggests a clear way to think about it -> https://x.com/TheTuringPost/status/2016932531388440754

AlphaFace: High Fidelity and Real-time Face Swapper Robust to Facial Pose https://arxiv.org/pdf/2601.16429

EditYourself https://edit-yourself.github.io/

JOSH: Joint Optimization for 4D Human-Scene Reconstruction in the Wild. TL;DR: A unified pipeline that jointly optimizes human motion, scene geometry, and camera pose from monocular video, improving accuracy in in-the-wild reconstructions. https://x.com/Almorgand/status/2017259738761740341

Judgment isn’t uniquely human – by Steven Adler https://stevenadler.substack.com/p/judgment-isnt-uniquely-human

10 Charts That Explain the AI Era – by Deb Liu https://debliu.substack.com/p/10-charts-that-explain-the-ai-era

My AI Adoption Journey – Mitchell Hashimoto https://mitchellh.com/writing/my-ai-adoption-journey

[2601.22950] Perplexity Cannot Always Tell Right from Wrong https://arxiv.org/abs/2601.22950

Ah! Some people might know that I’ve long been suspicious of looking almost only at perplexity for pretrain… can’t wait to read this: https://x.com/giffmana/status/2018393065803620662

Introducing Model Council https://www.perplexity.ai/hub/blog/introducing-model-council

Using Interpretability to Identify a Novel Class of Alzheimer’s Biomarkers https://www.goodfire.ai/research/interpretability-for-alzheimers-detection

PaperBanana: Automating Academic Illustration for AI Scientists https://dwzhu-pku.github.io/PaperBanana/

BREAKING: @xAI’s Grok-Imagine-Video now #1 in Video Arena! For the first time, Grok-Imagine-Video-720p takes the top spot on the Image-to-Video leaderboard, overtaking Google’s Veo 3.1 while being 5x cheaper. Its 480p version released a few days ago ranks #4. Huge congrats to… https://x.com/arena/status/2019204821551837665

Hard to know which X articles are valuable, but this is a good summary of the significance of world modeling by a distinguished scientist and robotics expert at NVIDIA https://x.com/emollick/status/2018774863734075878

Discover more from Ethan B. Holland
