Image created with gemini-3.1-flash-image-preview and claude-sonnet-4-5. Image prompt: 1980s mainframe computer room in crisis, rows of cabinet servers with amber CRT displays showing cascading errors, sparks flying from overheating hardware, cooling fans visible, dark atmospheric lighting with amber and blue screen glow, massive bold red sans-serif text reading TECH in foreground, cinematic 1980s techno-thriller aesthetic, high contrast, foreboding mood

OpenAI just published a new 37-page report on how bad actors are attempting to misuse ChatGPT. Some of the wild cases: – A fraud ring scaled personalized romance scams with AI-generated scripts – North Korea-linked actors used it to research crypto attack vectors and draft fake
https://x.com/TheRundownAI/status/2026743836949549253

Google’s Nano Banana 2 (Gemini 3.1 Flash Image Preview) takes #1 in Text to Image in the Artificial Analysis Image Arena at half the price of Nano Banana Pro! Nano Banana 2 is the latest Flash-tier image model from @GoogleDeepMind , succeeding the original Nano Banana (Gemini
https://x.com/ArtificialAnlys/status/2027052241019175148

Am currently putting together an article, and yeah, the SWE-Bench Verified numbers are definitely a bit sus across all models — the benchmark suggests models are more similar than they really are. So, I went down a rabbit hole looking into SWE-Bench Verified issues… And it looks
https://x.com/rasbt/status/2026062254571913522

Devin now has full computer use capabilities and can share screen recordings. You can control desktop apps, build and QA mobile apps, and automate tedious work. Here are some examples that blew our team away: 1. Making a desktop game
https://x.com/cognition/status/1983983151157563762

For years I’ve said that the capability-reliability gap is an under-appreciated limitation of AI agents. Finally, in a new paper led by @steverab, we defined and measured it!
https://x.com/random_walker/status/2026384543700115870

Frontier models have (mostly) stopped making dumb security mistakes. But, when running for a long time, like in agentic coding or OpenClaw, even a single mistake can be fatal. How can we benchmark this? Instead of making larger and larger agentic benchmarks, we made an easier
https://x.com/jonasgeiping/status/2026714911951220888

Lots of important ideas here! “Evaluating 14 models on two complementary benchmarks, we found that nearly two years of rapid capability progress have produced only modest reliability gains… Unfortunately, AI agents are evaluated based on a single number, the average success
https://x.com/JustinBullock14/status/2026693253169336475

Many teams treat evals as a last-mile check. monday Service made them a Day 0 requirement for their AI service agents. https://t.co/8pFE1Aw4hH Using LangSmith, the monday service team has been able to: 🔷Achieve 8.7x faster evaluation feedback loops (from 162 seconds to 18 seconds).
https://x.com/hwchase17/status/2026095629148258440

New research from Intuit AI Research. Agent performance depends on more than just the agent. It also depends on the quality of the tool descriptions it reads. However, tool interfaces are still written for humans, not LLMs. As the number of candidate tools grows, poor
https://x.com/omarsar0/status/2026676835539628465
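The failure mode described above is easy to reproduce in miniature. The sketch below is purely illustrative (a toy keyword matcher standing in for an LLM's tool selection; the tool names and descriptions are made up, not from the Intuit paper): with precise descriptions the right tool wins, while vague ones collapse into ties.

```python
# Toy tool selection: a "model" picks the tool whose description shares
# the most words with the user request. Vague descriptions cause collisions.

def pick_tool(request, tools):
    words = set(request.lower().split())
    return max(tools, key=lambda name: len(words & set(tools[name].lower().split())))

precise = {
    "get_weather": "return current weather forecast for a city",
    "get_stock":   "return latest stock price for a ticker symbol",
}
vague = {
    "get_weather": "gets data",
    "get_stock":   "gets data",
}

assert pick_tool("what is the weather forecast in Paris", precise) == "get_weather"
# With the vague descriptions, both tools tie and selection is arbitrary —
# exactly the degradation the research points at as the tool count grows.
```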

Our new SWE-bench Multilingual leaderboard compares software engineering performance across 9 different languages as evaluated with mini-SWE-agent v2. Model rankings are significantly different between languages. Detailed stats & browsable trajectories in 🧵
https://x.com/KLieret/status/2026322986907652295

Having an agentic VLM model shade & render your 3D scene is the ultimate counterexample to the “pixels is all you need” crowd. Real time video is powerful – it’s even a new medium. But explicit 3D is still very useful. Also this donut makes me hungry.
https://x.com/bilawalsidhu/status/2026184423004160185

Can coding agents build entire software systems from scratch? ByteDance, M-A-P, 2077AI, and leading Chinese universities present NL2Repo-Bench, a new benchmark that pushes agents to their limits. It tests if an AI can take a simple text description and autonomously design,
https://x.com/jiqizhixin/status/2025823941642621241

🆕 The End of SWE-Bench Verified (2024-2026) https://t.co/HCmogFFG8w Today @OpenAIDevs is announcing the voluntary deprecation of SWE-Bench Verified! We’re releasing a podcast + analysis in today’s post. Saturation of SWE-Bench has been a community hot topic for over a year –
https://x.com/latentspacepod/status/2026027529039990985

BREAKING: Arrow 1 by @QuiverAI ranks #1 on SVG Arena by Design Arena with an Elo of 1583 It’s the first model to ever break 1500+ on one of our leaderboards, establishing the new SOTA frontier for SVG generation Huge congratulations to the @QuiverAI team for this remarkable
https://x.com/Designarena/status/2027066193946026200?s=20

Document OCR benchmarks are hitting a ceiling – and that’s a problem for real-world AI applications. Our latest analysis reveals why OmniDocBench, the go-to standard for document parsing evaluation, is becoming inadequate as models like GLM-OCR @Zai_org achieve 94.6% accuracy
https://x.com/llama_index/status/2026342120236396844

OmniDocBench is getting saturated. VLMs are getting increasingly better at document understanding, from OSS (DeepSeek-OCR2, GLM-OCR) to frontier (Gemini 3, Kimi 5.2, GPT-5.2). A popular benchmark to measure document understanding progress has been OmniDocBench. But we’re
https://x.com/jerryjliu0/status/2026408921385284001

please stop falling for benchmaxxing
https://x.com/scaling01/status/2026698844088549848

The First Fully General Computer Action Model | blog https://si.inc/posts/fdm1/

The rankings on AlgoTune look a bit weird to some people at first; they don’t always correlate with rankings on other coding leaderboards. This is because AlgoTune has a $1 limit per task, so cheap models sometimes do much better than smarter but more expensive models. I think this
https://x.com/OfirPress/status/2026068384589172800

tl;dr SWE-bench Verified is heavily contaminated for all frontier models, and many of the problems are also broken. Time to move on to harder, uncontaminated coding evals.
https://x.com/polynoamial/status/2026032321212891550

Today, we’re launching a dedicated Multi-File React leaderboard. When Code Arena first launched, we evaluated models on single-file HTML. Then we raised the bar → multi-file React apps (routing, hooks, components, state management) and now have a leaderboard to match!
https://x.com/arena/status/2027114744847720782

We just launched the SWE-bench Multilingual leaderboard! It’s a set of 300 tasks in 9 programming languages; none of these tasks were in SWE-bench Verified. State-of-the-art is 72% here, so lots of room for growth.
https://x.com/OfirPress/status/2026324248973689068

Can an agent survive as a worker in a real economy? Here is a super interesting economic benchmark for AI agents – ClawWork. It’s like a real-world labor market for LLM-based agents that evaluates them in an economic survival loop. ClawWork turns agents into AI coworkers and
https://x.com/TheTuringPost/status/2024960484378816894

Diffusion just rewired the AI speed game. Inception’s Mercury 2 hits 1,000 tokens/second; that’s 10x faster than Claude 4.5 Haiku and GPT-5 Mini, not through custom chips but through a fundamentally different architecture borrowed from image generators like Midjourney.
https://x.com/kimmonismus/status/2026662718321897974

cool idea from DeepSeek in their DualPath paper! instead of loading all KVs directly onto GPUs from local NVMe (or DRAM) and bottlenecking on the local PCIe bus, they can stage the KVs in the DRAM on the decode GPU servers, and then transfer the KVs to the prefill GPUs via
https://x.com/JordanNanos/status/2027126010576298469

Gemini 3.1 Pro scores 72.1% on WeirdML, up from 69.9% for Gemini 3.0. Gemini 3.1 seems to have the highest peak performance of any model, but also some weird weaknesses. It uses almost 3 times as many output tokens as 3.0; considering this, the increase
https://x.com/htihle/status/2025867003550958018

GPT-5.2-chat-latest, the newest model powering ChatGPT, is now in the Text Arena top 5! Highlights: ▪️Top 5 scoring 1478 on par with Gemini-3-Pro ▪️+40pt improvement over the GPT-5.2 model ▪️Top in key categories: Multi-Turn, Instruction-Following, Hard Prompts, Coding A strong
https://x.com/arena/status/2025966052950315340

📊Noticeable improvements with @OpenAI’s GPT-5.2-Chat-Latest vs GPT-5.2 (#5 vs #29 Overall) Where GPT-5.2-Chat-Latest gains: Text: – Coding (+13: #6 vs #19) – Hard Prompts (+21: #4 vs #25) – Instruction Following (+21: #7 vs #28) – Longer Query (+10: #14 vs #24) – English (+33:
https://x.com/arena/status/2025986008484061391

Big news today if you’re into coding evals: SWE-Bench Verified is dead!! https://t.co/SPApcuM5uW i’m not sure if @HamelHusain is tired of me tagging him but it turns out @OpenAI really did look back at their own 2024 work and then you 1) look at the CoT and 2) look at the
https://x.com/swyx/status/2026029120040137066

Code → design → code Generate design files from code, collaborate in @Figma, and implement updates all within Codex without breaking your flow.
https://x.com/OpenAIDevs/status/2027062351724527723

I experienced a very similar transition in December. However, for higher-complexity tasks (ML-related), we are still not there yet. Two days ago I had GPT-5.2-PRO-ET and DeepThink argue for hours, converge, be happy, yet they missed a very obvious math issue. Still a huge unlock
https://x.com/MParakhin/status/2027027034828902421

Introducing WebSockets in the Responses API. Built for low-latency, long-running agents with heavy tool calls. https://x.com/OpenAIDevs/status/2026025368650690932

“The Codex app lets you go further, do more in parallel, and go deeper on the problems you care about.” — @gdb
https://x.com/OpenAIDevs/status/2024212279215198396

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench
https://x.com/OpenAIDevs/status/2026002219909427270

uhhh WTF?! gpt-5.3-codex gets 86% on IBench, beating out all other models massively. I was NOT expecting this
https://x.com/adonis_singh/status/2026456939224510848

We expanded file input types so you can now pass docx, pptx, csv, xlsx, and more directly to the Responses API. Your agents can now pull context from real-world files and generate more accurate outputs. https://x.com/OpenAIDevs/status/2026420817568084436

We tested @OpenAI’s new WebSocket connection mode for the Responses API into Cline and the early numbers are wild. Instead of resending full context every turn, WebSocket mode keeps a persistent connection, sends only incremental inputs. With 5.2 Codex results vs the standard
https://x.com/cline/status/2026031848791630033
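The traffic savings Cline describes can be sketched with some toy bookkeeping (this is not the actual Responses API wire format, just an illustration of full-context resends vs. incremental inputs over a persistent connection):

```python
class StatelessClient:
    """Resends the full conversation every turn (HTTP-style)."""
    def __init__(self):
        self.history = []
        self.bytes_sent = 0

    def send_turn(self, message: str) -> None:
        self.history.append(message)
        payload = "\n".join(self.history)        # full context each time
        self.bytes_sent += len(payload.encode())


class SessionClient:
    """Persistent connection; sends only the new input (WebSocket-style)."""
    def __init__(self):
        self.bytes_sent = 0

    def send_turn(self, message: str) -> None:
        self.bytes_sent += len(message.encode())  # delta only


stateless, session = StatelessClient(), SessionClient()
for turn in ["hello" * 100, "fix the bug" * 100, "now add tests" * 100]:
    stateless.send_turn(turn)
    session.send_turn(turn)

# Stateless traffic grows quadratically with turn count; session traffic
# grows linearly, which is why long agentic runs benefit most.
assert stateless.bytes_sent > session.bytes_sent
```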

What did you build with Codex this weekend?
https://x.com/OpenAIDevs/status/2025712197100589353

✨ Run it now with SGLang! Chong!
https://x.com/Alibaba_Qwen/status/2026348924433477775

📊With all the Qwen-3.5 scores out for Text, Code and Vision, let’s compare the evolution of Qwen-3.5 (397B-A17B) vs Qwen-3.0 (235B-A22B). This is a +24 rank jump in Text. Here is where Qwen-3.5 gains the most: Text: – Overall (+24: #19 vs #43) – English (+25: #21 vs #46) –
https://x.com/arena/status/2026404630297719100

🔥 Qwen 3.5 Medium Model Series FP8 weights are now open and ready for deployment! Native support for vLLM and SGLang. Check the model card for example code. ⚡️ Optimize your workflow with FP8 precision. 👇 Get the weights: Hugging Face:
https://x.com/Alibaba_Qwen/status/2026682179305275758

🚩Qwen3.5 INT4 model is now available! https://t.co/rY5GrT3b60 @Alibaba_Qwen @JustinLin610
https://x.com/HaihaoShen/status/2026208062009426209

A big jump in intelligence-per-watt today: “Qwen3.5-35B-A3B now surpasses Qwen3-235B-A22B-2507”
https://x.com/awnihannun/status/2026353100144218569

Huge thanks to the @vllm_project for the Day-0 support on the Qwen3.5 Medium Series 🚀
https://x.com/Alibaba_Qwen/status/2026496673179181292

Minimax M2.5 GGUFs (from Q4 down to Q1) perform poorly overall. None of them come close to the original model. That’s very different from my Qwen3.5 GGUF evaluations, where even TQ1_0 held up well enough. Lessons: – Models aren’t equally robust, even under otherwise very good
https://x.com/bnjmn_marie/status/2027043753484021810

Qwen 3.5 family is here! > vision built-in, and can outperform previous VL models > designed to be more efficient > expanded support for more languages 35B: (fits on 24GB+ system) ollama run qwen3.5:35b 122B: ollama run qwen3.5:122b 397B (cloud only): ollama run
https://x.com/ollama/status/2026598944177009147

Qwen3.5-35B-A3B is now in Jan 🔥
https://x.com/Alibaba_Qwen/status/2026660582221558190

Qwen3.5-35B-A3B is now live in LM Studio 🚀
https://x.com/Alibaba_Qwen/status/2026496880285462962

Taken at face value, this is… somewhat catastrophic for MoEs, as @YouJiacheng notes. By right, a 397B-A17B ought to have a higher “power level” than a dense 27B. Also a big W for Qwen’s integrity and HLE eval quality, I guess. 397B is certainly better at memorization.
https://x.com/teortaxesTex/status/2026690994029072512

the conclusion should not be about moe vs dense, but that you can “benchmaxx” (not always a bad thing btw) HLE with tools no matter the model size. the difference between Qwen3.5-35B-A3B and Qwen3.5-397B-A17B is only 1 point
https://x.com/eliebakouch/status/2026727151978840105

The new Qwen3.5 Medium models are ready to run 🔥 GGUF support is here! Big thanks to @UnslothAI for making it happen so quickly 🚀
https://x.com/Alibaba_Qwen/status/2026497723944546395

The Qwen3.5 series maintains near-lossless accuracy under 4-bit weight and KV cache quantization. In terms of long-context efficiency: Qwen3.5-27B supports 800K+ context length Qwen3.5-35B-A3B exceeds 1M context on consumer-grade GPUs with 32GB VRAM Qwen3.5-122B-A10B supports
https://x.com/Alibaba_Qwen/status/2026502059479179602

Why do benchmarks like Peter’s “Bullshit Benchmark” or my ShizoBench matter so much, and what do Strawberries have to do with it? I was very skeptical of the performance of Qwen3.5-27B on the ArtificialAnalysis leaderboard. So I’m testing the model myself a bit. Naturally I tried the
https://x.com/scaling01/status/2027110908775002312

Qwen3.5-397B-A17B is currently the #1 trending model on Hugging Face. 🏆 This flagship open-weight model is designed for high-performance inference and complex reasoning. 🚀 Try it now on Hugging Face: https://x.com/Ali_TongyiLab/status/2026211680653611174

@MParakhin the more i use opus the more disappointed i get, all the models are worse than useless once you go out of the training distribution. amazing at tailwind landing pages, absolute trash on any advanced ML or data engineering tasks
https://x.com/michalwols/status/2027031882974613836

🔥New Paper drop: The Diffusion Duality (Ch. 2): 𝚿-Samplers #ICLR2026 🚀 Inference‑time scaling for uniform diffusion‑LLMs (Duo) 🥊 Beats Masked diffusion on text + image generation 🔖 https://t.co/dFyLuCkuhR 🌐 https://t.co/q6okZIOcUb 🖥️ https://t.co/oYE9hDYrGI w/
https://x.com/ssahoo_/status/2026487124493742406

35B-A3B is all you need Enjoy
https://x.com/terryyuezhuo/status/2026344442186326332

Becoming great at inference today is similar to entering distributed systems in the early cloud era. The field is young, the problems are technical, and the upside is significant. In Inference Engineering, @philipkiely lays out the stack from CUDA to Kubernetes and connects it
https://x.com/JayminSOfficial/status/2025996744509804865

big fan of ontology btw, but noted. building the proprietary data fusion layer this weekend 🫡
https://x.com/bilawalsidhu/status/2024984046997049514

Chinese labs have started to publish serious papers on CoT engineering. We’re moving from simplistic length penalties and curricula to integrated pipelines forcing compression. But it’s still not bitter enough. I hope ByteDance’s crazy work will lead to some synthesis.
https://x.com/teortaxesTex/status/2025817199764500789

damn i love our medium size stuffs. things i use locally. oh try 122, sth new and sth quite nice!
https://x.com/JustinLin610/status/2026343725719568395

Data has fueled progress in AI over the last 2 decades. So, it is our natural starting point. Our North Star is to make the entire AI stack from data to interface adaptable. Today we announce our early access program to Adaptive data. ✨
https://x.com/adaption_ai/status/2026281291847446721

Every AI system is a reflection of its data. But most AI is built on static datasets — frozen snapshots of a reality that keeps moving. Inputs shift. Usage evolves. Your AI drifts. Adaptive Data closes that loop. 82% avg quality gains. 242 languages. Early access is open.
https://x.com/sudip_r0y/status/2026286762851774475

I spent 100 hours over the past week researching, writing and editing the piece we just put out. It’s a scenario, not a prediction like most of our work. But it was rigorously constructed; dismissing it outright requires the kind of intellectual laziness that tends to get
https://x.com/Citrini7/status/2025668400396349476?s=20

I spent a humiliating amount of time learning how to make animated graphs, just to illustrate a fairly obvious point. “Forecasting s-curves is hard” My views on why carefully following daily figures is unlikely to provide insight. https://x.com/clcrozier/status/1251148890595708938

I’m curious. 🧐 What model harness is working best for you?
https://x.com/PaulSolt/status/2023534305855856726

Impressive inference speed from Inception Labs’ diffusion LLMs. Diffusion LLMs are a fascinating alternative to conventional autoregressive LLMs. Well done @StefanoErmon and team!
https://x.com/AndrewYNg/status/2026478474681262576

Inception Labs has launched Mercury 2, their next generation production-ready Diffusion LLM. Mercury 2 achieves >1,000 output tokens/s with significant gains in intelligence @_inception_ai’s Diffusion LLMs (“dLLMs”) use a different architecture compared to autoregressive based
https://x.com/ArtificialAnlys/status/2026360491799621744

Inference Engineering launches today. https://x.com/philipkiely/status/2025994823891914795

Inference is the most under-discussed layer in AI. Everyone talks about model training, yet production inference is where companies win or lose on latency, cost, and reliability. In his new book Inference Engineering, @philipkiely breaks down the full stack from GPUs to
https://x.com/hasantoxr/status/2025996746133049498

Instead of scaling reasoning length, sometimes it’s more useful to work on reasoning strategy @ZJU_China and @AntGroup presented InftyThink+ It’s a method that teaches models when to pause, summarize and continue reasoning in iterative loops using trajectory-level RL. This new
https://x.com/TheTuringPost/status/2024965754878365745
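The pause/summarize/continue pattern can be illustrated with a stub in place of the model (a toy sketch of the general idea, not InftyThink+'s actual training or rollout procedure; `reason_step` and `summarize` are hypothetical stand-ins for model calls):

```python
def iterative_reasoning(question, reason_step, summarize, max_rounds=4):
    """Alternate bounded reasoning segments with summaries, instead of
    growing one ever-longer chain of thought."""
    summary = ""
    for _ in range(max_rounds):
        segment = reason_step(question, summary)    # short reasoning burst
        if segment.get("answer") is not None:       # model decided to stop
            return segment["answer"]
        summary = summarize(summary, segment["thoughts"])  # compress state
    return None  # gave up within the round budget


# Stub "model": counts up to a target, carrying progress in the summary.
def reason_step(question, summary):
    n = int(summary or 0) + 1
    if n >= 3:
        return {"answer": n, "thoughts": None}
    return {"answer": None, "thoughts": str(n)}

def summarize(summary, thoughts):
    return thoughts  # keep only the latest progress marker

print(iterative_reasoning("count to 3", reason_step, summarize))  # → 3
```

The point of the structure: context per round stays bounded by the summary size, so reasoning depth is no longer capped by context length.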

Interesting study on properly using PDF data with LLMs. OCR extraction / text-based representation has a clear edge over image representations on QA tasks. For pure info retrieval, both images and text perform similarly, and we can also combine the techniques together into a
https://x.com/cwolferesearch/status/2026344301907583469

Introducing Mercury 2 – Inception https://www.inceptionlabs.ai/blog/introducing-mercury-2

Meet LFM2-24B-A2B, @liquidai’s largest model. 24B MoE, 2B active. Blazing fast even on CPU. Available now in LM Studio 👾💧
https://x.com/lmstudio/status/2026322404142633131

Mercury 2 doesn’t just make reasoning models faster. It makes them native. Every reasoning model today is built on autoregressive generation, where the model writes one word at a time, left to right, like typing on a keyboard. Each word waits for the previous one to finish.
https://x.com/LiorOnAI/status/2026376138428395908
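The contrast the thread draws can be caricatured in a few lines. This is a deliberately crude simulation of the step-count difference only (it has nothing to do with Mercury's actual sampler, and real diffusion LLMs refine all positions jointly with learned denoising):

```python
# Toy contrast between decoding styles: an autoregressive decoder emits
# one token per sequential step, while a diffusion-style decoder
# finalizes many positions per denoising round.

def autoregressive_decode(target):
    out, steps = [], 0
    for tok in target:               # each token waits on the previous one
        out.append(tok)
        steps += 1
    return out, steps

def diffusion_decode(target, rounds=3):
    out = ["<mask>"] * len(target)
    per_round = -(-len(target) // rounds)  # ceil: positions fixed per round
    for r in range(rounds):
        # stand-in for a denoising round: finalize a batch of positions
        for i in range(r * per_round, min((r + 1) * per_round, len(target))):
            out[i] = target[i]
    return out, rounds

target = "the quick brown fox jumps over the lazy dog".split()
_, ar_steps = autoregressive_decode(target)
_, dd_steps = diffusion_decode(target)
assert dd_steps < ar_steps  # fewer sequential steps for the same output
```

Sequential steps, not arithmetic, are what bound latency on hardware that can parallelize within a step; that is the lever the diffusion approach pulls.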

Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting
https://x.com/StefanoErmon/status/2026340720064520670

MoEs are first class citizens in transformers now 🔥 so we dropped an explainer on all the MoE goodies in transformers 🤗 don’t walk, run to read
https://x.com/mervenoyann/status/2026999892099354853

MultiShotMaster: A Controllable Multi-Shot Video Generation Framework. Paper: https://x.com/_akhaliq/status/2025951076579475640

New blog post: how we built infrastructure to enable interp at trillion-parameter scale with minimal inference overhead. In a couple short years, interpretability has gone from toy models to the frontier. (1/6)
https://x.com/GoodfireAI/status/2026748839303246238

New optimizer with earth-shattering plots making the rounds, and published in Nature too (Machine Intelligence, but let’s just drop that part.) So of course I had to take a quick look. A few things I noticed that make me a bit sus, though I’m not saying to outright discard it –
https://x.com/giffmana/status/2026223201957597563

New research on making chain-of-thought reasoning actually monitorable. LLMs show their reasoning through a chain of thought. But can we actually trust what’s in those reasoning traces? Can monitors reliably detect when something goes wrong? This paper applies information
https://x.com/dair_ai/status/2026043400861122709

Next edit suggestions just leveled up in @code: with long-distance NES, you get edit suggestions anywhere in your file, not just near your cursor’s position. Learn how the team built this – creating the training dataset, refining the UX, evaluating success, & more:
https://x.com/code/status/2027093279762747526

Nodes aren’t the future of AI creation. Here’s what comes next. 0:00 – Nodes aren’t the future 0:32 – The viewport problem 1:38 – Plumbing vs filmmaking 2:40 – No viewport; just a slot machine 3:22 – Stacks, nodes, and spatial engines 5:02 – New Lego blocks every two weeks 6:49
https://x.com/bilawalsidhu/status/2025674072097431851

Octo-Bouncer. Highly precise stepper motor driving with a Teensy 4.0 and custom pulse generating algorithm and PC based image processing with the goal of getting a machine to juggle a ping pong ball. [📍 bookmark for later to give it a try!] GitHub: https://x.com/IlirAliu_/status/2025649078248308764

ok silly question but why did it take so long to add websocket support or more like why was it not provided earlier
https://x.com/dejavucoder/status/2026219239477215657

Read this blogpost! But above all, read it as a starting point for discussion, so that we can finally have a broad and joint conversation about the near future and search for solutions together! Citrini Research sketches a “future macro memo” scenario where rapidly improving,
https://x.com/kimmonismus/status/2025914288439771171

soon i will have inference kernels for sm80, sm87, sm89, sm90, sm100, and sm110. going to skip sm120 because i don’t know anyone running 5090s after that, i can rest for a quarter before vera rubin hits the market
https://x.com/vikhyatk/status/2027002892083986624

Special cross-post featuring @olive_jy_song of @MiniMax_AI, from @swyx’s @aiDotEngineer and @Kseniase_’s @TheTuringPost From the ups, downs, & surprises of RL post-training, to battles with reward hacking, to painstaking debugging… Chinese researchers are … just like us!
https://x.com/labenz/status/2025735906762936699

Spiders use hydraulic pressure rather than antagonistic muscles to extend their legs. In 2022, Rice University researchers triggered movement by injecting air into dead spiders, creating grippers that can lift 130% of their body weight. The study introduced the term
https://x.com/TheHumanoidHub/status/2025652927659159760

Strong Long CoT reasoning has an internal structure, kind of like a molecule. Here are 3 behavior types @ByteDanceOSS highlighted in their recent research: 1. Deep reasoning – It’s about the main reasoning flow with connected logical steps which is like strong chemical bonds
https://x.com/TheTuringPost/status/2026050264122462370

Surprisingly, a reconstruction engine like VGGT is its own best teacher: it can get much stronger through its own self-labelled data! Big kudos especially to @denghilbert and @lucyrchai!
https://x.com/songyoupeng/status/2025965685055320369

Thanks @_akhaliq for sharing our work! We have released a multi-shot video generation model for the community. Code: https://t.co/OQncWQwlRP Project Page: https://t.co/cLgM9c6fZH Our work has been accepted to #CVPR2026! Also First Prize 🥇 at #AAAI2026 CVM Contest 2026!
https://x.com/QingheX42/status/2025953650334679410

That’s an interesting part about training a model, but I also really loved when Olive reflected on what she thought research would be like versus what it actually turned out to be in real life. The interview is worth watching, and I’m thankful to @labenz for sharing it with his
https://x.com/TheTuringPost/status/2025749469178798264

THE 2028 GLOBAL INTELLIGENCE CRISIS https://www.citriniresearch.com/p/2028gic

The Mixture of Experts (MoE) inside 🤗 Transformers is out now! This is going to be a long tweet, so if you just want to jump to the blog, the link is in the thread. We already had a great blog post on MoEs (which has more than 1k upvotes 😯 at the time of writing). The reason
https://x.com/ariG23498/status/2026995823536751072

The path to ubiquitous AI | Taalas https://taalas.com/the-path-to-ubiquitous-ai/

this has been the most exciting project to work on. a lot of foundational infra we’ve built over time (search engine, browsing cloud, memory engine, and more) all came together in one platform. this was also built by a small group of people and a lot of AI coding agents.
https://x.com/denisyarats/status/2026704583817634180

This is the first article I’ve ever written with the express hope that I am wrong. People discussing the topics raised, becoming more proactive and being aware of the risks inherent to what’s happening in technology is how that happens. I’m glad people are trying to prove or
https://x.com/Citrini7/status/2025980800659792270

three categories of reactions to Citrini I see – agreement – nuanced disagreement on mechanics, timelines, damage areas – tryhard sneering to obscure existential dread
https://x.com/teortaxesTex/status/2025894184817684633

Very interesting work that reminds me of driverless cars. Back in 2015 it seemed like they would imminently be everywhere, yet 11 years later, they’re still hard to find (although I am obsessed with Waymoing now). The long tail of rare but critical failures has turned out to be
https://x.com/ahall_research/status/2026338695536848987

We found that with Hermes 4, and its extremely improved math capabilities, its ability to correctly judge math solutions improved proportionally – I’m not convinced you can have a free lunch with a smaller stupider model judging hard problems in a lot of cases. I think it does
https://x.com/Teknium/status/2025740765230682400

We heard your feedback that you wanted better next edit suggestions for far-away edits. Our new long-distance NES solves exactly this!
https://x.com/pierceboggan/status/2027107798061044219

We were able to decompose #reliability into 12 different dimensions. “Evaluating 14 models on two complementary benchmarks, we found that nearly two years of rapid capability progress have produced only modest reliability gains.” #ethics #AI #tech #research
https://x.com/IEthics/status/2026435186704134617

Website: https://t.co/xTaDXBu9cD Codebase and weights: https://t.co/QCQkqPIsHI Whitepaper: https://t.co/K2QCFjboDR Check out @zhengyiluo’s post:
https://x.com/DrJimFan/status/2026350144300658891

What a day that has been defined by the @Citrini7 essay! I wonder how many will still feel the same way about it when they wake up tomorrow as they do today. This is truly a thought-provoking piece, even more so than I had initially appreciated when we chatted about it. I can
https://x.com/stevehou/status/2025797519028936854

WTF Happened In 1971? https://wtfhappenedin1971.com/

WTF Happened in 2025? https://wtfhappened2025.com/

AI 2027 https://ai-2027.com/

How Teens Use and View AI | Pew Research Center https://www.pewresearch.org/internet/2026/02/24/how-teens-use-and-view-ai/


Discover more from Ethan B. Holland
