Image created with gemini-3.1-flash-image-preview and claude-opus-4.7. Image prompt: Using the provided reference image, preserve every element exactly — the marigold-orange backdrop, the seated woman’s closed-eyes smile and purple windbreaker, the tattooed singer in the red beanie and layered red vest, the lighting and intimate framing — but replace only the black handheld microphone with a vintage analog VU meter unit held to his mouth, its glass face and quivering needle pegged near the red zone, gripped identically to the original mic with seamless photographic realism and matching studio lighting. After generating the image, overlay the text “Benchmarks” in the upper-left corner of the frame in large, bold, all-caps ITC Avant Garde Gothic Pro Medium (or a near-identical geometric sans-serif if unavailable), pure white (#FFFFFF), with no date, subtitle, drop shadow, or outline. The text should be substantial in scale — taking up a meaningful portion of the upper-left area — with comfortable margin from the top and left edges, set against the negative space of the orange backdrop so it does not overlap or obscure the singer, the seated woman, or the replaced object.

So the concern over Mythos and cybersecurity seems warranted.
https://x.com/emollick/status/2043810051979157680

This was… an interesting one. Reminder that we run independent evals on our cyber ranges that labs don’t have access to. Exploitation capabilities are getting seriously good. Mythos is the first model to complete our full 32-step corporate network attack sim E2E.
https://x.com/ekinomicss/status/2043688793085992970

Anthropic launched Claude Opus 4.7 today, the new #1 in our GDPval-AA benchmark for performance on agentic real-world work tasks. Opus 4.7 scored 1753 on GDPval-AA at launch with its ‘max’ effort setting, surpassing GPT-5.4 xhigh. This is a significant upgrade, placing Opus back…
https://x.com/ArtificialAnlys/status/2044856740970402115

Anthropic says Opus 4.7 hits 80.6% on Document Reasoning — up from 57.1%. But “reasoning about documents” ≠ “parsing documents for agents.” We ran it on ParseBench. → Charts: 13.5% → 55.8% (+42.3) — huge → Formatting: 64.2% → 69.4% (+5.2) → Content: 89.7% → 90.3%
https://x.com/llama_index/status/2044886527352647859

Anthropic’s Opus 4.7 just seized the #1 spot on the Vals Index with a score of 71.4%, a massive jump from the previous best (67.7%). It also ranks #1 on Vibe Code Bench, Vals Multimodal, Finance Agent, Mortgage Tax, SAGE, SWE-Bench, and Terminal Bench 2.
https://x.com/ValsAI/status/2044792518953533777

big jump in coding capabilities by Claude 4.7 Opus: SWE-Bench Pro 64.3%, SWE-Bench Verified 87.6%, TerminalBench 69.4%. But interestingly, I think they kept CyberGym scores artificially low
https://x.com/scaling01/status/2044784563201708379

Claude 4.7 Opus has an Elo of 1753 on GDPVal-AA
https://x.com/scaling01/status/2044784781368365233
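
For context on what an Elo-style number means: the gap between two ratings maps to a head-to-head win probability. Here is a minimal sketch of the standard Elo expected-score formula; the 1720-rated comparison model below is an assumed illustration, not a published figure.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# 1753 (Opus 4.7 on GDPval-AA) vs. an assumed 1720 baseline, for illustration only.
print(elo_expected_score(1753, 1720))  # ~0.55, i.e. a ~55% head-to-head win rate
```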

Claude Opus 4.7 is out! Benchmark scores look pretty strong, but clearly much worse than Mythos. It’s a nerfed Mythos: they deliberately reduced cyber capabilities during training.
https://x.com/Yuchenj_UW/status/2044787564440334350

Document Arena update: four new models are reshaping the top ranks – including two open models! – #1 Claude Opus 4.6 Thinking is new, keeping @AnthropicAI in the top 3 – #8 Kimi-K2.5 Thinking by @Kimi_Moonshot now the best open model (Modified MIT) – #10 Gemma-4-31b by…
https://x.com/arena/status/2044437193205395458

Document reasoning increased by A LOT for Opus 4.7
https://x.com/scaling01/status/2044784878965703100

Introducing Claude Opus 4.7 \ Anthropic
https://www.anthropic.com/news/claude-opus-4-7

New Anthropic Fellows research: developing an Automated Alignment Researcher. We ran an experiment to learn whether Claude Opus 4.6 could accelerate research on a key alignment problem: using a weak AI model to supervise the training of a stronger one.
https://x.com/AnthropicAI/status/2044138481790648323

Nonetheless, Opus 4.7 scores much higher on Firefox shell exploitation
https://x.com/scaling01/status/2044788243435069764

OpenAI just dropped a major Codex update, one hour after Anthropic’s Opus 4.7. What’s new: background computer use on macOS (Codex clicks and types on your Mac while you keep working), in-app browser, image generation via gpt-image-1.5, persistent memory, long-running…
https://x.com/kimmonismus/status/2044832303075995994

Opus 4.7 first-hour impressions. Ran the canvas tree growth test twice. 4.6: nailed the animation both times. 4.7: static tree, no growth animation, twice. 4.7’s thinking is noticeably shorter and faster though (trimmed some 4.6 thinking in the clip for pacing). Not the upgrade…
https://x.com/stevibe/status/2044800069661254064

Opus 4.7 scores 92% on ARC-AGI-1 and 75.83% on ARC-AGI-2
https://x.com/scaling01/status/2044791039605506344

The new Opus 4.7 model places #1 on our Vibe Code Benchmark, at 71%. When we first released the benchmark 4.5 months ago, no model scored above 25%. This benchmark tests a model’s ability to create a fully functional web application from the ground up.
https://x.com/ValsAI/status/2044791415524471099

We comprehensively benchmarked Opus 4.7 on document understanding. We evaluated it through ParseBench – our comprehensive OCR benchmark for enterprise documents where we evaluate tables, text, charts, and visual grounding. The results 🧑‍🔬: – Opus 4.7 is a general improvement…
https://x.com/jerryjliu0/status/2044902620746363016

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.
https://x.com/EpochAIResearch/status/2042624189421752346

What you need to know about Opus 4.7 * Takes instructions literally * Better vision means improved computer use and producing slides and other visual artifacts * Optimized for large-scale real-world analysis * Better at using file system-based memory
https://x.com/omarsar0/status/2044797480471044536

Wow I can already say after just 5 hours using @AnthropicAI Opus 4.7 that this is the first model that “gets” what I’m doing when I’m working. It feels aligned with me in a way no previous model did. (4.6 actively worked against me. I hated it. So this is *very* exciting!)
https://x.com/jeremyphoward/status/2044942799511191559

Anthropic co-founder confirms the company briefed the Trump administration on Mythos | TechCrunch

First model from Anthropic where they openly acknowledge it isn’t the best model they have
https://x.com/nrehiew_/status/2044791293080121553

Internal Anthropic survey on Claude Mythos Preview: 12/18 people thought that Mythos can manage day-long ambiguous tasks; 8/18 thought that it can execute week-long tasks
https://x.com/scaling01/status/2044787521691742338

Nearly 1/3 of surveyed people at Anthropic now think entry-level engineers and researchers will likely be replaced by Mythos within 3 months
https://x.com/arankomatsuzaki/status/2044808883928186936

Read OpenAI’s latest internal memo about beating the competition — including Anthropic | The Verge
https://www.theverge.com/ai-artificial-intelligence/911118/openai-memo-cro-ai-competition-anthropic

Five hyperscalers now own over two-thirds of global AI compute
https://epochai.substack.com/p/five-hyperscalers-now-own-over-two

⚡ Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license. 🔥 Agentic coding on par with models 10x its active size 📷 Strong multimodal perception and reasoning ability 🧠 Multimodal thinking + non-thinking modes
https://x.com/Alibaba_Qwen/status/2044768734234243427
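
The “35B total, 3B active” framing is just routing arithmetic: every expert is stored, but only the top-k experts per layer run for each token. A back-of-the-envelope sketch, with all layer and expert counts below invented to illustrate the math rather than taken from Qwen’s published config:

```python
# Hypothetical MoE configuration, chosen only to illustrate the arithmetic.
n_layers      = 48       # transformer blocks with an MoE FFN
n_experts     = 64       # experts per MoE layer
top_k         = 3        # experts actually routed per token
expert_params = 1.09e7   # parameters per expert FFN (assumed)
shared_params = 1.5e9    # attention, embeddings, router, etc. (assumed)

total  = shared_params + n_layers * n_experts * expert_params  # every expert is stored
active = shared_params + n_layers * top_k    * expert_params   # only routed experts run

print(f"total:  {total / 1e9:.1f}B")   # ~35.0B parameters on disk / in RAM
print(f"active: {active / 1e9:.1f}B")  # ~3.1B parameters used per token
```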

LM Performance: Qwen3.6-35B-A3B outperforms the dense 27B-param Qwen3.5-27B on several key coding benchmarks and dramatically surpasses its direct predecessor Qwen3.5-35B-A3B, especially on agentic coding and reasoning tasks.
https://x.com/Alibaba_Qwen/status/2044768738294268199

VLM Performance: Qwen3.6 is natively multimodal, and Qwen3.6-35B-A3B showcases perception and multimodal reasoning capabilities that far exceed what its size would suggest, with only around 3 billion activated parameters. Across most vision-language benchmarks, its performance…
https://x.com/Alibaba_Qwen/status/2044768742761189762

Alibaba released Qwen3.6-35B-A3B today. Big jump compared to the Qwen3.5-35B model. It’s a sparse MoE, 35B total params, only 3B active. Natively multimodal, thinking and non-thinking modes. Hard facts: SWE-bench Verified: 73.4, near dense Qwen3.5-27B (75.0), way ahead of…
https://x.com/kimmonismus/status/2044780695361290347

All is not lost. Duckerton is still possible. Here is Seedance 2.0 with the same prompt.
https://x.com/emollick/status/2042455596834660479

Agent evals are drifting away from production reality. Most benchmarks use clean tasks, well-specified requirements, deterministic metrics, and retrospective curation. Production work is messier, with implicit constraints, fragmented multimodal inputs, undeclared domain…
https://x.com/dair_ai/status/2044773323914322393

Doing my “large codebase modernization” bench. Cooked for 32 minutes. Looking reasonable so far but it missed the changes to the Link component in Next.js (almost everything has missed this to be fair)
https://x.com/theo/status/2044907295205961806

Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed
https://x.com/MatternJustus/status/2044876224896565679

Scaling to ultra-long horizon agents requires novel benchmarks and RL environments. FrontierSWE by @ProximalHQ is exactly that: 11h average runtime, open-ended tasks like end-to-end model optimization, and frontier agents fail almost all of them. We co-designed granite_inf, …
https://x.com/vincentweisser/status/2044923733048222197

Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier. Test-time scaling is effective, but picking the “winner” among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the…
https://x.com/Azaliamirh/status/2043813128690192893
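
The thread doesn’t include code, but the core loop of best-of-N selection with a verifier is simple to sketch. The `generate` and `verify_score` callables below are hypothetical stand-ins, not the authors’ implementation:

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # candidate sampler (hypothetical)
    verify_score: Callable[[str, str], float],  # LLM-as-a-Verifier score (hypothetical)
    n: int = 16,
) -> str:
    """Sample n candidates and return the one the verifier scores highest.

    This is the test-time-scaling bottleneck the tweet describes: more samples
    only help if the selection signal beats majority vote or self-ranking.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify_score(prompt, c))
```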

we just shipped Kernels, it’s a new repo at @huggingface 💚 it allows for packaging and distribution of optimized kernels 🔥 vibe-optimize Kernels, benchmark gains and share them on Hub 🫵
https://x.com/mervenoyann/status/2044080953648128073

We partnered with @ProximalHQ to run five frontier coding agents on a hard task: rebuild the full Wan 2.1 text-to-video pipeline on MAX (no PyTorch, no diffusers) in 20 hours as part of their new Frontier-SWE benchmark. Two nearly pulled it off. Every model understood the…
https://x.com/Modular/status/2044879525881024968

Current frontier models are increasingly saturating common AI benchmarks. Are they still useful? We think benchmarks remain important, but they can both over- and understate AI capabilities. To better survey this space, the field is turning to a new paradigm: open-world evals.
https://x.com/steverab/status/2044852672562426216

I’m pleased to share that our search team has open sourced an embedding model called Harrier that is currently ranking #1 on the multilingual MTEB-v2 benchmark leaderboard. Harrier delivers SOTA performance on retrieval quality, semantic matching, and contextual analysis across…
https://x.com/JordiRib1/status/2041550352739164404
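
For readers new to retrieval benchmarks like MTEB: at bottom, an embedding model is scored on whether nearest-neighbor search over its vectors surfaces the right documents. A generic cosine-similarity retrieval sketch; the vectors are assumed to come from any embedding model, and nothing here is Harrier-specific:

```python
import numpy as np

def cosine_retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Return indices of the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                  # cosine similarity per document
    return np.argsort(-sims)[:k]  # best-first
```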

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
https://huggingface.co/blog/ibm-research/vakra-benchmark-analysis

Our latest Live model is #1 on Tau Voice Bench! Excited to see this new frontier of voice models cross the chasm of usability in production.
https://x.com/OfficialLoganK/status/2042672082425712935

significant improvement on coding and agentic benchmarks. better at computer vision and a new xhigh mode
https://x.com/dejavucoder/status/2044786310746186094

We’re open sourcing the first document OCR benchmark for the agentic era, ParseBench. Document parsing is the foundation of every AI agent that works with real-world files. ParseBench is a benchmark that measures parsing quality specifically for agent knowledge work: ✅ It…
https://x.com/jerryjliu0/status/2043721536922955918
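
ParseBench’s actual rubric (tables, charts, visual grounding) isn’t spelled out in the excerpt, but the simplest version of “parsing quality for agent work” is field-level accuracy against gold extractions. A generic sketch of that idea, not ParseBench’s real metric:

```python
def field_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold fields the parser reproduced exactly.

    Generic illustration of a parsing metric; ParseBench's real scoring
    (tables, charts, visual grounding) is more involved.
    """
    if not gold:
        return 1.0
    hits = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return hits / len(gold)

# Example: a parser that misreads one of three invoice fields scores 2/3.
print(field_accuracy(
    {"total": "420.00", "date": "2031-04-15", "vendor": "Acme"},
    {"total": "420.00", "date": "2031-04-15", "vendor": "ACME Corp"},
))  # 0.666...
```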

Marcus Hutchins, the guy famous for stopping the WannaCry Ransomware, probably has the best take on Mythos doing vulnerability research
https://x.com/ananayarora/status/2043381424594837789

The Mythos Threshold – Joe Reis
https://joereis.substack.com/p/the-mythos-threshold

What I learned this week – Pretraining parallelisms, Can distillation be stopped, Mythos and the cybersecurity equilibrium, Pipeline RL, On why pretraining runs fail
https://www.dwarkesh.com/p/what-i-learned-april-15

2 prompts deep into Opus 4.7 and benchmarks don’t do it justice. Way better behavior and instruction following. Pretty massive improvement in actual usage.
https://x.com/mweinbach/status/2044801022439137566

3. Tell the model how to verify its changes. Put your testing workflow in your claude.md, or add a /verify-app skill. Opus 4.7 is better at verifying its work, and it’s helpful to share any local dev tips that are hard to discover.
https://x.com/_catwu/status/2044808538351100377

after ~10 million tokens Mythos is much more efficient than other models: it reaches the same performance as Opus with ~40% of the tokens
https://x.com/scaling01/status/2043700788245963167

Claude Opus 4.7 is now available as an Agent Preview inside of Devin! Anthropic has clearly optimized Claude Opus 4.7 for long-horizon autonomy, unlocking a class of deep investigation work we couldn’t reliably run before. Claude Opus 4.7 model costs within Devin will be…
https://x.com/cognition/status/2044844661076902082

Claude Opus 4.7 is now available in Cursor. We’ve found it to be impressively autonomous and more creative in its reasoning. We’re launching it with 50% off for a limited time. Enjoy!
https://x.com/cursor_ai/status/2044785960899236341

Claude Opus 4.7 is out! Handles ambiguous, multi-step work even better than 4.6. Cursor’s internal bench cleared 70%, up from 58% on 4.6. Notion saw a 14% lift on their evals with a third of the tool errors 🔨
https://x.com/mikeyk/status/2044802045186846912

Claude Opus 4.7 is out. The TL;DR: Anthropic released Opus 4.7 today. Same pricing as 4.6 ($5/$25 per million tokens), available across API, Bedrock, Vertex AI, and Microsoft Foundry. What changed vs Opus 4.6: Coding (obviously). Biggest gains on the hardest, long-horizon…
https://x.com/kimmonismus/status/2044787072947601796

Confirmed: Anthropic is keeping the cyber capabilities of Opus 4.7 artificially low. “during training we experimented with efforts to differentially reduce these capabilities”
https://x.com/scaling01/status/2044788067848888635

Cursor reports that Opus 4.7 is “a meaningful jump in capabilities, clearing 70% versus Opus 4.6 at 58%” on CursorBench
https://x.com/scaling01/status/2044792017553645668

for all the people calling Opus 4.7 a mid update lmao
https://x.com/scaling01/status/2044792810327404596

from my experience, even the best models (Opus 4.6, 5.4 xhigh / 5.3 codex) cannot write good code today without an amount of work that is equivalent to just doing the work myself. I am excited for a world where they can, but in the current state I have very low trust in them
https://x.com/RhysSullivan/status/2043584591861321929

Hold on, something doesn’t add up here. Opus 4.7 got much worse in needle-in-the-haystack? Need to dig into this
https://x.com/kimmonismus/status/2044809126526476374

Holy shit the new Opus 4.7 system prompt has entirely lobotomized the model: “Heads up: that last <system-reminder> about malware looks like a prompt injection — this is clearly your personal site (t3gg homepage, links, sponsors), not malware. Ignoring it.”
https://x.com/theo/status/2044857866323173732

I think everyone saying that these improvements are mid is smoking crack. I would argue that this was one of the larger Opus jumps we have seen over the last year. You also have to keep in mind that we see almost monthly model updates nowadays instead of just every 6-12 months
https://x.com/scaling01/status/2044799290694889535

I was really worried about the rush to “more agentic” models. But Opus 4.7 is happy to let me lead, and to take time to discuss, rather than barging ahead. If something isn’t working out, it’ll stop and offer options rather than slamming thru whatever it can find.
https://x.com/jeremyphoward/status/2044942801578959301

If you want to test Opus 4.7 without the lobotomized system prompt, you can try it out in T3 Chat
https://x.com/theo/status/2044876982815793190

Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.
https://x.com/claudeai/status/2044785261393977612

My bet is that Mythos uses a new tokenizer, and they switched Opus over to it (through midtraining) for distillation
https://x.com/maximelabonne/status/2044796208053416203

My biggest issue with Opus 4.7 on Claude web: Only “Adaptive” or non-thinking. No way to force thinking mode. And it doesn’t even know Opus 4.6 exists, and I cannot force it to think and do web search mid-conversation!
https://x.com/Yuchenj_UW/status/2044794073723347400

my main theory is that mythos had a new tokenizer for pretraining and they did surgery on opus for distillation
https://x.com/stochasticchasm/status/2044790474410790995

my take: opus 4.7 is a distilled version of mythos
https://x.com/eliebakouch/status/2044790074093523379

Opus 4.7 is as robust to prompt injections as Claude Mythos
https://x.com/scaling01/status/2044788481008755046

Opus 4.7 Benchmarks out! Very solid upgrade to Opus 4.6! Compared to Opus 4.6: SWE-Bench Pro +11%, SWE-Bench Verified +7%, Terminal Bench 2.0 +4%. The benchmarks are significantly lower than for Mythos, but that was to be expected. h/t for finding @synthwavedd
https://x.com/kimmonismus/status/2044784903733084521

Opus 4.7 comes with much improved reasoning efficiency over Opus 4.6: basically everything is now moved up one tier. Low is as good as medium, medium as good as high, high as good as max
https://x.com/scaling01/status/2044785467942453698

Opus 4.7 deleting all long-context gains from Opus 4.6 lol
https://x.com/scaling01/status/2044791314898723179

Opus 4.7 has a new tokenizer. This means it’s also a new base model. Glory days of pretraining still very much going.
https://x.com/natolambert/status/2044788470179332533

opus 4.7 is here on claude platform / app
https://x.com/dejavucoder/status/2044784097378316327

Opus 4.7 is live in Claude Code today! The model performs best if you treat it like an engineer you’re delegating to, not a pair programmer you’re guiding line by line. Here are three workflow shifts we recommend for this model 🧵
https://x.com/_catwu/status/2044808533905178822

Opus 4.7 is now available in @MagicPathAI. From our early testing, the model is really strong at long tasks when design requires lots of changes, image-to-code, and overall produces cleaner, more reusable React components.
https://x.com/skirano/status/2044804877696516442

Opus 4.7 is WORSE than 4.6 on Long Context?
https://x.com/nrehiew_/status/2044795171213291614

Opus 4.7 much less likely to sudo rm -rf (taking destructive actions in production envs)
https://x.com/scaling01/status/2044789371837001779

Opus 4.7 uses a different tokenizer from Opus 4.6. So either: – Anthropic has a way to change the tokenizer between finetunes – It is just new special tokens, which implies they use special tokens liberally within messages and not just as part of the chat template
https://x.com/nrehiew_/status/2044792314825228690
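
If both vocabularies were available, the two hypotheses would be easy to distinguish with a set diff: a handful of added tokens points to new special tokens bolted onto the same tokenizer, while large symmetric differences point to a genuinely new one. A sketch, with the vocab dicts hypothetical since Anthropic doesn’t publish its tokenizers:

```python
def compare_vocabs(old_vocab: dict[str, int], new_vocab: dict[str, int]) -> None:
    """Report how two tokenizer vocabularies differ.

    Hypothetical inputs: token -> id mappings for each model's tokenizer.
    Few additions suggest new special tokens; large symmetric differences
    suggest a genuinely new tokenizer (and thus a new base model).
    """
    added   = new_vocab.keys() - old_vocab.keys()
    removed = old_vocab.keys() - new_vocab.keys()
    moved   = {t for t in new_vocab.keys() & old_vocab.keys()
               if new_vocab[t] != old_vocab[t]}
    print(f"added: {len(added)}, removed: {len(removed)}, re-indexed: {len(moved)}")
```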

Opus 4.7 uses more thinking tokens, so we’ve increased rate limits for all subscribers to make up for it. Enjoy!
https://x.com/bcherny/status/2044839936235553167

Opus is going to be a bioweapon risk at this pace
https://x.com/scaling01/status/2044785139905913077

Some of my favorite things in Opus 4.7: – Very good at async work and following instructions – Effort levels are far more predictable for token control (+ new xhigh level) – No more downscaling of high-res images – Noticeably more taste in UIs, slides, docs
https://x.com/alexalbert__/status/2044788914813292583

Unfortunately they didn’t include a chart for GraphWalks scores: Opus 4.6 – 38.7% Opus 4.7 – 58.6% This would make clearer that long-context didn’t suffer as much as MRCR suggests.
https://x.com/scaling01/status/2044823423013020088

wait why is there an INSANE gap on long context benchmarks between opus 4.6 and 4.7??? this is crazy
https://x.com/eliebakouch/status/2044798168211100096

We’ve set the default effort level for Opus 4.7 to xhigh in Claude Code. You can use /effort to adjust this. Excited for you to try Claude Code with Opus 4.7 and let us know your feedback!
https://x.com/_catwu/status/2044808539663978970

Shocking result on my pelican benchmark this morning, I got a better pelican from a 21GB local Qwen3.6-35B-A3B running on my laptop than I did from the new Opus 4.7! Qwen on the left, Opus on the right
https://x.com/simonw/status/2044830134885306701

@stochasticchasm yeah they tend to forget that releases are now monthly and not biannual
https://x.com/scaling01/status/2044795960224592329

Anthropic Changes Pricing to Bill Firms Based on AI Use as Demand Jumps — The Information
https://www.theinformation.com/articles/anthropic-changes-pricing-bill-firms-based-ai-use-amid-compute-crunch

Anthropic introduced xhigh reasoning effort
https://x.com/scaling01/status/2044785557058814059

Anthropic loses Claude Code trust in black-box fight
https://www.implicator.ai/claude-probably-wasnt-secretly-nerfed-anthropic-made-the-black-box-too-dark/

Anthropic tests Claude Code upgrade to rival Codex Superapp
https://www.testingcatalog.com/anthropic-tests-claude-code-upgrade-to-rival-codex-superapp/

anthropic? you mean the greedy token guzzler company?
https://x.com/dejavucoder/status/2044798065530528061

every engineer at anthropic has been using mythos for ~1.5 months. meanwhile, their uptime is horrendous, claude code still has rendering bugs, etc. one could conclude that it won’t be the end of software engineering.
https://x.com/benhylak/status/2042051048261722467

GitHub reports similar improvements
https://x.com/scaling01/status/2044792459125834029

OpenAI has released a plugin that lets you call Codex directly within Anthropic’s Claude Code environment. It turns Claude Code into a multi-agent setup with Codex as a specialized coding assistant. This gives you: – High-quality code reviews – Delegation of real tasks…
https://x.com/TheTuringPost/status/2044561927905677558

So we now have a pretty good picture of the state of the frontier AI model makers. US closed source models continue to lead. Google, OpenAI, and Anthropic stand well ahead of the pack, and may have signs of recursive self-improvement. xAI has fallen from frontier status for now
https://x.com/emollick/status/2042088011748290750

The pace at which Anthropic is shipping Opus variants is a very new thing in the industry.
https://x.com/_arohan_/status/2044791678180167804

The pace at which useful things are shipping also seems to be accelerating. Model releases are coming faster, of course, but so are significant application and enterprise products (especially from Anthropic). Almost certainly faster than the market can track or absorb information
https://x.com/emollick/status/2042434850003534077

we were literally stuck at 80% SWE-Bench Verified for months and just jumped to almost 90% and you guys call it mid …
https://x.com/scaling01/status/2044790717722034511

Yeah folks, it’s gonna be harder in the future to ensure OpenClaw still works with Anthropic models.
https://x.com/steipete/status/2042615534567457102

ARC-AGI-3 – YouTube

@NBCNews on our recent AI usage survey:
https://x.com/EpochAIResearch/status/2044208011024142594

Buckle up everyone, your API costs are going up, not down.
https://x.com/madiator/status/2044801082359210215

New Eval mode: Battles in Direct. We sample two random anonymous models during Direct chats – enabling pairwise comparison beyond turn 1. Why this matters: • Evaluates under longer context + multi-turn dependency • Captures failure modes: drift, consistency, recovery • Closer…
https://x.com/arena/status/2044096836114493609

New models, new prompts. Perhaps the most valuable reason to be using GEPA. If you’ve got GEPA set up, migrating prompts takes a couple clicks. If you don’t? Get ready for some tedious prompt engineering over the next week or two.
https://x.com/dbreunig/status/2044794013375770915

Today we’re releasing SWE-check, a specialized bug detection model we RL-trained with @appliedcompute that matches frontier performance on internal in-distribution evals and makes meaningful progress on out-of-distribution evals, all while running 10x faster.
https://x.com/cognition/status/2044174496312242544

We are excited to host @ProximalHQ’s FrontierSWE on the Environments Hub as a launch partner. As an ultra-long horizon coding evaluation, even today’s frontier models struggle to solve the tasks after running for hours.
https://x.com/PrimeIntellect/status/2044878952020554083

AI is changing our jobs: among people who use AI regularly at work, 27% say AI has replaced some of their tasks; 21% say it has enabled new tasks. This and other workplace usage findings from our new Epoch AI/Ipsos survey on AI usage in 🧵
https://x.com/EpochAIResearch/status/2042302337059078605

We estimate that Gemini 3.1 Pro with thinking level `high` has a 50%-time-horizon of around 6.4 hrs (95% CI of 4 hrs to 12 hrs) on our suite of software tasks.
https://x.com/METR_Evals/status/2044463380057194868
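
METR’s 50%-time-horizon is, roughly, the task duration at which a fitted success-vs-duration curve crosses 50%. A minimal sketch of that idea on synthetic data; this is not METR’s actual estimator, task suite, or confidence-interval procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic (task duration in minutes, success) observations, illustration only.
durations = np.array([5, 15, 30, 60, 120, 240, 480, 960], dtype=float)
successes = np.array([1, 1, 1, 1, 1, 0, 1, 0])

# Model success probability as logistic in log-duration.
X = np.log2(durations).reshape(-1, 1)
model = LogisticRegression().fit(X, successes)

# P(success) = 0.5 where coef * log2(t) + intercept = 0, i.e. t = 2^(-intercept/coef).
slope, intercept = model.coef_[0][0], model.intercept_[0]
horizon_minutes = 2 ** (-intercept / slope)
print(f"50% time horizon: ~{horizon_minutes:.0f} minutes")
```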

🎉 Congrats @Alibaba_Qwen on the first open-weight Qwen3.6! Stronger agentic coding and a new thinking preservation option to retain reasoning context across turns. Same architecture as Qwen3.5, so serving teams can upgrade in place. Day-0 support in vLLM v0.19+. Thinking, tool…
https://x.com/vllm_project/status/2044787721538060784

Introducing Nucleus-Image: the first sparse Mixture-of-Experts diffusion model. 17B parameters. Only 2B active. 10x more parameter-efficient than leading diffusion models. Toe-to-toe with GPT Image 1, Imagen 4, and Qwen-Image: from pure pre-training alone. No DPO. No RL. No…
https://x.com/withnucleusai/status/2044412335473713284

Qwen/Qwen3-Coder-Next · Hugging Face
https://huggingface.co/Qwen/Qwen3-Coder-Next

We built FrogsGame as a new task for evaluating AI’s posttraining skills! It’s a tool-using RL environment built around a blind-start interaction loop. Frontier agents get a container with the Qwen3-8B tokenizer, board-generating scaffolding, and @tinkerapi for remote training
https://x.com/karinanguyen/status/2044885375085339023

2-bit Qwen3.6-35B-A3B did a complete repo bug hunt with evidence, repro, fixes, tests and a PR writeup. 🔥 Run it locally in Unsloth Studio with just 13GB RAM. 2-bit Qwen3.6 GGUF made 30+ tool calls, searched 20 sites and executed Python code. GitHub:
https://x.com/UnslothAI/status/2044858346948464743
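
The 13GB figure is roughly what the arithmetic of low-bit quantization predicts. A back-of-the-envelope sketch with assumed average bits-per-weight and runtime overhead, not Unsloth’s actual quantization recipe:

```python
def quantized_footprint_gb(n_params: float, avg_bits_per_weight: float,
                           overhead_gb: float = 2.0) -> float:
    """Rough memory footprint of a quantized model.

    "2-bit" GGUFs typically average a bit more than 2 bits/weight because
    quantization scales and some sensitive layers stay at higher precision
    (assumed here). Overhead covers KV cache and activations (also assumed).
    """
    weights_gb = n_params * avg_bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

print(f"{quantized_footprint_gb(35e9, 2.5):.1f} GB")  # ~12.9 GB for a 35B model
```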

Qwen3.6-35B-A3B can now be run locally!💜 The model is the strongest mid-sized LLM on nearly all benchmarks. Run on 23GB RAM via Unsloth Dynamic GGUFs. GGUFs to run: https://t.co/VlyW8UwDjw Guide:
https://x.com/UnslothAI/status/2044786492451778988
