Image created with gemini-3.1-flash-image-preview with claude-sonnet-4-5. Image prompt: Using the provided reference image, preserve the exact square faceted perfume bottle with crystal stopper, pure white background, soft shadow, and high-fashion studio lighting. Replace the label text with ‘AGI’ in the same clean black serif font. Change the liquid inside to an iridescent pearl-gold that subtly shifts in tone, suggesting multifaceted intelligence. Add a delicate sterling silver chain draped naturally around the bottle neck with a tiny dainty compass rose pendant–refined, jewelry-box scale, high-fashion charm aesthetic.
It helps to think of ARC-AGI-3 as a different test entirely than the previous ARC-AGIs. It measures different things (though, as in the previous tests, precisely what it measures isn’t clear) and has different rules. That doesn’t mean it isn’t good, but it is its own thing.
https://x.com/emollick/status/2037356753197617409
Kind of want an ARC-AGI-X test where a reputable organization runs it & builds a validated benchmark with outside expert help, but they never disclose the questions or even the nature of the challenges themselves so the tasks can never be targets. All we see is a leaderboard.
https://x.com/emollick/status/2037106065553154521
This is true, but ARC-AGI-3 is also a test designed so that AI gets zero today, just as the earlier ARC-AGI tests were designed. Those tests were then mostly saturated within a year or two. The thing to watch with ARC-AGI-3 is whether we see the same progress.
https://x.com/emollick/status/2038680759305691586
A Mirror Test For LLMs — LessWrong
https://www.lesswrong.com/posts/TfKM9PgztxieEcKiv/a-mirror-test-for-llms
World Reasoning Arena – A comprehensive benchmark for evaluating world models – Exposes a substantial gap between current models and human-level hypothetical reasoning
https://x.com/arankomatsuzaki/status/2038443186255991169




