Announcing ARC-AGI-2 and ARC Prize 2025 https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
“In addition to the ARC-AGI-2 release, we’re launching the ARC Prize 2025 competition, with a $700,000 grand prize for getting to 85%, as well as many other progress prizes. It will be live on Kaggle this week. We’re also reopening our public leaderboard for continuous benchmarking” / X https://x.com/fchollet/status/1904266438959084003
“When working with LLMs I am used to starting “New Conversation” for each request. But there is also the polar opposite approach of keeping one giant conversation going forever. The standard approach can still choose to use a Memory tool to write things down in between” / X https://x.com/karpathy/status/1902737525900525657
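The pattern Karpathy describes is easy to prototype. Below is a minimal sketch of a file-backed memory tool that a per-request "new conversation" workflow could write to between sessions; the MemoryStore class and file format are illustrative assumptions, not any particular product's API.

```python
import json
from pathlib import Path

class MemoryStore:
    """Hypothetical 'memory tool': persists notes between otherwise
    stateless conversations by appending them to a JSON file."""

    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        self.notes = json.loads(self.path.read_text()) if self.path.exists() else []

    def write(self, note: str) -> None:
        self.notes.append(note)
        self.path.write_text(json.dumps(self.notes, indent=2))

    def as_context(self) -> str:
        # Prepend saved notes to the system prompt of each new conversation.
        return "Facts saved from earlier sessions:\n" + "\n".join(f"- {n}" for n in self.notes)
```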
“Ai2 releases a nice tool. Paper Finder. An LLM-based literature search tool that uncovers harder-to-find papers through iterative analysis, citation tracking, and semantic reformulations, achieving 89% perfect relevance in comprehensive mode and 85% in fast mode. @allen_ai ⚙️ https://x.com/rohanpaul_ai/status/1905310924195725551
“Some similarities between our brains & LLMs: “The study revealed a remarkable alignment between the neural activity in the human brain’s speech areas and the model’s speech embeddings & between the neural activity in the brain’s language area and the model’s language embeddings.”” / X https://x.com/emollick/status/1903500731899944995
“This is a more significant paper than people seem to realize. You can give the AI a novel picture of a location and it can, with reasonable accuracy, tell you where it was taken even if it hasn’t “seen” that picture before This is a finding with a lot of real-world implications” / X https://x.com/emollick/status/1903135115334594871
“New agents benchmark: CollaborativeAgentBench is the first benchmark studying collaborative LLM agents that work with humans across multi-turn collaboration on realistic tasks in backend programming & frontend design ⬇️ https://x.com/AIatMeta/status/1903146899458363442
Bria AI’s Cookbook https://bria.ai/bria-cookbook
“Evaluating LLM outputs can be time-consuming and challenging even for experts. That’s where LLM-as-a-Judge comes in. We believe all AI devs should get familiar with this technique. Here is why: LLM-as-a-Judge automates the assessment of LLM outputs by using a specialized LLM https://x.com/dair_ai/status/1903098701440061592
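A minimal sketch of the LLM-as-a-Judge pattern using the OpenAI Python client; the rubric wording, 1-5 scale, and model name are placeholder assumptions, not part of the dair_ai guide.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a strict evaluator. Rate the ANSWER to the QUESTION
for factual accuracy on a 1-5 scale. Reply with only the integer.

QUESTION: {question}
ANSWER: {answer}"""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    # A separate (ideally stronger or specialized) model scores another model's output.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```

In practice the rubric, scale, and tie-breaking matter a lot; pairwise comparison ("which answer is better?") is often more reliable than absolute scoring.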
“@SmokeAwayyy if benchmarks like this were a good measure of intelligence though, then why don’t we use it on people?” / X https://x.com/DavidSHolz/status/1904677609415598357
“man, the back-and-forth benchmarking between LLMs is intense – but no one really talks about how any of it impacts their product” / X https://x.com/DavidSHolz/status/1904673951357559171
“To celebrate the 10th anniversary of the original release, Keras has a brand new homepage! Live now. https://x.com/fchollet/status/1905391839055950032
“We get a bunch of questions on context management in langgraph – this should help!” / X https://x.com/hwchase17/status/1904247784087388252
“Sakana AI super-powers AI reasoning using Japan’s own Sudoku Puzzles! Read more here → https://t.co/Sxqnpi0TuV At @NVIDIAGTC, Llion Jones @YesThisIsLion announced the release of our new reasoning benchmark based on the modern variant Sudoku to challenge the AI community. We” / X https://x.com/SakanaAILabs/status/1902913196358611278
Training and Finetuning Reranker Models with Sentence Transformers v4 https://huggingface.co/blog/train-reranker
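For orientation, here is the inference side of a cross-encoder reranker with Sentence Transformers; the checkpoint named below is a stock public model, not necessarily the one the post trains.

```python
from sentence_transformers import CrossEncoder

# A stock public reranker; swap in a model fine-tuned per the blog post.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do rerankers differ from bi-encoders?"
docs = [
    "Bi-encoders embed the query and document independently.",
    "Cross-encoders score the concatenated query-document pair jointly.",
    "Paris is the capital of France.",
]

# predict() scores each (query, doc) pair; higher means more relevant.
scores = model.predict([(query, d) for d in docs])
for doc, score in sorted(zip(docs, scores), key=lambda t: -t[1]):
    print(f"{score:.3f}  {doc}")
```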
“Large language models face challenges with long sequences due to quadratic computational complexity. This paper introduces RWKV-7 “Goose,” a novel recurrent neural network architecture to address this. RWKV-7 generalizes the delta rule with vector-valued gating and learning https://x.com/rohanpaul_ai/status/1905221042542653703
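For context, the delta rule that RWKV-7 builds on is a rank-1 state update with a scalar learning rate; per the abstract, RWKV-7 promotes the scalar controls to vector-valued (per-channel) gates. Schematic only; see the paper for the exact parameterization.

```latex
% Classic delta rule: erase along key k_t, then write v_t, with scalar rate \beta_t
S_t = S_{t-1}\left(I - \beta_t\, k_t k_t^{\top}\right) + \beta_t\, v_t k_t^{\top}

% RWKV-7 (schematically): scalar decay/rate replaced by per-channel vector gates w_t
S_t = S_{t-1}\,\mathrm{diag}(w_t) - (\text{rank-1 erase term}) + (\text{rank-1 write term})
```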
TaoAvatar https://pixelai-team.github.io/TaoAvatar/
[2503.19683v1] Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection https://arxiv.org/abs/2503.19683v1
“This paper proposes contextual fine-tuning. Uses prompts that mimic human learning to guide model training for better domain knowledge integration. 📌 Contextual Fine-Tuning uses targeted prompts to subtly shift gradients, leading to 1.85%-4.32% better domain adaptation than https://x.com/rohanpaul_ai/status/1905221411188482380
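A minimal sketch of what "prompts that mimic human learning" might look like as a data transform; the prompt texts below are illustrative guesses, not the paper's actual prompts.

```python
import random

# Pedagogical framings, loosely inspired by how humans study (illustrative only).
CONTEXT_PROMPTS = [
    "Relate the following material to concepts you already know:",
    "Read the following critically and identify the key ideas:",
    "Consider how the following could be applied in practice:",
]

def contextualize(document: str) -> str:
    # The prepended prompt steers gradients during fine-tuning without
    # adding domain facts of its own.
    return f"{random.choice(CONTEXT_PROMPTS)}\n\n{document}"

print(contextualize("Beta-blockers lower heart rate by antagonizing beta-1 receptors."))
```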
“This paper proposes Likra, a likelihood-ratio model, to effectively use negative examples and boost accuracy. 📌 Likra leverages negative examples to unlock pre-trained knowledge for sharper accuracy gains. 📌 Negative training significantly enhances model’s ability to discern https://x.com/rohanpaul_ai/status/1904380721889730849
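Reading between the lines of the summary, a likelihood-ratio scorer of roughly this shape seems to be the idea: one model (or head) fit on correct answers, another on plausible-but-wrong ones, with candidates ranked by the log-ratio. My paraphrase, not the paper's notation.

```latex
\mathrm{score}(y \mid x) \;=\; \log p_{\theta^{+}}(y \mid x) \;-\; \log p_{\theta^{-}}(y \mid x)
```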
[2503.17095v1] FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields https://arxiv.org/abs/2503.17095v1
[2503.15893v1] UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis https://arxiv.org/abs/2503.15893v1
“The Whale is BACK!! – looks like a post-training update (i.e. we’re getting better downstream perf) 🔥 https://x.com/reach_vb/status/1904153415665517034
DisentTalk https://kangweiiliu.github.io/DisentTalk/
CoMP https://slimm-x.github.io/comp/
“Balanced disorder sparks fresh AI reasoning over long-haul sessions. A measured dose of chaos drives ongoing conceptual breakthroughs. Researchers show that balancing semantic and structural entropy fuels continuous AI reasoning over multi-day sessions, generating surprising https://x.com/rohanpaul_ai/status/1905232057527386304
[2501.15420v1] Visual Generation Without Guidance https://arxiv.org/abs/2501.15420v1
“This paper is proposing FreqBack, an efficient backdoor attack method. It uses frequency analysis for time series data. 📌 Heatmap pinpoints model’s frequency vulnerabilities for trigger placement. 📌 Frequency triggers optimize attack strength and minimize computation. 📌 https://x.com/rohanpaul_ai/status/1904927827629326510
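A toy, numpy-only illustration of a frequency-domain trigger for time series; the bin index and amplitude here are arbitrary, whereas FreqBack selects them from a model-specific frequency-vulnerability heatmap.

```python
import numpy as np

def add_frequency_trigger(series: np.ndarray, freq_bin: int = 7, amplitude: float = 0.5) -> np.ndarray:
    # Boost a single FFT bin, then transform back: the trigger is a small,
    # spread-out perturbation in the time domain.
    spectrum = np.fft.rfft(series)
    spectrum[freq_bin] += amplitude * len(series)
    return np.fft.irfft(spectrum, n=len(series))

clean = np.sin(np.linspace(0, 8 * np.pi, 256))
poisoned = add_frequency_trigger(clean)
print(np.abs(poisoned - clean).max())
```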
FAR https://farlongctx.github.io/
““$/hour for highly diverse data” Collecting the most valuable training data at the lowest cost, accelerating the shift from teleoperation to autonomy. Even better if the robots are useful during this transition – and customers cover the cost. https://x.com/TheHumanoidHub/status/1903322171881165301
NeRSemble Benchmark https://kaldir.vc.in.tum.de/nersemble_benchmark/
NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes https://3dlg-hcvc.github.io/NuiScene/
“Scaling Laws of Synthetic Data for Language Models “In this work, we systematically investigate the scaling laws of synthetic data by introducing SYNTHLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.” “Key findings https://x.com/iScienceLuvr/status/1904750015647773130
[2503.15621v1] LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning https://arxiv.org/abs/2503.15621v1
[2503.17126] Modifying Large Language Model Post-Training for Diverse Creative Writing https://arxiv.org/abs/2503.17126
teller-avatar.github.io https://teller-avatar.github.io/
“Current machine learning (ML) training struggles to find optimal configurations from vast design spaces. This paper introduces metagradient descent (MGD), a gradient-based method to scalably configure model training. 📌 REPLAY enables scalable metagradient computation via https://x.com/rohanpaul_ai/status/1904886861174366244
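The core idea, differentiating an outer objective through inner training steps, fits in a one-step PyTorch toy; REPLAY is what the paper adds to make this tractable at scale, which this sketch does not show.

```python
import torch

# Toy: metagradient of validation loss w.r.t. the learning rate, through one SGD step.
w = torch.tensor(1.0, requires_grad=True)    # model parameter
lr = torch.tensor(0.1, requires_grad=True)   # training "configuration" (metaparameter)

x_tr, y_tr = torch.tensor([1.0, 2.0]), torch.tensor([2.0, 4.0])
x_va, y_va = torch.tensor([3.0]), torch.tensor([6.0])

train_loss = ((w * x_tr - y_tr) ** 2).mean()
(g,) = torch.autograd.grad(train_loss, w, create_graph=True)
w_next = w - lr * g                          # inner SGD step, kept on the autograd graph

val_loss = ((w_next * x_va - y_va) ** 2).mean()
val_loss.backward()
print(lr.grad)                               # d(val loss) / d(learning rate)
```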
[2503.15126v1] Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation https://arxiv.org/abs/2503.15126v1
“Today, we’re releasing ARC-AGI-2. It’s an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with. It keeps the same format as ARC-AGI-1, while significantly https://x.com/fchollet/status/1904265979192086882
“Google’s impressive Gemini 2.5 Pro (exp) model is now available on the AI SDK Playground. Your Vercel account gets you access to try and compare top models from @xai, @google, @amazon, @openai (incl. GPT-4.5), @anthropicai, @deepseek_ai, & more. https://x.com/rauchg/status/1904637435457527959