an abandoned factory in a forest with a trail sign that reads “Technology” --ar 5:3 --style raw

This week’s category cover theme is a sign in a forest. Each category image prompt is a derivative of the formula “an [category-themed object] in a forest with a trail sign that reads ‘[category name]’”. Using a weekly theme cuts the cover creation time to about 20 minutes, down from several hours.
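The weekly formula above can be sketched as a simple template function; the object and category names below are illustrative, not an exhaustive list from the newsletter:

```python
# Hypothetical sketch of the weekly cover-prompt formula described above.
# The themed object and category name are filled into a fixed template,
# with the Midjourney flags (--ar 5:3 --style raw) appended at the end.
def cover_prompt(themed_object: str, category: str) -> str:
    """Fill the weekly template: object + trail sign in a forest."""
    return (f'an {themed_object} in a forest with a trail sign '
            f'that reads "{category}" --ar 5:3 --style raw')

print(cover_prompt("abandoned factory", "Technology"))
```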

Reflections on AI Engineer Summit 2023

Grounded 3D-LLM

“Hallucinations are one of the biggest blockers to production LLMs & agents. No hallucinations (<5%) have been achieved internally — and for customers. We’ve been able to tune LLMs to recall specific key terms and figures with *photographic memory*, e.g., chatting about a product” / X

Mapping the Mind of a Large Language Model | Hacker News

“Nice article in the Financial Times where I explain that Auto-Regressive LLMs are insufficient to reach human-level intelligence (or even cat-level intelligence). But alternative architectures that I call “objective driven” may reach human-level intelligence one day. They use world…” / X

“My statement seems obvious. Still, most recent CLIP-style papers train on EN data, either explicitly, or via more subtle filters (CLIP filter, WordNet classes, …) Why? Several recent papers show this improves quality. But what quality? The usual suspects: ImageNet, COCO…” / X

LangChain v0.2: A Leap Towards Stability

Let’s talk about LLM evaluation

Ten Commandments to Deploy Fine-Tuned Models in Prod – Google Slides

“📄Refreshed docs for LangChain v0.2 We’ve listened to your feedback and made major improvements to our docs. With the release of LangChain v0.2 today, we now have versioned docs, with clearer structure and consolidated content. Our docs are separated into: • Tutorials: 

Improving Prompt Consistency with Structured Generations

“Enjoyed this extremely comprehensive study on predicting language model performance 

“Fun fact: Transformer was almost named “CargoNet” by Noam. I’m glad he was outvoted and history took a different turn. 😅” / X

“Metal’s reports let you run complex, multi-step AI operations on large amounts of company data. A few use cases? ✅ Streamline information requests ✅ Conduct initial ESG diligence ✅ Summarize call transcripts and discover insights 😎 

Auto Wiki by Mutable.ai
View high-quality, automatically-generated documentation for any repository.

“Introducing “Hard Prompts” Category in Arena! In response to the community’s growing interest in evaluating models on more challenging tasks, we are excited to launch the new “Hard Prompts” category. We select user prompts that are more complex, specific, and problem-solving… 

“Nice report on challenges in evaluating LLMs. It also includes a section on best practices for language model evaluation. Great read and lessons on the very difficult task of LLM evaluation. 

“Enabling sparse, foundational LLMs for faster and more efficient models from Neural Magic and Cerebras 

“Introducing new factual knowledge through fine-tuning an LLM will increase the risk of hallucinations. This is what @Google is exploring in this paper and posits LLMs mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more 

“So far there are 3 major use cases for LLMs: 1. StackOverflow replacement 2. Do my homework (might be #1 or tied for #1) 3. Internal enterprise knowledge base There are many smaller use cases, including customer support chatbots, copyediting, spam, etc.” / X

“By end of 2024, steering foundation models in latent/activation space will outperform steering in token space (“prompt eng”) in several large production deployments. I felt skeptical about this in summer ’23, felt vaguely positive in Jan, and now think it’s more likely than not,” / X

“I discovered at ICLR 2024 that a lot of what I take for granted about LLM evaluation is actually not that widely known… So I made a blog! – how do we currently do LLM evaluation? ⚖️ – most importantly, what is it actually useful for? 🤔 

“once a week i tell a founder “stop trying to finetune models, and just go sell, use opus, use 4-turbo, and just raise prices, find value, go sell, and sell to rich people, stop selling to developers, sell to capital allocators, and not wage workers. make your roadster, get the” / X

“We don’t need skip connections or normalization layers either 

“Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? This is one of the more interesting LLM papers I read last week. It reports that LLMs struggle to acquire factual knowledge through fine-tuning. When examples with new knowledge are eventually learned they 

“If 2024 papers are to be trusted: You don’t need (most) attention you don’t need (most) kv cache You don’t need (most) FFN layers You don’t need a reward model You don’t need… all the stuff that still makes frontier models work, ironically” / X

“Achieve the performance of existing transformer LLMs, while requiring 5% of the training cost. 🔥 Paper: Linearizing Large Language Models 📌 Introduces a method called Scalable UPtraining for Recurrent Attention (SUPRA), that allows the conversion of pre-trained LLMs into 

“🧬🧬 LLM Generated UIs We’ve added a series of templates and documentation showing off how to build generative UI applications using LangChain JS/TS & Next.js. These templates include: – 🌆 generative UI in Next.js – 🤖 streaming agent events – 🛠️ streaming tool calls and more! 

“As part of 0.2, we did a docs overhaul: 📃versioned docs 🗺️MUCH simpler navigation 🪆”LangChain over time” section Would love feedback on the new structure and additions!” / X

“A saturated benchmark gives a false impression that the underlying progress is slowing down. Benchmarks are a proxy for what we care about, which is often hard to measure. When they are saturated, they are useless and even misleading.” / X

“It’s nice to have good names for things. I’m proud to have named or been involved in naming a bunch of things at Google over the years, including: MapReduce Bigtable Spanner TensorFlow Tensor Processing Units (TPUs) Pathways Protocol Buffers PaLM Gemini” / X

PEERING THROUGH PREFERENCES: UNRAVELING FEEDBACK ACQUISITION FOR ALIGNING LARGE LANGUAGE MODELS

Chameleon: Mixed-Modal Early-Fusion Foundation Models

[2104.14337] Dynabench: Rethinking Benchmarking in NLP

[2309.03882] Large Language Models Are Not Robust Multiple Choice Selectors

[2311.06233] Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models

[2404.13076] LLM Evaluators Recognize and Favor Their Own Generations

[2405.09789v1] LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

[2405.10508] ART3D: 3D Gaussian Splatting for Text-Guided Artistic Scenes Generation

[2405.10523v1] Smart Expert System: Large Language Models as Text Classifiers

[2405.10612v1] Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers

[2405.12209v1] MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

[2405.12564v1] ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

[2405.12710v1] Text-Video Retrieval with Global-Local Semantic Consistent Learning

[2405.12832v1] Wav-KAN: Wavelet Kolmogorov-Arnold Networks

A root-server at the Internet’s core lost touch with its peers. We still don’t know why. | Ars Technica

“‘we are nowhere near the point of diminishing marginal returns on how powerful we can make AI models as we increase the scale of compute’ 

“(1/9) LLM as a Judge: Numeric Score Evals are Broken!!! LLM Evals are valuable analysis tools. But should you use numeric scores or classes as outputs? 🤔 TLDR: LLM’s suck at continuous ranges ☠️ – use LLM classification evals instead! 🔤 An LLM Score Eval uses an LLM to judge 
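The classification-style eval the thread argues for (discrete labels instead of numeric scores) can be sketched roughly as follows; the prompt wording, label set, and function names are my assumptions, not from the thread:

```python
# Sketch of an LLM-as-judge *classification* eval, following the thread's
# advice to prefer discrete labels over continuous numeric scores.
# The judge model call itself is not shown; in practice you would send
# `build_judge_prompt(...)` to an LLM and pass its reply to `parse_label`.
LABELS = ("correct", "incorrect", "unsure")  # assumed label set

def build_judge_prompt(question: str, answer: str) -> str:
    """Ask the judge model for exactly one label from the closed set."""
    return (
        "You are grading an answer. Reply with exactly one word: "
        f"{', '.join(LABELS)}.\n"
        f"Question: {question}\nAnswer: {answer}\nGrade:"
    )

def parse_label(raw: str) -> str:
    """Map the judge's raw reply onto the closed label set."""
    word = raw.strip().lower().split()[0].strip(".,")
    return word if word in LABELS else "unsure"

print(parse_label(" Correct. "))
```

Constraining the judge to a small label set makes the eval easier to parse and aggregate than a free-floating 1–10 score, which is the thread’s core claim.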

“Layer-Condensed KV Cache for Efficient Inference of Large Language Models Achieves up to 26× higher throughput than standard transformers and competitive performance in language modeling and downstream tasks repo: 

“Observational Scaling Laws and the Predictability of Language Model Performance Presents observational scaling laws – an approach that generalizes existing compute scaling laws to handle multiple model families using a shared, low-dim capability space 

Feature UMAP

First-ever AI Code Interpreter for R

Documentation Refresh for LangChain v0.2

“And it’s out! 😀 A good read if you want to think about doing robust evaluation, going in depths into the nits of it. 

“🚀How can we use LLMs to accelerate scientific discovery? Let’s find out! This year, hundreds of people from across the globe worked together in a hackathon to BUILD groundbreaking prototypes — showing the path to breakthroughs in next generation batteries, sustainability, 

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

We’ve raised $12M in Series A funding to reimagine presentations, powered by AI.

What I’ve Learned Building Interactive Embedding Visualizations – Casey Primozic’s Homepage
https://cprimozic.net/blog/building-embedding-visualizations-from-user-profiles/

Heads up! You’ve reached the end of this category. It may have contained just one or two links, so scroll back up and double-check that you didn’t miss them.

Be Sure To Read This Week’s Main Post:

This week’s executive overview and top links are here:

AI News #34: Week Ending 05/24/2024 with Executive Summary and Top 47 Links

The post you just read is a deep-dive extension of my weekly newsletter, This Week In AI, an executive summary of the top things to know in AI. Each week, I create an accessible overview for laypeople to feel confident they are conversant with the week’s AI developments. I include a curated list of must-click links of the week, to offer everyone a hands-on opportunity to explore the most intriguing updates in artificial intelligence across various categories, including robotics, imagery, video, AR/VR, science, ethics, and more. Beyond the overview, I post these topic-based deeper dives. If you haven’t read this week’s overview, I recommend starting there.

Credits/Sources

Most of these weekly links come from just a few prolific oversharing sources. Please follow them, as they work hard to find the news each week and they make it a lot easier for me to compile.

For previous issues, please visit the archives!

Thanks for reading!
