Spotify Labs

Better Experiments with LLM Evals — A funnel, not a fork

May 18, 2026

Level: Intermediate

For: ML Engineers

✦TL;DR

Spotify engineers propose a funnel-based approach to Large Language Model (LLM) evaluation: automated judges score samples for relevance, coherence, and quality at scale, which cuts manual evaluation time by 90%. The funnel filters out low-quality samples and prioritizes high-quality ones, making experimentation and model development more efficient. By concentrating on the top 10% of samples, engineers can achieve similar performance gains with 10% of the manual evaluation effort. The approach is most useful for large-scale model development and production deployment, where efficiency is crucial.

⚡ Key Takeaways

90% reduction in manual evaluation time
Automated judges assess relevance, coherence, and quality at scale
Funnel-based evaluation process prioritizes high-quality samples
Focusing on the top 10% of samples achieves similar performance gains with 10% of the manual effort
Suitable for large-scale model development and deployment in production environments
Requires manual evaluation to validate the funnel-based approach

💡 Why It Matters

This approach can significantly reduce the time and effort required for LLM evaluations, letting engineers run experiments and develop models more efficiently. That matters most in production environments, where model updates and deployments need to be frequent.

✅ Practical Steps

Implement automated judges for relevance, coherence, and quality assessments
Design a funnel-based evaluation process to prioritize high-quality samples
Validate the funnel-based approach with manual evaluations
Integrate the funnel-based evaluation process into existing model development workflows
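
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the original article: the `Sample` type, the judge interface, the equal weighting of criteria, and the `top_fraction` parameter are all hypothetical, and the two toy judges are simple heuristics standing in for real LLM-backed judges.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    sample_id: str
    prompt: str
    response: str


# Hypothetical judge interface: one callable per criterion, returning a score in [0, 1].
# In practice each judge would wrap an LLM call; here they are plain functions.
Judge = Callable[[Sample], float]


def score_samples(samples: List[Sample], judges: Dict[str, Judge]) -> Dict[str, float]:
    """Stage 1: average all judge scores per sample (equal weights are an assumption)."""
    if not judges:
        raise ValueError("at least one judge is required")
    return {
        s.sample_id: sum(judge(s) for judge in judges.values()) / len(judges)
        for s in samples
    }


def select_for_manual_review(scores: Dict[str, float], top_fraction: float = 0.10) -> List[str]:
    """Stage 2: keep the highest-scoring fraction of samples, at least one if any exist."""
    if not scores:
        return []
    keep = max(1, round(len(scores) * top_fraction))
    ranked = sorted(scores, key=lambda sid: scores[sid], reverse=True)
    return ranked[:keep]


def agreement_rate(judge_pass: Dict[str, bool], human_pass: Dict[str, bool]) -> float:
    """Validation: share of jointly labeled samples where judge and human verdicts match."""
    shared = judge_pass.keys() & human_pass.keys()
    if not shared:
        raise ValueError("no samples were labeled by both the judge and a human")
    return sum(judge_pass[sid] == human_pass[sid] for sid in shared) / len(shared)


# Toy stand-in judges, for illustration only.
def toy_relevance(sample: Sample) -> float:
    prompt_words = set(sample.prompt.lower().split())
    response_words = set(sample.response.lower().split())
    return len(prompt_words & response_words) / max(1, len(prompt_words))


def toy_coherence(sample: Sample) -> float:
    return 1.0 if sample.response.strip().endswith(".") else 0.5


samples = [
    Sample("a", "recommend upbeat jazz", "Here is some upbeat jazz."),
    Sample("b", "recommend upbeat jazz", "I like trains"),
]
scores = score_samples(samples, {"relevance": toy_relevance, "coherence": toy_coherence})
to_review = select_for_manual_review(scores, top_fraction=0.5)
```

Manual reviewers would then label only the samples in `to_review`, and `agreement_rate` gives a rough check on whether the automated judges can be trusted before the funnel is wired into an existing workflow. A low agreement rate would suggest revising the judges rather than narrowing the funnel further.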

