Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
Abstract
PRECISE extends prediction-powered inference to correct bias in ranking metrics by combining human labels with LLM judgments, achieving reduced standard error and accurate variant ranking in production settings.
With PRECISE, we extend Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. The PPI estimate is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants using 100 human labels, which took 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.
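To make the bias-correction idea concrete, here is a minimal sketch of the generic PPI estimator for a mean, which is the building block the abstract refers to. It is not the PRECISE implementation: it treats each query's metric value as already computed and ignores the per-document to per-query hierarchy that the paper addresses. The function name `ppi_mean` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def ppi_mean(y_human, f_labeled, f_unlabeled):
    """Generic PPI point estimate and standard error for a mean.

    y_human     : human metric values on the small labeled set (size n)
    f_labeled   : LLM-judge metric values on the same n items
    f_unlabeled : LLM-judge metric values on the large unlabeled set (size N)
    """
    y_human = np.asarray(y_human, dtype=float)
    f_labeled = np.asarray(f_labeled, dtype=float)
    f_unlabeled = np.asarray(f_unlabeled, dtype=float)
    n, N = len(y_human), len(f_unlabeled)

    # Rectifier: how far the judge is from the humans on the labeled items.
    rectifier = y_human - f_labeled

    # LLM-only estimate plus the average human-vs-judge correction.
    estimate = f_unlabeled.mean() + rectifier.mean()

    # The two samples are independent, so their variances add.
    se = np.sqrt(f_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    return estimate, se


# Synthetic illustration (made-up data, not the ESCI experiment).
rng = np.random.default_rng(0)
levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # possible Precision@4 values


def noisy_judge(truth):
    # Hypothetical judge: agrees with the human 80% of the time, else reports 1.0.
    agree = rng.random(truth.shape) < 0.8
    return np.where(agree, truth, 1.0)


truth_small = rng.choice(levels, size=30)    # human-labeled queries
truth_large = rng.choice(levels, size=5000)  # queries with LLM judgments only

est, se = ppi_mean(truth_small, noisy_judge(truth_small), noisy_judge(truth_large))
human_only_se = truth_small.std(ddof=1) / np.sqrt(len(truth_small))
print(f"PPI estimate: {est:.3f} (SE {se:.3f}); human-only SE: {human_only_se:.3f}")
```

The correction term removes the judge's average error in expectation, which is why the estimate stays unbiased even for a poor judge. The standard error, however, shrinks only when the judge tracks the human labels closely enough that the rectifier has lower variance than the human labels themselves.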
Community
Neat paper. Using Prediction-Powered Inference to fix bias in LLM-based rankings feels like a really practical way to get more out of a small set of human labels. It is refreshing to see a method that explicitly claims to be unbiased regardless of how bad the LLM judge might be.
I am curious, how sensitive is the reduction in standard error when the LLM judge is significantly less accurate than the one used in the ESCI benchmark experiments?
I made a podcast on it with ResearchPod, which makes it easy to pick up the key concepts on the go:
https://researchpod.app/episode/876681f6-0b3b-404f-a306-ddd1ac6081cb
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Can LLM Rerankers Predict Their Own Ranking Performance? (2026)
- Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation (2026)
- Joint Optimization of Relevance and Engagement in Multi-Task Ranking for E-Commerce with Efficient LLM Supervision (2026)
- From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation (2026)
- Prediction-powered Inference by Mixture of Experts (2026)
- Generating and Refining Dynamic Evaluation Rubrics for LLM-as-a-Judge (2026)
- Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges (2026)