arxiv:2606.05308

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Published on Jun 3

· Submitted by

Abhishek Divekar on Jun 15

Amazon

Abstract

PRECISE extends prediction-powered inference to correct bias in ranking metrics by combining human labels with LLM judgments, achieving reduced standard error and accurate variant ranking in production settings.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.
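To make the bias-correction idea concrete, here is a minimal sketch of the standard prediction-powered estimator for a mean, applied to per-query metric values. It is not the PRECISE implementation: the function name `ppi_mean`, the synthetic data, and the simulated judge bias are all assumptions made for illustration, and the hierarchical per-document construction described in the abstract is not reproduced. The estimate is the LLM-judged mean on the large set plus a "rectifier" (human value minus LLM value) averaged over the small human-labeled set.

```python
import numpy as np

def ppi_mean(y_human, f_labeled, f_unlabeled):
    """Prediction-powered estimate of a mean and its standard error.

    y_human     : human-derived metric values on the small labeled set (size n)
    f_labeled   : LLM-derived metric values on the same n items
    f_unlabeled : LLM-derived metric values on the large LLM-only set (size N)
    """
    y_human = np.asarray(y_human, dtype=float)
    f_labeled = np.asarray(f_labeled, dtype=float)
    f_unlabeled = np.asarray(f_unlabeled, dtype=float)
    n, N = len(y_human), len(f_unlabeled)

    # Rectifier: how far the judge is from the human labels, item by item.
    rectifier = y_human - f_labeled

    # LLM-only mean, shifted by the average judge error.
    estimate = f_unlabeled.mean() + rectifier.mean()

    # The two samples are independent, so their variances add.
    se = np.sqrt(f_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    return estimate, se

# Synthetic illustration (hypothetical data, not from the paper):
# per-query Precision@4 values with a judge that is noisy and too optimistic.
rng = np.random.default_rng(0)
levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

def noisy_judge(truth):
    # Assumed error model: additive bias plus Gaussian noise, clipped to [0, 1].
    return np.clip(truth + 0.1 + rng.normal(0.0, 0.1, size=truth.shape), 0.0, 1.0)

truth_labeled = rng.choice(levels, size=30)
truth_unlabeled = rng.choice(levels, size=5000)
judge_labeled = noisy_judge(truth_labeled)
judge_unlabeled = noisy_judge(truth_unlabeled)

est, se = ppi_mean(truth_labeled, judge_labeled, judge_unlabeled)
human_only_se = truth_labeled.std(ddof=1) / np.sqrt(len(truth_labeled))
print(f"LLM-only mean:   {judge_unlabeled.mean():.3f}")
print(f"PPI estimate:    {est:.3f} (SE {se:.3f})")
print(f"Human-only mean: {truth_labeled.mean():.3f} (SE {human_only_se:.3f})")
```

The standard error has two terms: one shrinks with the size of the LLM-judged set, the other depends on the variance of the rectifier over the human-labeled set. A less accurate judge inflates the second term, so the gain over human-only estimation narrows, while the estimate itself stays unbiased.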

Community

noahml

8 days ago

Neat paper. Using Prediction-Powered Inference to fix bias in LLM-based rankings feels like a really practical way to get more out of a small set of human labels. It is refreshing to see a method that explicitly claims to be unbiased regardless of how bad the LLM judge might be.

I am curious: how sensitive is the reduction in standard error to judge quality, for example when the LLM judge is significantly less accurate than the one used in the ESCI benchmark experiments?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/876681f6-0b3b-404f-a306-ddd1ac6081cb
