If you use LLM-as-a-Judge, this one is worth reading.
(bookmark it)
It's actually one of the most effective ways to use LLM-as-a-Judge for evals.
Holistic judge scores hide both their reasoning and their ceiling effects.
BINEVAL decomposes each evaluation criterion into atomic yes-or-no questions, answers each independently per output, then aggregates the verdicts into calibrated multi-dimensional scores.
Every question-level verdict is inspectable, so you can diagnose exactly why an output scored low, and the same verdicts feed straight back as targeted prompt-improvement signal.
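To make the idea concrete, here is a rough Python sketch of the decompose, answer, aggregate loop. It is not the paper's implementation: the `Question` class, the `evaluate` function, the example questions, and the canned `stub_judge` are all hypothetical, and the aggregation is a plain fraction of "yes" answers per dimension, which stands in for whatever calibration BINEVAL actually uses. In practice the judge would wrap an LLM call that answers one question at a time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Question:
    dimension: str  # evaluation dimension, e.g. "consistency"
    text: str       # one atomic yes/no question


# A judge maps (question text, output) to True for "yes".
# Hypothetical interface: a real judge would call an LLM here.
Judge = Callable[[str, str], bool]


def evaluate(
    output: str, questions: List[Question], judge: Judge
) -> Tuple[Dict[str, float], List[Question]]:
    """Answer every question independently, then aggregate per dimension."""
    verdicts = [(q, judge(q.text, output)) for q in questions]

    by_dimension: Dict[str, List[bool]] = defaultdict(list)
    for q, verdict in verdicts:
        by_dimension[q.dimension].append(verdict)

    # Assumption: score = fraction of "yes" verdicts in each dimension.
    scores = {dim: sum(vs) / len(vs) for dim, vs in by_dimension.items()}

    # The "no" verdicts are the inspectable part: they say why a score is low.
    failed = [q for q, verdict in verdicts if not verdict]
    return scores, failed


# Illustrative questions, not taken from the paper.
QUESTIONS = [
    Question("relevance", "Does the summary mention the main event?"),
    Question("consistency", "Is every number in the summary present in the source?"),
    Question("consistency", "Is every named entity in the summary present in the source?"),
    Question("fluency", "Is the summary free of grammatical errors?"),
]

# Canned answers so the sketch runs without a model.
CANNED = {
    QUESTIONS[0].text: True,
    QUESTIONS[1].text: False,
    QUESTIONS[2].text: True,
    QUESTIONS[3].text: True,
}


def stub_judge(question: str, output: str) -> bool:
    return CANNED[question]


scores, failed = evaluate("The company earned $5M in 2021.", QUESTIONS, stub_judge)
print(scores)
for q in failed:
    print(f"[{q.dimension}] failed: {q.text}")
```

The list of failed questions is what you would hand back to whoever is revising the prompt: each entry names one specific, checkable thing the output got wrong.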
Across SummEval, Topical-Chat, and QAGS, it matches or beats UniEval and G-Eval without any training, with especially strong results on factual consistency.
Paper: https://arxiv.org/abs/2606.27226