# τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-08 08:00
- AIHOT score: 65
- AIHOT link: https://aihot.virxact.com/items/cmq9i84vz0c48slldahq1qj73
- Original link: https://arxiv.org/abs/2606.10156

## AI Summary

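τ-Rec is an evaluation benchmark for agentic recommender systems that replaces subjective LLM-as-a-judge evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism. The benchmark tests agents against structured catalog predicates and uses a pass^k reliability metric to measure consistent reasoning. An evaluation of nine configurations across five model families (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini) finds a steep reliability cliff: the best model reaches only about 57% at pass^1 and drops to about 38% at pass^4, exposing a critical gap in current conversational agent deployment. All code and data are publicly available.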

## Main Text

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present τ-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, τ-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
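The abstract does not spell out how pass^k is computed. A minimal sketch, assuming τ-Rec follows the pass^k definition introduced with τ-bench (the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials, then averaged over tasks). The function names and the sample counts below are hypothetical and are not taken from the τ-Rec repository.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased per-task estimate of P(all k sampled trials succeed).

    Assumes the tau-bench style estimator C(c, k) / C(n, k); whether
    tau-Rec uses exactly this estimator is an assumption of this sketch.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be in [0, num_trials]")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0
    # whenever a task has fewer than k successful trials.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(pass_hat_k(num_trials, c, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical data: three tasks, four trials each, with 4, 2 and 0 successes.
example_successes = [4, 2, 0]
for k in (1, 2, 4):
    print(k, benchmark_pass_hat_k(example_successes, num_trials=4, k=k))
```

Under this definition pass^1 is the ordinary average success rate, while larger k rewards only tasks that are solved consistently. That is why a model can look acceptable at pass^1 and still drop sharply at pass^4, as in the reliability cliff the abstract reports.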