# Cluster, Route, Escalate: A Cascaded Framework for Cost-Aware LLM Serving

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-25 08:00
- AIHOT score: 44
- AIHOT link: https://aihot.virxact.com/items/cmqz6v5jo004qslj1vtwzfwid
- Original link: https://arxiv.org/abs/2606.27457

## AI Summary

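The paper proposes a two-stage cascade for balancing cost and accuracy in production LLM deployments. Stage 1 clusters queries and assigns each cluster to its most cost-effective model; Stage 2 adds a quality estimation (QE) cascade that escalates low-quality outputs to a stronger model. On the test datasets, the system retains 97-99% of the strongest model's accuracy while reducing Time Per Output Token (TPOT). It needs only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.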

## Body

Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones. To address this challenge, we propose a two-stage cascaded solution. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model. The cost budget for this routing process is set by an interpretable hyperparameter, tuned offline. Stage 2 adds a quality estimation (QE) cascade; when an output from Stage 1 is judged low-quality, the query is escalated to a stronger model. This ensures only hard or low-confidence cases reach the expensive models. On the test datasets, the cascaded system retains 97-99% of the strongest model's accuracy while reducing Time Per Output Token (TPOT). It requires only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
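The two stages described above can be illustrated with a minimal sketch. The abstract does not give the clustering method, the form of the budget hyperparameter, or the QE model, so everything here is an assumption made for illustration: nearest-centroid assignment stands in for clustering, a hypothetical tolerance `tau` (pick the cheapest model whose per-cluster accuracy is within `tau` of the best) stands in for the interpretable hyperparameter, and `embed`, `generate`, `qe_score`, and `ladder` are placeholder names rather than the authors' API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def nearest_cluster(x: Vector, centroids: List[Vector]) -> int:
    """Return the index of the centroid closest to x (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))


def build_routing_table(
    cluster_acc: Dict[int, Dict[str, float]],
    cost: Dict[str, float],
    tau: float,
) -> Dict[int, str]:
    """Stage 1 (assumed rule): per cluster, choose the cheapest model whose
    accuracy, measured from task-correctness labels, is within tau of the best."""
    table: Dict[int, str] = {}
    for cid, acc in cluster_acc.items():
        best = max(acc.values())
        eligible = [m for m, a in acc.items() if a >= best - tau]
        table[cid] = min(eligible, key=lambda m: cost[m])
    return table


def cascade(
    query: str,
    embed: Callable[[str], Vector],
    centroids: List[Vector],
    table: Dict[int, str],
    ladder: List[str],
    generate: Callable[[str, str], str],
    qe_score: Callable[[str, str], float],
    qe_threshold: float,
) -> Tuple[str, List[str]]:
    """Stage 2 (assumed rule): start at the routed model and move up the
    ladder (ordered weakest to strongest) while the QE score is too low."""
    i = ladder.index(table[nearest_cluster(embed(query), centroids)])
    tried: List[str] = []
    while True:
        output = generate(ladder[i], query)
        tried.append(ladder[i])
        # Accept if the output is judged good enough, or if no stronger model remains.
        if qe_score(query, output) >= qe_threshold or i == len(ladder) - 1:
            return output, tried
        i += 1
```

In this sketch, `build_routing_table` would be rerun offline whenever the model pool or the correctness labels change, which mirrors the abstract's claim that no manual reconfiguration is needed; a larger `tau` sends more clusters to cheaper models, and a higher `qe_threshold` escalates more often.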