# River-LLM: A Seamless Early-Exit Mechanism for Large Language Models Based on KV Sharing

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-20 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo933pac00f1sls21f7qpmrj
- Original link: https://arxiv.org/abs/2604.18396

## AI Summary

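River-LLM is a training-free acceleration framework for large language models. Its KV-Shared Exit River mechanism addresses the missing-KV-cache problem in early exit: the historical states that skipped layers would otherwise lack are generated and preserved naturally during the exit process, avoiding costly recomputation or precision loss. The method uses state transition similarity within decoder blocks to predict cumulative KV error and guide exit decisions. On mathematical reasoning and code generation tasks it achieves a practical inference speedup of 1.71× to 2.16× while maintaining high generation quality.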

## Main Text

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a practical speedup of 1.71× to 2.16× while maintaining high generation quality.
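To make the KV Cache Absence problem concrete, the following is a minimal Python sketch of naive token-level early exit in a toy decoder. It only illustrates the problem the abstract describes; it is not River-LLM's KV-Shared Exit River or its cumulative-error predictor. The names `toy_block`, `decode_with_early_exit`, and `missing_kv`, the cosine-similarity exit rule, and the `threshold` parameter are all hypothetical choices made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_block(hidden, layer):
    """Hypothetical stand-in for a decoder block: the update shrinks with depth."""
    step = 1.0 / (2 ** layer)
    return [h + step * ((-1) ** i) for i, h in enumerate(hidden)]


def decode_with_early_exit(token_hiddens, num_layers, threshold):
    """Run each token through the layers, exiting once a block barely changes it.

    Returns (kv_cache, exit_layers). kv_cache[layer] lists the token positions
    for which that layer actually computed (and so cached) K/V entries.
    """
    kv_cache = [[] for _ in range(num_layers)]
    exit_layers = []
    for pos, hidden in enumerate(token_hiddens):
        exit_layer = num_layers - 1
        for layer in range(num_layers):
            kv_cache[layer].append(pos)  # this layer ran, so its K/V exists
            new_hidden = toy_block(hidden, layer)
            # Assumed exit signal: similarity of the block's input and output.
            similarity = cosine_similarity(hidden, new_hidden)
            hidden = new_hidden
            if similarity >= threshold:
                exit_layer = layer
                break  # deeper layers are skipped and cache nothing
        exit_layers.append(exit_layer)
    return kv_cache, exit_layers


def missing_kv(kv_cache, num_tokens):
    """For each layer, list the token positions whose K/V entries are absent."""
    return {
        layer: [p for p in range(num_tokens) if p not in cached]
        for layer, cached in enumerate(kv_cache)
    }


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
    cache, exits = decode_with_early_exit(tokens, num_layers=4, threshold=0.99)
    print("exit layers:", exits)
    print("missing KV per layer:", missing_kv(cache, len(tokens)))
```

Any position reported by `missing_kv` for a deep layer is a gap that a later token reaching that layer would need to attend to. That gap is what recomputation or masking tries to patch in the existing approaches the abstract mentions, and what River-LLM aims to avoid creating in the first place.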