# EasyVideoR1: An Easier RL Framework for Video Understanding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-18 08:00
- AIHOT link: https://aihot.virxact.com/items/cmo84qcqu03w7slmlhur0wux0
- Original link: https://arxiv.org/abs/2604.16893

## AI Summary

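EasyVideoR1 is a reinforcement learning framework designed specifically for video understanding tasks. It eliminates redundant video decoding through offline preprocessing and tensor caching, yielding a 1.47× improvement in training throughput. The framework supports unified reward routing across 11 video and image task types, adopts a mixed offline-online data training paradigm, and implements joint image-video training with independently configurable pixel budgets. Its asynchronous evaluation system covers 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores, providing a complete and efficient infrastructure for training vision-language models on video reasoning.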

## Main Text

Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47 times throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.
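To make the idea of a task-aware reward system with unified routing and modular extension more concrete, the following is a minimal sketch of one way such a router could look. It is not taken from EasyVideoR1: the registry, the `register_reward` decorator, the `compute_reward` entry point, and the two task-type names are all hypothetical, and the abstract does not enumerate the 11 task types or describe their reward functions.

```python
from typing import Callable, Dict

# Hypothetical types and registry; none of these names come from EasyVideoR1.
RewardFn = Callable[[str, str], float]
_REWARD_REGISTRY: Dict[str, RewardFn] = {}


def register_reward(task_type: str) -> Callable[[RewardFn], RewardFn]:
    """Register a reward function under a task-type key (modular extension)."""
    def decorator(fn: RewardFn) -> RewardFn:
        _REWARD_REGISTRY[task_type] = fn
        return fn
    return decorator


@register_reward("multiple_choice")  # assumed task-type name
def multiple_choice_reward(prediction: str, target: str) -> float:
    """Exact match on the option letter, ignoring case and whitespace."""
    return 1.0 if prediction.strip().upper() == target.strip().upper() else 0.0


@register_reward("temporal_grounding")  # assumed task-type name
def temporal_grounding_reward(prediction: str, target: str) -> float:
    """Temporal IoU between two 'start,end' spans given in seconds."""
    try:
        p_start, p_end = (float(x) for x in prediction.split(","))
        t_start, t_end = (float(x) for x in target.split(","))
    except ValueError:
        # A malformed span earns no reward instead of crashing the rollout.
        return 0.0
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


def compute_reward(task_type: str, prediction: str, target: str) -> float:
    """Single entry point: route a sample to the reward for its task type."""
    if task_type not in _REWARD_REGISTRY:
        raise ValueError(f"no reward registered for task type {task_type!r}")
    return _REWARD_REGISTRY[task_type](prediction, target)


if __name__ == "__main__":
    print(compute_reward("multiple_choice", " b ", "B"))
    print(compute_reward("temporal_grounding", "0,10", "5,15"))
```

Under this design, supporting a new problem type means writing one decorated function, while the training loop keeps calling the same `compute_reward` entry point for every sample in a mixed image-video batch.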
