Optical Reasoning: Images as a Standalone Reasoning Medium, with 1.96x the Token Efficiency of Text

2026-06-08 22:58

AI Summary

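Optical reasoning treats images as a standalone reasoning medium for both language and multimodal tasks. It comes in two variants: typographic-based optical reasoning, which optimizes the visual layout so that rationales render compactly, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. On mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning matches or even exceeds traditional text reasoning while cutting reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, reaching 1.96 times the token efficiency of text reasoning. The results indicate that images can encode rationales effectively and efficiently while providing a unified visual canvas for reasoning.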

Original abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

HuggingFace Daily Papers (trending community papers)

Read the original: arxiv.org
