Oppo open-sources X-OmniClaw, an Android AI agent that uses your camera, screen, and voice without leaving the phone
Oppo's Multi-X team released X-OmniClaw, an open-source agent that taps into the camera, screen, and voice to get things done in real Android apps, all without routing through a cloud copy of your phone.
In the technical report, Oppo's AI Center draws a clear line between its approach and cloud phone platforms such as RedFinger, Alibaba's Wuying, and Tencent Cloud Phone. Those services run agents inside virtualized Android instances in a data center, so the agents can't reach the user's local sensors, camera, or private data.
X-OmniClaw takes the opposite route and runs directly on the physical Android device. The core logic for perception, control, and app interaction lives on the phone itself; according to the report, a cloud language model is called in only as "fuel" for higher-level reasoning when needed. The report doesn't name the specific local models involved, but it does list components such as an on-device grounding model and OCR for detecting tappable UI elements.
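The report doesn't publish this control loop, so what follows is only a minimal Python sketch of the split it describes: resolve a UI step on the device when possible, and call a cloud model only when local perception can't decide. `ocr_elements`, `ground`, `run_step`, and the `cloud_plan` callback are hypothetical names, and the "models" are trivial stand-ins (string matching), not Oppo's grounding model or OCR.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class UIElement:
    label: str
    x: int
    y: int


Detection = tuple[str, int, int]  # (text, x, y), as an OCR pass might report it


def ocr_elements(detections: list[Detection]) -> list[UIElement]:
    """Stand-in for on-device OCR: wrap raw detections as tappable elements."""
    return [UIElement(text, x, y) for text, x, y in detections]


def ground(target: str, elements: list[UIElement]) -> Optional[UIElement]:
    """Stand-in for an on-device grounding model: case-insensitive label match."""
    for el in elements:
        if el.label.lower() == target.lower():
            return el
    return None


def run_step(
    goal: str,
    detections: list[Detection],
    cloud_plan: Callable[[str, list[str]], str],
) -> tuple[str, Optional[UIElement]]:
    """Resolve one UI step locally if possible; otherwise ask the cloud what to tap."""
    elements = ocr_elements(detections)
    hit = ground(goal, elements)
    if hit is not None:
        return "local", hit
    # Escalation path. In this sketch only the goal and the OCR labels leave the
    # phone, never a screenshot (an assumption, not something the report specifies).
    next_label = cloud_plan(goal, [e.label for e in elements])
    return "cloud", ground(next_label, elements)


# Usage with a fake cloud planner that records every call it receives.
screen = [("Search", 40, 120), ("Cart", 300, 120)]
calls: list[str] = []


def fake_cloud(goal: str, labels: list[str]) -> str:
    calls.append(goal)
    return "Search"


local_result = run_step("search", screen, fake_cloud)
cloud_result = run_step("find Evian spray", screen, fake_cloud)
```

Passing the cloud model in as a callback keeps the escalation point explicit, which makes it easy to inspect and test exactly what would leave the device.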
Camera, screen, and voice feed into a single pipeline
The agent bundles three perception channels into one pipeline. A vision-language model first interprets the scene along with the user's request before triggering any action.
In the researchers' example, a user points the camera at a product and asks, "How much does this cost on Taobao?" The system internally rewrites the request as "price of Evian spray on Taobao" and only then hands this structured intent off for execution.
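A toy version of that rewriting step might look like the following sketch. It assumes the vision-language model has already named the object in the camera frame (passed in as `scene_object`) and uses simple keyword rules where X-OmniClaw would rely on the model itself; `Intent`, `KNOWN_APPS`, and `structure_request` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    action: str  # e.g. "price_lookup" or "search"
    app: str     # target Android app, or "unknown"
    query: str   # self-contained text that names the object instead of "this"


KNOWN_APPS = ("Taobao", "JD", "Meituan")  # hypothetical list of apps the agent can drive
PRICE_WORDS = re.compile(r"\b(how much|price|cost)\b", re.IGNORECASE)


def structure_request(utterance: str, scene_object: str) -> Intent:
    """Combine the spoken request with what the camera sees into one structured intent."""
    lowered = utterance.lower()
    app = next((a for a in KNOWN_APPS if a.lower() in lowered), "unknown")
    if PRICE_WORDS.search(utterance):
        # The query uses the recognized object's name where the user said "this"
        return Intent("price_lookup", app, f"price of {scene_object} on {app}")
    return Intent("search", app, scene_object)


intent = structure_request("How much does this cost on Taobao?", "Evian spray")
print(intent.query)  # price of Evian spray on Taobao
```

Separating interpretation (producing an `Intent`) from execution mirrors the ordering the article describes: nothing is tapped until the request has been turned into an unambiguous query.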
Photo gallery becomes searchable memory
For long-term memory, X-OmniClaw condenses local data into semantic entries. During idle time, gallery photos get processed into compact descriptions of objects, scenes, and events, then stored in a Markdown file.
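As a rough illustration of a Markdown-backed memory, here is a sketch that appends one bullet per photo description and supports a naive keyword lookup. The entry format, the `MemoryEntry`/`append_entries`/`search_memory` names, and the sample captions are all assumptions; the report does not document its file layout, and both idle-time scheduling and the captioning model are left out.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class MemoryEntry:
    photo_id: str
    kind: str         # "object", "scene", or "event"
    description: str  # compact caption; the vision model that writes it is out of scope


def to_markdown(entry: MemoryEntry) -> str:
    """Render one entry as a single Markdown bullet line."""
    return f"- [{entry.kind}] {entry.description} (photo: {entry.photo_id})\n"


def append_entries(path: Path, entries: list[MemoryEntry]) -> None:
    """Append entries to the memory file, writing a heading the first time."""
    is_new = not path.exists()
    with path.open("a", encoding="utf-8") as f:
        if is_new:
            f.write("# Gallery memory\n\n")
        for entry in entries:
            f.write(to_markdown(entry))


def search_memory(path: Path, keyword: str) -> list[str]:
    """Return bullet lines containing `keyword` (case-insensitive substring match)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [ln for ln in lines if ln.startswith("- ") and keyword.lower() in ln.lower()]


# Usage: write two entries to a throwaway file and look one up.
with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "memory.md"
    append_entries(memory_file, [
        MemoryEntry("IMG_0012", "scene", "sunset over a beach, two people walking"),
        MemoryEntry("IMG_0013", "object", "bottle of Evian facial spray on a desk"),
    ])
    hits = search_memory(memory_file, "evian")
    header = memory_file.read_text(encoding="utf-8").splitlines()[0]
```

A plain Markdown file keeps the memory human-readable and easy to hand to a language model as context, at the cost of only supporting simple lookups like the one above.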
Every entry passes through a filter designed to strip out sensitive information before it's saved. The report also flags the risk of uploading images to cloud vision models; moving that step to on-device models is next, it says, so raw images never have to leave the phone.
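The report doesn't say what its filter actually matches. Purely as an assumption, a pattern-based redactor run on each description before it is written could look like this; the three patterns (an 18-character PRC resident ID, an 11-digit mainland China mobile number, an email address) and the `redact`/`prepare_entry` names are illustrative choices, and a real filter would need far broader coverage.

```python
import re

# Hypothetical redaction rules, applied in order; not the report's actual filter.
PATTERNS = [
    ("ID_NUMBER", re.compile(r"\b\d{17}[\dXx]\b")),          # 18-character PRC resident ID
    ("PHONE", re.compile(r"\b1[3-9]\d{9}\b")),               # 11-digit mainland China mobile
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),  # simple email shape
]


def redact(text: str) -> tuple[str, int]:
    """Replace each sensitive match with a [LABEL] placeholder; return text and match count."""
    total = 0
    for label, pattern in PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        total += n
    return text, total


def prepare_entry(description: str) -> str:
    """Redact a memory description before it is written to disk."""
    cleaned, _ = redact(description)
    return cleaned


print(redact("Receipt: call 13812345678 or mail li.wei@example.com"))
# ('Receipt: call [PHONE] or mail [EMAIL]', 2)
```

Returning the match count alongside the cleaned text lets a caller decide, for example, to drop an entry entirely when too much of it had to be redacted.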