# Oppo open-sources Android AI agent X-OmniClaw that uses your camera, screen, and voice without leaving the phone

- Source: The Decoder: AI News (RSS)
- Author: Jonathan Kemper
- Published: 2026-05-17 15:39
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmp9hip5r0qw1slnz252aarop
- Original link: https://the-decoder.com/oppo-open-sources-android-ai-agent-x-omniclaw-that-uses-your-camera-screen-and-voice-without-leaving-the-phone

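## AI Summary

Oppo's Multi-X team has released X-OmniClaw, an open-source AI agent that runs directly on Android devices. It combines camera, screen, and voice input to handle tasks in real apps in real time. The system relies mainly on local sensors to carry out actions and hands only reasoning off to the cloud. A user's tap path can be cloned into a reusable skill, so next time the agent can jump straight to a deep page inside an app via deeplink without repeating the steps.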

## Full Text

Oppo's Multi-X team released X-OmniClaw, an open-source agent that taps into the camera, screen, and voice to get things done in real Android apps, all without routing through a cloud copy of your phone.

In the technical report, Oppo's AI Center draws a clear line between its approach and cloud phone platforms like RedFinger, Alibaba's Wuying, and Tencent Cloud Phone. Those services run agents inside virtualized Android instances in a data center. That means they can't touch local sensors, cameras, or private data.

X-OmniClaw takes the opposite route. It runs directly on the physical Android device. The core logic for perception, control, and app interaction lives on the phone itself. A cloud language model only gets called in as "fuel" for higher-level reasoning when needed, the report says. It doesn't name the specific local models involved, but it does list components like an on-device grounding model and OCR for detecting tappable UI elements.

### Camera, screen, and voice feed into a single pipeline

The agent bundles three perception channels into one pipeline. A vision-language model first interprets the scene along with the user's request before triggering any action.

In the researchers' example, a user asks "How much does this cost on Taobao?" while pointing the camera at a product. The system rephrases that internally to "price of Evian spray on Taobao" and only then hands the structured intent off for execution.
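The rewrite step in that example can be sketched roughly as follows. This is a minimal illustration, not the report's implementation: the `StructuredIntent` record, its field names, and the `resolve_intent` helper are hypothetical, and the `scene_label` argument stands in for whatever the vision-language model returns for the camera frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredIntent:
    """Hypothetical shape of the intent handed off for execution."""
    action: str  # e.g. "price_lookup"
    target: str  # object resolved from the camera frame
    app: str     # app named in the spoken request


def resolve_intent(utterance: str, scene_label: str) -> StructuredIntent:
    """Replace the vague "this" with the object the camera sees.

    scene_label is a stand-in for the vision-language model's output.
    The keyword matching below is only a placeholder for that model.
    """
    text = utterance.lower()
    action = "price_lookup" if ("cost" in text or "price" in text) else "unknown"
    app = "Taobao" if "taobao" in text else "unknown"
    return StructuredIntent(action=action, target=scene_label, app=app)


def to_query(intent: StructuredIntent) -> str:
    """Render the intent as the rewritten request string."""
    if intent.action == "price_lookup":
        return f"price of {intent.target} on {intent.app}"
    return intent.target


intent = resolve_intent("How much does this cost on Taobao?", "Evian spray")
print(to_query(intent))  # price of Evian spray on Taobao
```

Keeping the intent as a small record rather than a free-form string would let the execution side pick the app and the search term without parsing text again.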

### Photo gallery becomes searchable memory

For long-term memory, X-OmniClaw condenses local data into semantic entries. During idle time, gallery photos get processed into compact descriptions of objects, scenes, and events, then stored in a Markdown file.

Every entry runs through a filter designed to strip out sensitive info before it's saved. The report flags upload risks tied to cloud vision. Moving to on-device models is the next step, the report says, so raw images never have to leave the phone.
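A rough sketch of that flow, filtering a description and appending it to a Markdown memory file, might look like this. The redaction patterns, the entry layout, and the function names are assumptions made for illustration; the report does not spell out its filter rules or file format beyond naming Markdown.

```python
import re
from pathlib import Path

# Assumed patterns for illustration only; the real filter rules are not published.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{11,19}\b"),             # long digit runs (phone or card numbers)
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
]


def redact(description: str) -> str:
    """Mask anything that matches a sensitive pattern."""
    for pattern in SENSITIVE_PATTERNS:
        description = pattern.sub("[redacted]", description)
    return description


def append_memory(memory_file: Path, photo_name: str, description: str) -> str:
    """Filter one photo description and append it as a Markdown bullet."""
    entry = f"- **{photo_name}**: {redact(description)}\n"
    with memory_file.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


def search_memory(memory_file: Path, keyword: str) -> list[str]:
    """Return every stored entry that mentions the keyword (case-insensitive)."""
    lines = memory_file.read_text(encoding="utf-8").splitlines()
    return [line for line in lines if keyword.lower() in line.lower()]
```

Calling `append_memory(Path("memory.md"), "IMG_001.jpg", "Green parrot on a balcony")` and later `search_memory(Path("memory.md"), "parrot")` would return that entry. The descriptions themselves would come from a vision model, which this sketch leaves out.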

### Cloned tap paths replace step-by-step replays

Instead of planning every action from scratch, the agent clones user behavior into reusable skills. It extracts the full launch command for an app page and jumps there directly via deeplink next time, rather than replaying the original tap path.

If that fails, the system falls back through simpler launch methods one by one. To detect tappable elements, X-OmniClaw combines XML structure data with a grounding model and text recognition. That helps with ad-heavy interfaces where XML alone can't pin down a precise tap target.
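The deeplink-first launch with fallbacks can be sketched as an ordered chain of strategies. This is a minimal illustration under stated assumptions: the report does not name the individual fallback methods, so the strategy names and the stub functions below are hypothetical, and a real launcher would issue Android launch commands instead of returning canned values.

```python
from typing import Callable, Optional, Sequence

Launcher = Callable[[], bool]  # returns True when the target page opened


def launch_with_fallback(
    strategies: Sequence[tuple[str, Launcher]],
) -> Optional[str]:
    """Try each launch strategy in order and return the first one that works."""
    for name, launch in strategies:
        try:
            if launch():
                return name
        except Exception:
            # A failing strategy should not abort the chain; try the next one.
            continue
    return None


# Hypothetical stubs standing in for real launch attempts.
def cloned_deeplink() -> bool:
    raise RuntimeError("target page rejected the launch command")


def plain_app_launch() -> bool:
    return True


used = launch_with_fallback([
    ("cloned_deeplink", cloned_deeplink),
    ("plain_app_launch", plain_app_launch),
])
print(used)  # plain_app_launch
```

Ordering the list from most specific to most generic keeps the fast path first while still guaranteeing that the app opens somewhere useful.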

### From price checks to homework help

In the first scenario, a user points the camera at a product and asks about the price. The agent jumps into the shopping app, scrolls, takes screenshots, and reads out prices and sales figures through a vision-language model. A follow-up like "open the second item" works without any extra grounding.

In another example, X-OmniClaw acts as a "ScreenAvatar," a "digital surrogate" that solves on-screen tasks on command, like working through a series of practice problems one after another.

A third demo shows the system responding to a request to turn all parrot photos into a highlight album. It gathers matching files, jumps via deeplink into a video editing app's one-click composition tool, and selects the images with multi-tap.

In the fourth example, the user clones the path to a deeply nested discount page once. Next time, a voice command is enough to reopen that exact subpage, even if the app doesn't offer public deeplinks.

The project builds on the open-source HermesApp codebase and sits between OpenClaw, which focuses more on PCs, and the emergent-capability-driven Hermes Agent from Nous Research. Code and assets are available on GitHub.

Google recently showed with Gemma 4 that a fully local model on a smartphone can already act as an agent. In the demo app "Google AI Edge Gallery," the model uses agent skills to query Wikipedia, generate QR codes, or open mood trackers with trend charts.

In terms of method, the system builds on ByteDance's UI-TARS, a purely visual GUI agent that relies only on screenshots and coordinates. X-OmniClaw combines that approach with structural XML data and on-device execution to cut down on the error rate that pure vision pipelines hit with dynamic interfaces.
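One way to picture the combination of structural and visual signals is a resolver that prefers a matching XML node and falls back to visually detected candidates when the XML has nothing useful. The sketch below is illustrative only: the `Candidate` record, the source names, and the priority order are assumptions, not details taken from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    source: str                     # "xml", "grounding", or "ocr"
    label: str                      # text or description attached to the element
    box: tuple[int, int, int, int]  # left, top, right, bottom in pixels


def center(box: tuple[int, int, int, int]) -> tuple[int, int]:
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


def pick_tap_point(
    query: str, candidates: list[Candidate]
) -> Optional[tuple[int, int]]:
    """Pick a tap coordinate, trying XML first and visual sources after it.

    The priority order is an assumption made for this sketch.
    """
    q = query.lower()
    matches = [c for c in candidates if q in c.label.lower()]
    for source in ("xml", "grounding", "ocr"):
        for cand in matches:
            if cand.source == source:
                return center(cand.box)
    return None


# An ad-heavy screen: the XML exposes only an unlabeled container,
# while text recognition finds the actual button.
screen = [
    Candidate("xml", "", (0, 800, 1080, 1000)),
    Candidate("ocr", "Buy now", (100, 900, 300, 960)),
]
print(pick_tap_point("buy now", screen))  # (200, 930)
```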
