# Elvis Saravia：多模态提示是智能体交互的未来

- 来源：elvis (@omarsar0)
- 发布时间：2026-07-04 21:52
- AIHOT 分数：55
- AIHOT 链接：https://aihot.virxact.com/items/cmr6g0s2f02p0slf0qhndx9ym
- 原文链接：https://x.com/omarsar0/status/2073404610501329247

## AI 摘要

Elvis Saravia（DAIR.AI）分享多模态提示实验经验。他通过录制语音、屏幕标注、鼠标点击等动作构建多模态“任务”，预处理后传给智能体，使任务完成更高效。该方法节省数小时工作，减少交互挫折感。他将录制任务打包为可复用的技能/工作流，应用于网页开发、设计、原型、研究、模拟等场景。他认为未来原生多模态模型将直接消费这类丰富输入。

## 正文

http://x.com/i/article/2073402070753738752

# Multimodal Prompting for Agents

Multimodal prompting is clearly the future.

I love experimenting with new ways to interact with agents.

As a researcher and engineer， I've found that the richer the inputs to the agent and the richer the outputs I consume， the better the overall results of the collaboration.

https://youtu.be/_rIziQa48wQ?si=n6tS_uWlmzaO2vJ6

In this little walkthrough， I go over what I mean by a multimodal prompt and when you might find it useful.

It's more than simple text prompting， so I call it a "task" for lack of a better word. It helps me record my voice， annotate the screen， click/mouse actions， and more.

Then all of that is preprocessed and passed to the agent to complete the task more efficiently. The agent has the high-level prompt， but it also has the raw transcriptions if needed. So naturally， I am also using this to build out multimodal skills that I reuse in workflows where agents tend to struggle. This has saved me hours of work. And even the older models are pretty great at understanding the tasks more clearly. Some noise is introduced in the process， but it doesn't seem to hurt the performance. I've also found that this new way of prompting has reduced the number of frustrating interactions I have with agents.

This is something I have been thinking about for some time now because we are going to move more into multimodal AI models. And so the interactions are going to evolve with models being able to handle a variety of modalities natively. Currently， I process all the recorded tasks with another model in the background， but it's not crazy to think that all of it will just be naturally consumed by omnimodel in the future.

All of these recorded tasks （which you can also think of as rich annotated datasets） are things I mine and recursively improve over time and， in some cases， package as reusable workflows/patterns/skills.

This process has really elevated how I use coding agents for all kinds of work. I use multimodal prompting in things like web development， designing， artifact creation， prototyping， researching， reading， simulations， AI-assisted writing， and much more. So it's not just about prompting. It's going deeper into understanding and exploring the right level of detail that the agents need to make the right decision and to push/maximize their capabilities.
