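Elvis Saravia (DAIR.AI) shares what he has learned from experimenting with multimodal prompting. He builds multimodal "tasks" by recording his voice, screen annotations, mouse clicks, and other actions, then preprocesses them and passes them to an agent so that tasks get completed more efficiently. He says the approach saves hours of work and reduces the frustration of interacting with agents. He packages recorded tasks into reusable skills/workflows and applies them to web development, design, prototyping, research, simulations, and similar scenarios. He expects future natively multimodal models to consume this kind of rich input directly.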
http://x.com/i/article/2073402070753738752
Multimodal Prompting for Agents
Multimodal prompting is clearly the future.
I love experimenting with new ways to interact with agents.
As a researcher and engineer, I've found that the richer the inputs to the agent and the richer the outputs I consume, the better the overall results of the collaboration.
https://youtu.be/_rIziQa48wQ?si=n6tS_uWlmzaO2vJ6
In this little walkthrough, I go over what I mean by a multimodal prompt and when you might find it useful.
It's more than simple text prompting, so I call it a "task" for lack of a better word. It lets me record my voice, annotate the screen, capture clicks and mouse actions, and more.
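To make the idea of a "task" a bit more concrete, here is a minimal sketch of how such a recording might be represented and flattened into something an agent can read. It is only an illustration under stated assumptions: the `Event` and `Task` classes, their fields, and the `to_prompt` method are hypothetical names, not the tool shown in the video, and the voice input is assumed to be already transcribed to text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One timestamped piece of a recording (hypothetical schema)."""
    t: float      # seconds since the recording started
    kind: str     # "voice", "annotation", or "click"
    detail: str   # transcript text, annotation note, or click target


@dataclass
class Task:
    """A recorded multimodal task: a goal plus the events captured for it."""
    goal: str
    events: List[Event] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Flatten the events into one time-ordered text block for an agent."""
        lines = [f"Goal: {self.goal}"]
        for e in sorted(self.events, key=lambda e: e.t):
            lines.append(f"[{e.t:05.1f}s] {e.kind}: {e.detail}")
        return "\n".join(lines)


# Example data is made up; events are deliberately given out of order.
task = Task(
    goal="Tighten the spacing on the pricing page",
    events=[
        Event(4.2, "click", "button#upgrade"),
        Event(1.0, "voice", "The cards feel too far apart."),
        Event(2.5, "annotation", "circle drawn around the gap between cards"),
    ],
)
prompt = task.to_prompt()
print(prompt)
```

Sorting by timestamp keeps what was said, drawn, and clicked in the order it happened, which is the part a plain text prompt usually loses.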