作者从打字提示转向完全用语音与AI智能体交互,发现通过音频能提供更丰富的细节,语音越长越详细,结果越好。这种交互方式还能并行化更多工作,让智能体执行更长时间任务。作者开发了新功能:录制屏幕、截图、追踪鼠标动作、用语音标注解释智能体难以处理的设计和精确功能开发。结论是提示模态越丰富,智能体结果越可靠,虽然消耗更多token成本更高,但可靠性值得。这些模式可存储为可重用技能,效果天差地别。
Finally caved in, and I now fully speak to agents as opposed to typing prompts.
My first realization is that you can just blabber on and tell the agent so many rich details via audio. The longer and the more detailed the audio explanation, the better the results.
The most interesting thing about interacting with the agent this way is that I can parallelize more work and enable agents to perform way longer runs, implementing many things at once.
In addition, I have developed a new feature where I can record the screen, take screenshots, track mouse actions and movements, annotate, and explain (using voice) to the agent things that it struggles with, like design and precise feature development.
My finding is that the richer the prompt modality, the more reliable the agent results are. The noise (if any) doesn't even matter. Yes, it's more expensive (i.e., lots more tokens used this way), but the reliability that you are getting is worth it.