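The study finds that phone-use agents pose serious privacy risks when carrying out everyday tasks. Evaluated on MyPhoneBench, the best model reached a task completion rate of 82.8%, yet the privacy-qualified score was only 47.6%. The privacy risk stems from "over-helpfulness": to finish a task, a model asks for personal information it does not need, repeatedly discloses data to unrelated components, or overfills optional fields. Claude leads on task success, Kimi is best at privacy protection, and Qwen has the highest overall score. The study shows that benchmarks scored on success rate alone conflate capability with judgment, which is a serious safety hazard on a device as private as a phone.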
This paper asks whether phone-use agents protect your data during ordinary tasks, and finds that they often do not.
The best model completed 82.8% of tasks, but the best privacy-qualified score was only 47.6%.
That gap matters because privacy failure here is not sabotage. It is ordinary over-helpfulness.
A phone agent can finish your food order, book your appointment, or fill your travel form while still asking for a phone number it did not need, re-entering it into a coupon box, or stuffing optional fields with personal details just because the boxes were there.
To measure that behavior, the authors built MyPhoneBench, which logs exactly what agents type, where they type it, and whether any of it was necessary.
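To make the idea of logging what was typed, where, and whether it was needed more concrete, here is a minimal sketch of such an audit. It is not MyPhoneBench's actual schema or scoring: the `InputEvent` record, the `audit` function, the label names, and the sample food-order data are all hypothetical, chosen only to mirror the three failure modes described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    """One logged input action (hypothetical schema, not the benchmark's)."""
    field: str       # UI field the agent typed into
    data_type: str   # kind of personal data typed, e.g. "phone"
    required: bool   # whether the form marked the field as required


def audit(events, needed_types):
    """Label each event as ok, unneeded, repeated, or optional_fill."""
    seen = set()
    findings = []
    for ev in events:
        if ev.data_type not in needed_types:
            label = "unneeded"        # the task never called for this data
        elif ev.data_type in seen:
            label = "repeated"        # same data disclosed a second time
        elif not ev.required:
            label = "optional_fill"   # needed data, but placed in an optional box
        else:
            label = "ok"
        seen.add(ev.data_type)
        findings.append((ev.field, label))
    return findings


# Assumed example: a food order that only needs a name and an address.
events = [
    InputEvent("delivery_address", "address", True),
    InputEvent("coupon_box", "phone", False),
    InputEvent("nickname", "name", False),
]
findings = audit(events, needed_types={"address", "name"})
print(findings)
```

Under these assumptions, the phone number typed into the coupon box is flagged as `unneeded` and the name placed in an optional field as `optional_fill`, even though the order itself could still complete successfully.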