A new paper, "How LLMs See Creativity," tests whether large language models can rate the creativity of images zero-shot and output interpretable reasoning for their scores.
LLMs can look at an image, judge its creativity, and reveal the logic behind the score.
Most models matched human scores fairly well, especially Gemini 3 Flash, which led on both image types.
But the models had clear biases: they rated polished AI images too generously and rough sketches too harshly.
When three of the models showed their reasoning, they mostly talked about what they saw, how original it seemed, its visual quality, and the final score.
So this paper shows that visual creativity scoring can scale, while its biases still need calibration.
----
Link - arxiv.org/abs/2606.29672