# GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-16 08:00
- AIHOT score: 51
- AIHOT link: https://aihot.virxact.com/items/cmqhgiko20474sle10m1zsz3l
- Original link: https://arxiv.org/abs/2606.17861

## AI Summary

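GameCraft-Bench is an end-to-end game generation benchmark built on the Godot engine. It comprises 140 tasks across 15 game families and requires coding agents to turn natural-language descriptions into runnable game artifacts. The evaluation framework centers on engine grounding, artifact completeness, and interactive verification, and it measures executable gameplay quality through replayed demonstrations and rubric-guided multimodal judging. In the evaluation, the strongest agent scores only 41.46%, and most agents score below 40%. Agents can implement recognizable game mechanics, but they generally fall short on complete content, functional visual feedback, and coherent presentation.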

## Main Text

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
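As a rough illustration of how rubric-guided judging could roll up into a single percentage like the ones reported above, here is a minimal sketch. The abstract does not state the actual scoring formula, so everything here is an assumption: the `RubricItem` type, the per-item weights, the binary pass/fail verdicts, and the unweighted mean across tasks are all hypothetical and are not taken from GameCraft-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One rubric criterion with a judge verdict (hypothetical structure)."""
    name: str
    weight: float
    passed: bool  # assumed binary verdict from a multimodal judge


def task_score(items):
    """Fraction of rubric weight earned on one task, in [0, 1]."""
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(item.weight for item in items if item.passed)
    return earned / total


def benchmark_score(tasks):
    """Unweighted mean of task scores, as a percentage (assumed aggregation)."""
    if not tasks:
        raise ValueError("no tasks to score")
    return 100.0 * sum(task_score(items) for items in tasks) / len(tasks)


# Made-up verdicts for two made-up tasks, purely for illustration.
platformer = [
    RubricItem("player can jump", 1.0, True),
    RubricItem("score counter updates", 1.0, False),
]
puzzle = [
    RubricItem("tiles can be swapped", 2.0, True),
    RubricItem("win screen appears", 1.0, True),
    RubricItem("level has at least 3 stages", 1.0, False),
]

print(task_score(platformer))                 # 0.5
print(task_score(puzzle))                     # 0.75
print(benchmark_score([platformer, puzzle]))  # 62.5
```

Under this assumed scheme, an agent that implements the core mechanic but omits content or visual feedback earns partial credit, which is one way a benchmark could separate "recognizable mechanics" from "complete games".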
