# Claw-SWE-Bench：评估OpenClaw风格智能体框架编程能力的多语言基准

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-10 08:00
- AIHOT 分数：68
- AIHOT 链接：https://aihot.virxact.com/items/cmq912fmh07nrslldbcsktdbs
- 原文链接：https://arxiv.org/abs/2606.12344

## AI 摘要

Claw-SWE-Bench是一个多语言SWE-bench风格基准和适配器协议，用于在公平设置下比较通用智能体框架（claws）的编程能力。完整基准包含350个GitHub issue解决实例，覆盖8种语言和43个仓库，来源于SWE-bench-Multilingual和SWE-bench-Verified-Mini。同时发布80实例的Lite子集用于快速验证。在完整基准上，OpenClaw搭配最小适配器仅得19.1% Pass@1，而完整适配器使用相同GLM 5.1骨干达到73.4%，表明适配器设计至关重要。模型选择改变Pass@1达29.4个百分点，框架选择改变27.4个百分点；相似精度的系统总API成本差异巨大。Claw-SWE-Bench将框架和成本核算作为SWE风格编码智能体评估的第一类维度。

## 正文

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw times nine-model sweep and a five-claw times two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
