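NVIDIA Open-SWE-Traces: Building Supervised Fine-Tuning Data with Trajectory Parsing, Patch Analysis, and Token Budgets
Original article: marktechpost.com
This article shows how to stream the nvidia/Open-SWE-Traces dataset from Hugging Face, parse trajectories produced by agents such as openhands and sweagent with the minimax_m25 and qwen35_122b models, normalize multi-turn conversations, and parse the final code patches to count added and deleted lines and the distribution of file extensions. It builds an analysis DataFrame to examine trajectory length, tool calls, patch size, language distribution, and resolution outcomes. High-quality trajectories are then selected into a supervised fine-tuning subset based on success labels, a token limit of MAX_SFT_TOKENS=32000, language filters, and patch availability.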
In this tutorial, we explore the Open-SWE-Traces dataset as a practical resource for studying and preparing agentic software-engineering trajectories for fine-tuning. We stream the dataset directly from Hugging Face, so we can work with a large dataset efficiently in Google Colab without downloading everything locally. We inspect individual records, normalize multi-turn agent conversations, parse final code patches, extract useful metadata, and build an analysis DataFrame to understand trajectory length, tool usage, patch size, language distribution, and resolution outcomes. We then use these insights to create a curated supervised fine-tuning subset that keeps only high-quality trajectories based on success labels, token limits, language filters, and patch availability.
Installing Dependencies and Configuration
import subprocess, sys

def _pip(*pkgs):
    # Install quietly with the interpreter that runs this notebook.
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=False)

_pip("-U", "datasets", "huggingface_hub")
_pip("tiktoken", "pandas", "matplotlib")

import json
import re
import textwrap
from itertools import islice
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 160)
plt.rcParams.update({
    "figure.figsize": (9, 4.6),
    "figure.dpi": 110,
    "axes.grid": True,
    "grid.alpha": 0.25,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "font.size": 11,
    "axes.titlesize": 13,
    "axes.titleweight": "bold",
})
BLUE, ORANGE, GREEN, RED = "#4C72B0", "#DD8452", "#55A868", "#C44E52"

def banner(title):
    line = "=" * 78
    print(f"\n{line}\n {title}\n{line}")

DATASET = "nvidia/Open-SWE-Traces"
AGENTS = ["openhands", "sweagent"]
MODELS = ["minimax_m25", "qwen35_122b"]
SAMPLE_ALL = True            # True: sample every agent x model combination
PER_COMBO = 400              # rows taken per combination when SAMPLE_ALL is True
N_SINGLE = 1500              # rows taken from the first combination otherwise
MAX_SFT_TOKENS = 32000       # drop trajectories whose estimated tokens exceed this
SFT_REQUIRE_RESOLVED = True  # keep only trajectories with resolved == 1
SFT_LANGUAGES = None         # None keeps all languages; else a set of lowercase names
We start by installing and importing the libraries needed for streaming, parsing, analysis, and visualization, and we configure pandas and matplotlib so that tables and plots stay readable in Google Colab. We also define the dataset name, the agents and models to combine, the sample sizes, and the SFT filter settings that control the rest of the tutorial.
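As a quick sanity check on the sampling settings, the sketch below works out how many rows the default configuration can pull into memory. The values are copied from the configuration above; the variable names combos and max_rows are only illustrative.

```python
from itertools import product

# Values copied from the configuration above.
AGENTS = ["openhands", "sweagent"]
MODELS = ["minimax_m25", "qwen35_122b"]
PER_COMBO = 400

# Every (agent, model) pair is sampled separately when SAMPLE_ALL is True.
combos = list(product(AGENTS, MODELS))
max_rows = len(combos) * PER_COMBO

print(len(combos), "combinations, at most", max_rows, "rows")
```

With two agents and two models this gives four combinations and an upper bound of 1,600 rows; a combination that fails to load or holds fewer rows simply contributes less.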
Defining Trajectory Parsing Helpers
def message_text(msg):
    if not isinstance(msg, dict):
        return ""
    content = msg.get("content", "")
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for block in content:
            if isinstance(block, dict):
                parts.append(block.get("text") or block.get("content") or "")
            elif isinstance(block, str):
                parts.append(block)
        return "\n".join(p for p in parts if p)
    return str(content)

def normalize_trajectory(traj):
    if traj is None:
        return []
    if isinstance(traj, str):
        try:
            traj = json.loads(traj)
        except Exception:
            return []
    norm = []
    for msg in traj:
        if isinstance(msg, str):
            try:
                msg = json.loads(msg)
            except Exception:
                msg = {"role": "unknown", "content": msg}
        if isinstance(msg, dict):
            norm.append(msg)
    return norm

def normalize_metadata(meta):
    if isinstance(meta, str):
        try:
            return json.loads(meta)
        except Exception:
            return {}
    return meta if isinstance(meta, dict) else {}

def role_counts(trajectory):
    c = Counter()
    for msg in trajectory or []:
        if isinstance(msg, dict):
            c[msg.get("role", "unknown")] += 1
    return c
_FUNC_XML = re.compile(r"<function\s*=\s*([a-zA-Z0-9_\-]+)", re.IGNORECASE)
_EXEC_TAG = re.compile(r"<(execute_[a-z]+)>", re.IGNORECASE)
_BASH_FENCE = re.compile(r"```(?:bash|sh|shell)\b", re.IGNORECASE)

def extract_tool_names(trajectory):
    names = Counter()
    for msg in trajectory or []:
        if not isinstance(msg, dict):
            continue
        for call in msg.get("tool_calls") or []:
            fn = (call or {}).get("function", {}) if isinstance(call, dict) else {}
            if fn.get("name"):
                names[fn["name"]] += 1
        if msg.get("role") == "tool" and msg.get("name"):
            names[msg["name"]] += 1
        if msg.get("role") == "assistant":
            text = message_text(msg)
            for m in _FUNC_XML.findall(text):
                names[m.lower()] += 1
            for m in _EXEC_TAG.findall(text):
                names[m.lower()] += 1
            if _BASH_FENCE.search(text):
                names["bash_block"] += 1
    return names

def parse_patch(diff_text):
    if not diff_text or not isinstance(diff_text, str):
        return 0, 0, 0, [], Counter()
    files, exts = [], Counter()
    additions = deletions = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            parts = line.split()
            if len(parts) >= 3:
                path = parts[2][2:] if parts[2].startswith("a/") else parts[2]
                files.append(path)
                base = path.split("/")[-1]
                if "." in base:
                    exts[base.rsplit(".", 1)[-1].lower()] += 1
        elif line.startswith("+") and not line.startswith("+++"):
            additions += 1
        elif line.startswith("-") and not line.startswith("---"):
            deletions += 1
    return len(files), additions, deletions, files, exts

def make_token_counter():
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return lambda s: len(enc.encode(s, disallowed_special=()))
    except Exception:
        return lambda s: max(1, len(s) // 4)

count_tokens = make_token_counter()
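We define helper functions that keep processing stable when fields arrive in different formats. They normalize trajectories and metadata that may be stored as JSON strings, extract message text from string or block-list content, count roles, detect tool usage from structured tool_calls and from text patterns, and parse patches into file, addition, and deletion counts. Token lengths are estimates: we use tiktoken's cl100k_base encoding when it is available and fall back to roughly four characters per token otherwise, so the counts may differ from those of the target model's own tokenizer.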
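To make the patch-counting rules concrete, here is a minimal, self-contained sketch that applies the same rules as parse_patch to a tiny hand-written diff. The function name diff_stats and the SAMPLE_DIFF text are made up for illustration and are not part of the dataset.

```python
from collections import Counter

# Hand-written unified diff touching two files (illustrative only).
SAMPLE_DIFF = """\
diff --git a/src/app.py b/src/app.py
--- a/src/app.py
+++ b/src/app.py
@@ -1,3 +1,4 @@
 import os
-print("old")
+print("new")
+print("extra")
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1 +1 @@
-Old title
+New title
"""

def diff_stats(diff_text):
    """Count files, added lines, and removed lines in a unified diff."""
    files, exts = [], Counter()
    added = removed = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            parts = line.split()
            if len(parts) >= 3:
                # "a/src/app.py" -> "src/app.py"
                path = parts[2][2:] if parts[2].startswith("a/") else parts[2]
                files.append(path)
                name = path.split("/")[-1]
                if "." in name:
                    exts[name.rsplit(".", 1)[-1].lower()] += 1
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1      # content line added; "+++" file headers are skipped
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1    # content line removed; "---" file headers are skipped
    return files, added, removed, exts

files, added, removed, exts = diff_stats(SAMPLE_DIFF)
print(files, added, removed, dict(exts))
```

For this input the sketch reports two files, three added lines, and two removed lines. One limitation of the header check is worth knowing: a removed line whose own content begins with "--" appears in the diff as "---..." and is skipped, so the counts are a close approximation rather than an exact diffstat.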
Streaming and Inspecting Trajectories
def stream_take(agent, model, n):
    ds = load_dataset(DATASET, agent, split=model, streaming=True)
    rows = []
    for ex in islice(ds, n):
        ex = dict(ex)
        ex["_agent"], ex["_model"] = agent, model
        rows.append(ex)
    return rows

banner("STEP 1: Streaming trajectories from the Hub")
raw_rows = []
if SAMPLE_ALL:
    combos = [(a, m) for a in AGENTS for m in MODELS]
    for agent, model in combos:
        try:
            part = stream_take(agent, model, PER_COMBO)
            raw_rows.extend(part)
            print(f" ✓ {agent:<10} / {model:<12} -> {len(part):>4} rows")
        except Exception as e:
            print(f" ✗ {agent}/{model} failed: {type(e).__name__}: {e}")
else:
    raw_rows = stream_take(AGENTS[0], MODELS[0], N_SINGLE)
    print(f" ✓ {AGENTS[0]} / {MODELS[0]} -> {len(raw_rows)} rows")
print(f"\n Total rows pulled into memory: {len(raw_rows)}")
assert raw_rows, "No rows streamed; check your internet connection and retry."

banner("STEP 2: Anatomy of a single record")
sample = raw_rows[0]
print("Top-level fields :", list(sample.keys()))
print("instance_id :", sample.get("instance_id"))
print("repo / language :", sample.get("repo"), "/", sample.get("language"))
print("license :", sample.get("license"))
print("resolved (1/0/-1):", sample.get("resolved"))
print("metadata :", normalize_metadata(sample.get("metadata")))
traj0 = normalize_trajectory(sample.get("trajectory"))
print(f"\nTrajectory has {len(traj0)} messages. Role histogram: {dict(role_counts(traj0))}")
print("\n--- Trajectory walkthrough (each message truncated to 240 chars) ---")
for i, msg in enumerate(traj0[:8]):
    role = msg.get("role", "unknown").upper()
    body = " ".join(message_text(msg).split())
    print(f"\n[{i}] {role}")
    print(textwrap.fill(body[:240] + ("…" if len(body) > 240 else ""),
                        width=92, subsequent_indent=" "))
if len(traj0) > 8:
    print(f"\n… (+{len(traj0) - 8} more messages)")
print("\n--- Final patch (model_patch), first 25 lines ---")
print("\n".join((sample.get("model_patch") or "").splitlines()[:25]) or "(empty)")
We stream a sample of Open-SWE-Traces directly from Hugging Face instead of downloading the full dataset. With the default settings we take up to 400 rows from each of the four agent and model combinations, for at most 1,600 rows in memory. We then inspect the structure of a single record, walk through its first eight trajectory messages, and preview the first 25 lines of the final patch to understand what each training example contains.
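The reason streaming plus islice stays cheap is that rows are produced only as they are requested. The sketch below illustrates this with a stand-in generator instead of a real Hub dataset, so it runs offline; fake_stream and its log argument are hypothetical names used only for this demonstration.

```python
from itertools import islice

def fake_stream(total, log):
    """Stand-in for a streaming dataset: yields rows one at a time."""
    for i in range(total):
        log.append(i)  # record which rows were actually produced
        yield {"instance_id": f"row-{i}"}

produced = []
rows = [
    dict(r, _agent="openhands", _model="minimax_m25")  # tag each row, as stream_take does
    for r in islice(fake_stream(1_000_000, produced), 3)
]

print(len(rows), "rows kept,", len(produced), "rows produced by the source")
```

Although the stand-in source could yield a million rows, only the three requested rows are ever generated. A real streaming dataset behaves similarly from the caller's side, though it may fetch data over the network in larger chunks behind the scenes.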
Building the Analysis DataFrame
banner("STEP 3: Building the analysis DataFrame")

def process_example(ex):
    traj = normalize_trajectory(ex.get("trajectory"))
    rc = role_counts(traj)
    nf, add, dele, _files, _exts = parse_patch(ex.get("model_patch"))
    meta = normalize_metadata(ex.get("metadata"))
    full_text = "\n".join(message_text(m) for m in traj)
    return {
        "instance_id": ex.get("instance_id"),
        "repo": ex.get("repo"),
        "language": (ex.get("language") or "unknown").lower(),
        "license": ex.get("license"),
        "resolved": ex.get("resolved"),
        "agent": ex.get("_agent"),
        "model": ex.get("_model"),
        "n_messages": len(traj),
        "n_system": rc.get("system", 0),
        "n_user": rc.get("user", 0),
        "n_assistant": rc.get("assistant", 0),
        "n_tool": rc.get("tool", 0),
        "patch_files": nf,
        "patch_add": add,
        "patch_del": dele,
        "patch_churn": add + dele,
        "traj_tokens": count_tokens(full_text),
        "category": meta.get("category"),
        "meta_files": meta.get("num_modified_files"),
        "meta_lines": meta.get("num_modified_lines"),
        "_tools": extract_tool_names(traj),
    }

records = [process_example(ex) for ex in raw_rows]
df = pd.DataFrame(records)
df["is_resolved"] = (df["resolved"] == 1)
df["known_label"] = df["resolved"].isin([0, 1])
print(f"DataFrame: {df.shape[0]} rows x {df.shape[1]} cols")
print("\nNumeric summary:")
print(df[["n_messages", "n_assistant", "n_tool",
          "patch_files", "patch_churn", "traj_tokens"]].describe().round(1))
We transform the raw streamed records into a pandas DataFrame with one row per trajectory. Each row holds message and role counts, patch churn (added plus deleted lines), an estimated token count, selected metadata fields, and a per-trajectory tool-use counter. We also add two flags: is_resolved marks rows with resolved == 1, and known_label marks rows whose label is 0 or 1, so that rows with any other label can be left out when we compare successful and unsuccessful trajectories.
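The known_label flag matters because the record printout lists three possible values for resolved (1, 0, and -1). The following sketch, using made-up labels, shows how the resolution rate changes depending on whether rows outside 0 and 1 are kept in the denominator.

```python
# Made-up labels for illustration: 1 = resolved, 0 = not resolved, -1 = other.
labels = [1, 1, 0, -1, 1, 0, -1, -1]

# Keep only rows with a definite 0/1 label, as known_label does.
known = [x for x in labels if x in (0, 1)]

rate_known = sum(x == 1 for x in known) / len(known)    # 3 of 5
rate_naive = sum(x == 1 for x in labels) / len(labels)  # 3 of 8

print(f"known-only: {rate_known:.3f}  naive: {rate_naive:.3f}")
```

Here the known-only rate is 0.600 while the naive rate is 0.375, which is why the later comparisons filter on known_label first.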
Visualizing Trajectory Distributions
banner("STEP 4: Distributions & visualizations")
lang_counts = df["language"].value_counts()
print("Trajectories per language:\n", lang_counts.to_string())
ax = lang_counts.plot(kind="bar", color=BLUE)
ax.set_title("Trajectories per language (sample)")
ax.set_xlabel(""); ax.set_ylabel("count")
plt.tight_layout(); plt.show()

known = df[df["known_label"]]
by_lang = (known.groupby("language")["is_resolved"]
           .agg(rate="mean", n="size")
           .query("n >= 25")
           .sort_values("rate", ascending=False))
print("\nResolution rate by language (n>=25):\n", by_lang.round(3).to_string())
if not by_lang.empty:
    ax = by_lang["rate"].plot(kind="bar", color=GREEN)
    ax.set_title("Resolution rate by language")
    ax.set_xlabel(""); ax.set_ylabel("fraction resolved"); ax.set_ylim(0, 1)
    plt.tight_layout(); plt.show()

if known["agent"].nunique() > 1 or known["model"].nunique() > 1:
    pivot = (known.groupby(["agent", "model"])["is_resolved"].mean().unstack())
    print("\nResolution rate by scaffold x model:\n", pivot.round(3).to_string())
    ax = pivot.plot(kind="bar", color=[BLUE, ORANGE])
    ax.set_title("Resolution rate: scaffold x model")
    ax.set_xlabel("agent"); ax.set_ylabel("fraction resolved"); ax.set_ylim(0, 1)
    ax.legend(title="model"); plt.tight_layout(); plt.show()

ax = df["n_messages"].plot(kind="hist", bins=40, color=BLUE, alpha=0.85)
ax.set_title("Messages per trajectory")
ax.set_xlabel("number of messages"); ax.set_ylabel("trajectories")
plt.tight_layout(); plt.show()

churn = df["patch_churn"].clip(upper=df["patch_churn"].quantile(0.97))
ax = churn.plot(kind="hist", bins=40, color=ORANGE, alpha=0.85)
ax.set_title("Patch size: lines changed (clipped at p97)")
ax.set_xlabel("added + deleted lines"); ax.set_ylabel("trajectories")
plt.tight_layout(); plt.show()

if known["is_resolved"].nunique() > 1:
    fig, ax = plt.subplots()
    for flag, color, lab in [(True, GREEN, "resolved"), (False, RED, "unresolved")]:
        sub = known[known["is_resolved"] == flag]
        ax.scatter(sub["n_messages"], sub["traj_tokens"],
                   s=10, alpha=0.4, color=color, label=lab)
    ax.set_title("Trajectory length vs. token size, by outcome")
    ax.set_xlabel("messages"); ax.set_ylabel("estimated tokens")
    ax.legend(); plt.tight_layout(); plt.show()
Analyzing Token Budget Requirements
banner("STEP 5: Token budget (what context window do you need?)")
tok = df["traj_tokens"]
print("Estimated tokens per trajectory, percentiles:")
for p in [50, 75, 90, 95, 99]:
    print(f" p{p:<2}: {int(tok.quantile(p/100)):>8,}")
print(f" max: {int(tok.max()):>8,}")

windows = [8_192, 16_384, 32_768, 65_536, 131_072]
print("\nFraction of trajectories that fit in a given context window:")
for w in windows:
    frac = (tok <= w).mean()
    print(f" {w:>7,} tokens : {frac*100:5.1f}%")

ax = tok.clip(upper=tok.quantile(0.99)).plot(kind="hist", bins=50,
                                             color=BLUE, alpha=0.85)
for w, c in zip([8_192, 32_768, 131_072], [GREEN, ORANGE, RED]):
    if w <= tok.quantile(0.99):
        ax.axvline(w, color=c, ls="--", lw=1.5, label=f"{w//1024}k ctx")
ax.set_title("Trajectory token-length distribution (clipped at p99)")
ax.set_xlabel("estimated tokens"); ax.set_ylabel("trajectories")
ax.legend(); plt.tight_layout(); plt.show()
Measuring Agent Tool Usage
banner("STEP 6: Which tools/actions do the agents use?")
tool_totals = Counter()
for t in df["_tools"]:
    tool_totals.update(t)
top_tools = tool_totals.most_common(12)
if top_tools:
    print("Most frequent agent actions (across the sample):")
    for name, cnt in top_tools:
        print(f" {name:<24} {cnt:>7,}")
    labels, vals = zip(*top_tools)
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.barh(range(len(labels)), vals, color=BLUE)
    ax.set_yticks(range(len(labels))); ax.set_yticklabels(labels)
    ax.invert_yaxis()
    ax.set_title("Top agent actions / tool invocations")
    ax.set_xlabel("count"); plt.tight_layout(); plt.show()
else:
    print("No tool actions detected with the current heuristics.")

if known["is_resolved"].nunique() > 1:
    print("\nMean 'tool' (environment) turns by outcome:")
    print(known.groupby("is_resolved")["n_tool"].mean().round(2).to_string())
We explore the dataset through language counts, resolution rates, scaffold/model comparisons, message-length distributions, patch-size distributions, and token-budget analysis. We visualize how trajectory length, token size, and tool usage vary across the sampled records. We use these plots and summaries to determine which examples are practical to fine-tune under different context-window limits.
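The context-window question reduces to a simple proportion: for each window size, what share of trajectories has an estimated token count at or below it? The sketch below shows that calculation in plain Python on made-up token counts; fraction_within is an illustrative helper name, not part of the tutorial code.

```python
# Made-up token estimates for eight trajectories (illustration only).
token_counts = [3_100, 7_900, 12_500, 18_000, 29_000, 41_000, 70_000, 150_000]

def fraction_within(counts, window):
    """Share of trajectories whose token estimate fits in the window."""
    return sum(c <= window for c in counts) / len(counts)

for window in (8_192, 32_768, 131_072):
    share = fraction_within(token_counts, window)
    print(f"{window:>7,} tokens: {share:.1%}")  # 25.0%, 62.5%, 87.5%
```

On this toy sample an 8k window covers a quarter of the trajectories while a 128k window still misses the longest one. Because the token counts are estimates, it is reasonable to leave some headroom below the real context limit of the model being fine-tuned.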
Building a Curated SFT Subset
banner("STEP 7: Building a curated SFT subset")

def to_chatml(trajectory):
    out = []
    for m in trajectory:
        # Fall back to "unknown" when the role is missing or None.
        role = m.get("role") or "unknown"
        out.append(f"<|im_start|>{role}\n{message_text(m).strip()}<|im_end|>")
    return "\n".join(out)

def passes_filters(rec, raw):
    if SFT_REQUIRE_RESOLVED and rec["resolved"] != 1:
        return False
    if rec["traj_tokens"] > MAX_SFT_TOKENS:
        return False
    if SFT_LANGUAGES is not None and rec["language"] not in SFT_LANGUAGES:
        return False
    if not (raw.get("model_patch") or "").strip():
        return False
    return True

sft_examples = []
for rec, raw in zip(records, raw_rows):
    if not passes_filters(rec, raw):
        continue
    messages = [{"role": m.get("role") or "unknown", "content": message_text(m)}
                for m in normalize_trajectory(raw.get("trajectory"))]
    sft_examples.append({
        "instance_id": rec["instance_id"],
        "repo": rec["repo"],
        "language": rec["language"],
        "agent": rec["agent"],
        "model": rec["model"],
        "messages": messages,
        "text": to_chatml(messages),
        "model_patch": raw.get("model_patch"),
        "approx_tokens": rec["traj_tokens"],
    })

print(f"Kept {len(sft_examples)} / {len(records)} trajectories after filtering")
print(f" filters -> resolved_only={SFT_REQUIRE_RESOLVED}, "
      f"max_tokens={MAX_SFT_TOKENS:,}, languages={SFT_LANGUAGES or 'all'}")
if sft_examples:
    kept = pd.DataFrame(sft_examples)
    print("\nCurated subset by language:\n", kept["language"].value_counts().to_string())
    print("\n--- One formatted SFT example (ChatML, truncated) ---")
    print(sft_examples[0]["text"][:600], "…")

banner("STEP 8: Exporting artifacts")
csv_path = "open_swe_traces_analysis.csv"
df.drop(columns=["_tools"]).to_csv(csv_path, index=False)
print(f" Wrote analysis table -> {csv_path} ({len(df)} rows)")
jsonl_path = "open_swe_sft.jsonl"
with open(jsonl_path, "w", encoding="utf-8") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
print(f" Wrote SFT dataset -> {jsonl_path} ({len(sft_examples)} rows)")
print("\nDone. In Colab, open the Files pane (folder icon, left) to download both.")
print("To load the SFT file later: datasets.load_dataset('json', "
      "data_files='open_swe_sft.jsonl')")
We convert the selected trajectories into an SFT-ready format: a list of standardized message dictionaries plus a ChatML-style text rendering of the same conversation. A trajectory is kept only if it is resolved (when SFT_REQUIRE_RESOLVED is set), its estimated token count does not exceed MAX_SFT_TOKENS, its language passes the optional SFT_LANGUAGES filter, and its model_patch is non-empty. Finally, we export both the analysis CSV and the JSONL SFT dataset for reuse in later fine-tuning workflows.
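To show what one exported record looks like, the sketch below renders a short made-up conversation in the same ChatML-style layout and round-trips it through JSONL using an in-memory buffer in place of a file. The helper name render_chatml and the sample messages are assumptions for illustration only.

```python
import io
import json

def render_chatml(messages):
    """Join messages into one ChatML-style string."""
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content'].strip()}<|im_end|>"
        for m in messages
    )

# Made-up conversation standing in for a real trajectory.
messages = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Fix the failing test.\n"},
    {"role": "assistant", "content": "I will inspect the test file first."},
]
record = {"instance_id": "demo-1", "messages": messages, "text": render_chatml(messages)}

# Write one JSON object per line, then read the lines back.
buffer = io.StringIO()  # stands in for the JSONL file on disk
buffer.write(json.dumps(record, ensure_ascii=False) + "\n")
buffer.seek(0)
loaded = [json.loads(line) for line in buffer]

print(loaded[0]["text"])
```

Keeping both the messages list and the rendered text in each record leaves the choice open later: a trainer that applies its own chat template can consume messages, while the text field is ready for plain language-model training. The ChatML markers used here are one convention among several, so a model-specific template may be more appropriate for a given base model.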
Conclusion
We built a complete workflow that turns Open-SWE-Traces from a large, raw agentic dataset into structured analytics and SFT-ready training data. Along the way we streamed trajectories, inspected agent behavior, measured token budgets, compared scaffolds and models, analyzed patch characteristics, and exported both an analysis table and a JSONL training file. The result is a reusable framework that we can extend with larger samples, language-specific fine-tuning, deeper tool-use analysis, and model-specific chat-template formatting.