# 微软 SkillOpt 仅凭一个训练好的 Markdown 文件即可提升 GPT-5.5 性能

- 来源：The Decoder：AI News（RSS）
- 作者：Jonathan Kemper
- 发布时间：2026-06-13 20:20
- AIHOT 分数：45
- AIHOT 链接：https://aihot.virxact.com/items/cmqcc77in01fbsl9iwcc6y0c9
- 原文链接：https://the-decoder.com/microsofts-skillopt-boosts-gpt-5-5-by-using-nothing-but-a-trained-markdown-file

## AI 摘要

微软与三所中国大学联合开发了 SkillOpt 方法，利用传统模型训练原理优化 AI 智能体的指令文档。仅需一个简单的 Markdown 文件，即可让 GPT-5.5 在程序化任务上提升约 23 分，且该文件能够跨模型和跨 Agent 环境（如 Codex 和 Claude Code）迁移。

## 正文

Microsoft's SkillOpt boosts GPT-5.5 by using nothing but a trained Markdown file

A simple Markdown file is apparently enough to boost GPT-5.5 by more than 20 points on procedural tasks. That's the promise of SkillOpt, a method from Microsoft and three Chinese universities that trains instruction documents for AI agents the same way model weights get trained.

These kinds of instruction documents, known as "skills," are already common in commercial products. Anthropic, for example, added a modular skill system to Claude last year that automatically loads topic-specific instructions, scripts, and resources depending on the task.

Skills typically bundle procedures, tool-use rules, output formats, and known failure patterns, and they've become a standard approach. Until now, according to the Microsoft team's paper, they were either written by hand, generated in a single pass by a language model, or loosely self-revised. None of these approaches behaves like a real optimizer, and none guarantees the skill actually improves.

The skill document becomes a trainable state

SkillOpt treats the skill document as an external, trainable state for a frozen target model. A second, separate language model acts as the optimizer. It reads logs from the agent's runs, spots recurring error and success patterns, and proposes limited edits to the skill: adding, deleting, or replacing individual passages. Each change is only accepted if it performs better on a held-out validation set.

The authors map several deep learning concepts onto the text level. A kind of learning rate caps how many edits can land per step. A scheduler shrinks the step size across epochs. Rejected edits go into a buffer and serve as negative examples for later reflection. A slow update at the end of each epoch preserves stable edit directions across training rounds, similar to how gradient smoothing works in traditional training.

What makes this practical is the clean split between training and deployment. The optimizer model only runs during training, and once that's done, it's out of the picture. At inference time, the target model simply receives a plain Markdown file of 300 to 2,000 tokens as context.

Beating every comparison method consistently

The authors tested their approach on six benchmarks covering search, spreadsheets, document analysis, math, and embodied action. Seven systems served as target models, including GPT-5.5 and the much smaller Qwen3.5-4B. Tasks ran in direct chat as well as in the agent environments Codex and Claude Code.

Across every combination, SkillOpt leads or ties with the best comparison result. That holds against handwritten skills, one-shot LLM-generated skills, and specialized methods like Trace2Skill, TextGrad, GEPA, and EvoSkill. On GPT-5.5 in direct chat, the average across all six benchmarks jumps by about 23 points.

The biggest gains show up on tasks with strict format requirements and tool use, like spreadsheet editing. Smaller models benefit too, which the authors take as evidence that a well-trained skill delivers procedural knowledge these models lack in their weights.

Skills transfer across models and environments

One key finding is transferability. A skill trained on a larger model also improves smaller models in the same family. A spreadsheet skill trained in the Codex loop works unchanged in Claude Code, lifting performance there to the same level as a skill trained directly in Claude Code. A math skill optimized on olympiad problems still delivers gains on a related benchmark without any retraining.

The ablation studies explain why the method stays stable. Without a bounded edit budget, the skill drifts too far with each revision. Without the buffer for rejected edits, the optimizer repeats the same failed attempts.

Removing the slow update at epoch's end costs SpreadsheetBench more than twenty points, the largest drop in the entire experiment. Only the combination of bounded step size, validation gating, negative feedback, and long-term consolidation makes skill training behave like a controlled optimization process, the authors say.

Short, readable documents do the heavy lifting

The final skills stay compact: The finished documents rarely exceed 2,000 tokens, and the improvements result from just one to four accepted edits across four training epochs. On OfficeQA, the largest gain came from a single accepted change.

The learned rules read as if an experienced practitioner had jotted them down after a day working with the benchmark. For spreadsheets, the skill learns to check the worksheet structure first and write directly evaluated values into the entire target range instead of using Excel formulas.

For ALFWorld, it keeps a log of visited locations and avoids heading to the goal before picking up the target object. For document questions, it anchors the question to the right table row before accepting an answer. None of these rules refer to a specific task. They describe procedures.

The authors acknowledge that the method depends on reliable automatic scoring. For open-ended tasks where success is hard to measure, the validation step would need human or model-based judgments. SkillOpt also deliberately optimizes a single document rather than a skill library, which could become a bottleneck for highly varied domains.

Where SkillOpt fits in the self-improvement race

While most current self-improvement approaches eventually tweak model weights, SkillOpt takes a remarkably lean path. OpenClaw-RL, a framework from Princeton researchers, uses follow-up signals from every interaction—like user responses or test results—as a live training source.

MetaClaw pulls compact behavioral rules from failed tasks and injects them into the prompt, updating weights only during idle phases via reinforcement learning. One parallel to SkillOpt: weaker models benefit the most in both cases because they lack procedural knowledge that a rule or skill can supply directly.

Other groups go further. AutoTTS lets a coding agent search for better reasoning control algorithms on its own, shifting the human role from designing rules to designing the environment. Meta's Hyperagents optimize the very mechanism they use to improve themselves. SkillOpt, by contrast, keeps the model frozen and changes nothing but a readable text file.

AI News Without the Hype – Curated by Humans