# GLM 5.2 Tops PostTrainBench with a Score of 34.29%

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-06-26 11:55
- AIHOT score: 43
- AIHOT link: https://aihot.virxact.com/items/cmqufekbs03xlsl80r4ycor7f
- Original link: https://x.com/rohanpaul_ai/status/2070355272892395887

## AI Summary

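GLM 5.2 ranks first on PostTrainBench with a score of 34.29%. The benchmark tests whether an AI agent can actually train and improve a raw LLM: the agent receives 4 small base models, 1 H100 GPU, and 10 hours, and must autonomously choose training data, write training code, run fine-tuning, fix failures, and submit the improved models. GLM 5.2 acts as the agent controlling the training process, and the evaluation measures whether it can improve 4 weaker LLMs under these constraints. Official instruct models currently score 51.14%, which shows that agent-driven post-training still lags behind more mature, human-tuned pipelines.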

## Body

GLM 5.2 just took the top spot on PostTrainBench by scoring 34.29%.

PostTrainBench tests whether an AI agent can take a raw LLM and make it better by actually training it, not by answering the benchmark questions itself.

The agent gets 4 small base models, 1 H100 GPU, and 10 hours. It must then choose training data, write training code, run experiments, fix broken runs, and submit improved versions of those models.

In this case, GLM 5.2 was the agent model controlling the training process. PostTrainBench did not score GLM 5.2's own answers; it scored whether GLM 5.2 could take 4 weaker LLMs and improve them within 10 hours on 1 H100.

The gap to official instruct models, which score 51.14%, still shows how far agents are from mature post-training pipelines built with more data, compute, and human tuning.
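
To put the two reported numbers side by side, here is a minimal Python sketch of the arithmetic. The variable names are illustrative only, and the snippet assumes nothing about how PostTrainBench aggregates its score internally; it simply compares the two percentages quoted in the post.

```python
# Scores as reported in the post, in percent.
agent_score = 34.29      # GLM 5.2 acting as the training agent
instruct_score = 51.14   # official instruct models

# Absolute gap in percentage points.
gap_points = instruct_score - agent_score

# Share of the instruct-model score that the agent-trained models reach.
fraction_of_reference = agent_score / instruct_score

print(f"gap: {gap_points:.2f} percentage points")
print(f"agent reaches {fraction_of_reference:.1%} of the instruct-model score")
```

On these figures the gap is 16.85 percentage points, so the agent-trained models reach roughly two thirds of the instruct-model score.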

GLM 5.2's job was to write training code, pick or make training data, run fine-tuning, fix failed runs, and submit the newly trained models for testing.
