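Opik has released Test Suites, a feature that turns real failed traces from production into regression tests. Expected behavior is defined through human-written assertions (such as "the reply is concise" or "ask before acting") rather than simple string matching. Teams can integrate the tests into their CI pipeline to automatically detect behavioral regressions when the code changes. This approach moves AI agent quality evaluation from subjective intuition to repeatable verification grounded in real evidence, and avoids accidentally breaking other scenarios while fixing a single problem.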
Opik just launched Test Suites, a way to turn real agent traces into regression tests so teams can catch behavior breakage before shipping changes.
The problem is that agent failures are rarely one clean bug: fixing one answer style, tool call, or retrieval path can quietly break behavior for other users.
Opik's approach is to treat a bad production trace as the test case, then attach a human-written assertion that states the behavior you actually want.
That matters because agent quality is usually fuzzy, so a behavioral check like "3 sentences or fewer" is often more useful than an exact-match unit test.
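To make the contrast concrete, here is a minimal sketch of a "3 sentences or fewer" check written as a plain Python predicate. This is not Opik's API: the function names `sentence_count` and `is_concise` are hypothetical, and the sentence splitter is deliberately naive (it would miscount abbreviations such as "e.g.").

```python
import re


def sentence_count(text: str) -> int:
    """Count sentences using a naive split on ., ! and ? (illustrative only)."""
    parts = re.split(r"[.!?]+", text)
    # Ignore empty fragments, such as the one left after the final period.
    return sum(1 for part in parts if part.strip())


def is_concise(reply: str, max_sentences: int = 3) -> bool:
    """Hypothetical behavioral assertion: the reply has at most max_sentences."""
    return sentence_count(reply) <= max_sentences


# Two differently worded replies both satisfy the behavioral check,
# while an exact string comparison would accept only one of them.
reply_a = "Your refund was issued. It should arrive in 3 to 5 days."
reply_b = "We issued your refund today. Expect it within a week."
print(is_concise(reply_a), is_concise(reply_b), reply_a == reply_b)
```

The check tolerates any wording that meets the stated behavior, which is the property an exact-match test lacks.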
The workflow is simple: find a failure, write the assertion, save both in a Test Suite, then run that suite in CI every time the agent changes.
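The workflow above can be sketched in plain Python, assuming nothing about Opik's actual SDK. Every name here (`RegressionCase`, `run_suite`, `asks_before_acting`, `stub_agent`) is hypothetical, the stub stands in for a real agent call, and the predicate is a crude stand-in for a human-written assertion.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RegressionCase:
    """One saved failure: the input from a bad trace plus the behavior we want."""
    name: str
    user_input: str                     # input captured from the failing trace
    assertion: Callable[[str], bool]    # returns True when the reply is acceptable
    description: str                    # the human-written statement of intent


def run_suite(agent: Callable[[str], str], cases: List[RegressionCase]) -> List[str]:
    """Run every case against the agent and return the names of failing cases."""
    failures = []
    for case in cases:
        reply = agent(case.user_input)
        if not case.assertion(reply):
            failures.append(case.name)
    return failures


def asks_before_acting(reply: str) -> bool:
    # Crude proxy for "ask before acting": the reply contains a question.
    return "?" in reply


def stub_agent(user_input: str) -> str:
    # Placeholder for the real agent; a real suite would call the deployed agent.
    return "Which order would you like to cancel?"


SUITE = [
    RegressionCase(
        name="cancel-without-order-id",
        user_input="Cancel my order",
        assertion=asks_before_acting,
        description="Agent should ask which order before cancelling anything.",
    ),
]


def main() -> int:
    """Return a process exit code: 0 if all cases pass, 1 otherwise."""
    failed = run_suite(stub_agent, SUITE)
    for name in failed:
        print(f"FAILED: {name}")
    return 1 if failed else 0
```

In a CI job, a script like this would typically end with `raise SystemExit(main())` so that a nonzero exit code fails the build whenever a saved case regresses.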
This could give agent teams something they badly need: a repeatable way to improve behavior with evidence from real usage instead of gut feel.