DataClaw_0-9B:从原始流中智能体化定制多模态数据
阅读原文· arxiv.orgDataClaw_0-9B提出主动智能体化数据定制范式,将数据处理提升为可学习能力。通过两阶段pipeline将生成语义合成锚定于确定性事实锚点,构建覆盖五个物理与数字域的大规模数据集,并采用SFT与GRPO实现与复杂定制意图对齐。同时构建首个数据精炼基准DataClaw_0-val,在视频生成、真实世界VQA与GUI导航下游任务中验证了其提供高信息密度数据的能力。
Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms, heavily reliant on heuristic rules or general VLMs, are costly, monotonous, and fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability, proposing a paradigm shift towards Agentic Data Tailoring, which actively refining and structuring data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline grounding generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building upon this, DataClaw_0-9B model synergizes Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), achieving robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw_0-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw_0 delivers high-information-density tailored data, facilitating efficient model adaptation to new tasks under limited training data regimes. Project page: https://czjdsg.github.io/MakeAnyData