Show, Don't Tell: TELL, an Explainable System for Detecting AI-Generated Text
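Source: arxiv.org

Existing AI-text detectors output only a score with no explanation, which makes them hard to use in settings such as teaching. To address this, the research team proposes the TELL architecture. The system is designed to show users the "tells" suggesting that a text was written by AI or by a human, so that users can reach a decision based on their own judgment. TELL is trained on a dataset of domain-specific authorship annotations and optimized with GRPO and curriculum learning. While performing comparably to state-of-the-art detectors, the system natively outputs explanatory annotations. In a human evaluation, its explanations achieved a mean win rate of 72.3% across several dimensions, including concreteness and falsifiability.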
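The summary names GRPO and curriculum learning without describing them. The sketch below illustrates two ideas under stated assumptions. First, GRPO (Group Relative Policy Optimization) scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value baseline. Second, a curriculum presents training examples from easy to hard. The paper's reward design, group size, normalization details, and difficulty measure are not given here. `group_relative_advantages`, `curriculum_order`, and the difficulty scores are hypothetical, and this is not TELL's training code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its sampling group.

    Assumption: population standard deviation plus a small epsilon;
    real implementations differ in these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def curriculum_order(examples, difficulty):
    """Order training examples from easiest to hardest.

    `difficulty` is a hypothetical callable mapping an example to a score;
    lower scores are treated as easier and are presented first.
    """
    return sorted(examples, key=difficulty)


if __name__ == "__main__":
    # Four sampled completions for one prompt, with a hypothetical 0/1 reward
    # (e.g. whether the predicted authorship label was correct).
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Hypothetical difficulty scores for three training texts.
    scores = {"text_a": 0.1, "text_b": 0.5, "text_c": 0.9}
    print(curriculum_order(["text_b", "text_a", "text_c"], scores.get))
```

If every completion in a group receives the same reward, every advantage is zero, so that group contributes no learning signal. This is one common motivation for ordering data so that the model sees examples it can partially solve.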
Research on AI-generated text detection has produced a number of approaches for distinguishing human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that builds explainability in from the ground up. Although our system still outputs a numerical score like other detectors, for comparability, TELL takes a fundamentally different approach: it shows the user the "tells" by which the model believes a text is AI- or human-written, empowering the user to decide who wrote a text using their own judgment and their understanding of the context of the writing and its alleged author. We train TELL on a custom supervised fine-tuning (SFT) dataset of domain-specific authorship annotations, and further refine the system with Group Relative Policy Optimization (GRPO) and curriculum learning to improve performance. We achieve performance competitive with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations on a dataset of human annotations and report a high mean win rate (72.3%) on annotation concreteness, falsifiability, coherence, plausibility, and grounding, allowing users to think critically and decide for themselves. Our work thus reframes AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.
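The reported AUROC of 0.927 summarizes how well the detector's numerical score ranks AI-written texts above human-written ones. The sketch below computes AUROC from its standard pairwise (Mann-Whitney) definition: the probability that a randomly chosen AI-written text receives a higher score than a randomly chosen human-written text, with ties counting as half. It assumes label `1` means AI-written and `0` means human-written. The function name and toy inputs are illustrative, and this is not the authors' evaluation code.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison (Mann-Whitney U statistic, normalized).

    Assumption: labels[i] == 1 marks an AI-written text (positive class),
    labels[i] == 0 marks a human-written text; higher score = "more AI".
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one AI-written and one human-written text")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # AI text correctly ranked above human text
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy scores: one AI text (0.4) ranks below one human text (0.6).
    print(auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

This quadratic version is meant for clarity. For large evaluation sets, a rank-based or library implementation computes the same quantity more efficiently. AUROC only measures how well the score ranks texts, which is why TELL's explanations are evaluated separately through human win rates.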