# Scalable and Efficient Provenance Tracking for LLM-Generated Code Snippets

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 08:00
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmppete0j0eibslv4hvg1djum
- Original link: https://arxiv.org/abs/2605.28510

## AI Summary

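Code generated by large language models may unintentionally reproduce training data and raise copyright concerns. To address this, the study proposes the SOURCETRACKER encoder and the hybrid provenance-tracking pipeline HYBRIDSOURCETRACKER. The system is trained and evaluated on a subset of the THESTACKV2 dataset. In a 100k-snippet search space that includes adapted snippets, it consistently outperforms the classical Winnowing algorithm by up to 5.4% for windows of 60 tokens or more, while preserving logarithmic-time query complexity. An LLM-based evaluation shows that many retrieved snippets are still highly similar to the expected source code and therefore remain useful in practice.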

## Main Text

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but inspection requires comparing code fragments against the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
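The two-stage idea described above (vector search to shortlist candidates, then Winnowing to re-rank them) can be illustrated with a small sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: `embed` is a toy hashed character n-gram vector standing in for the SOURCETRACKER encoder, the candidate stage is a brute-force cosine scan rather than an approximate nearest-neighbour index (which is what would be needed for sub-linear queries), and the re-ranking score is the fraction of query fingerprints found in each candidate. The function names and the parameters `k`, `w`, `dim`, and `top_k` are hypothetical. A small helper for mean reciprocal rank, the metric cited in the abstract, is included.

```python
import hashlib
import math


def stable_hash(text: str) -> int:
    """32-bit hash that is stable across runs (built-in hash() is salted)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def kgram_hashes(code: str, k: int) -> list[int]:
    """Hashes of all character k-grams after removing whitespace."""
    text = "".join(code.split())
    return [stable_hash(text[i:i + k]) for i in range(max(len(text) - k + 1, 1))]


def winnow(code: str, k: int = 5, w: int = 4) -> set[int]:
    """Winnowing fingerprints: the minimum hash of each window of w k-gram hashes."""
    hashes = kgram_hashes(code, k)
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def embed(code: str, dim: int = 256, n: int = 3) -> list[float]:
    """Toy stand-in for a learned encoder (assumption): hashed n-gram counts, L2-normalised."""
    vec = [0.0] * dim
    for h in kgram_hashes(code, n):
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def hybrid_search(query: str, corpus: list[str], top_k: int = 3) -> list[int]:
    """Return corpus indices: shortlist by cosine similarity, re-rank by fingerprint overlap."""
    # Stage 1: vector search (brute force here; an ANN index would replace this scan).
    q_vec = embed(query)
    sims = [sum(a * b for a, b in zip(q_vec, embed(doc))) for doc in corpus]
    candidates = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:top_k]

    # Stage 2: re-rank the shortlist by the share of query fingerprints each candidate contains.
    q_fp = winnow(query)

    def containment(i: int) -> float:
        return len(q_fp & winnow(corpus[i])) / len(q_fp)

    return sorted(candidates, key=containment, reverse=True)


def mean_reciprocal_rank(rankings: list[list[int]], truths: list[int]) -> float:
    """Average of 1/rank of the true source; contributes 0 when it is not retrieved."""
    total = sum(1.0 / (r.index(t) + 1) if t in r else 0.0 for r, t in zip(rankings, truths))
    return total / len(truths)


corpus = [
    "def add(a, b):\n    return a + b\n",
    "for (int i = 0; i < n; i++) { total += values[i]; }",
    "def read_lines(path):\n    with open(path) as handle:\n"
    "        return [line.strip() for line in handle]\n",
    "SELECT name, email FROM users WHERE active = 1 ORDER BY name;",
]

# A verbatim fragment of the third snippet plays the role of LLM output.
query = corpus[2][22:80]
ranking = hybrid_search(query, corpus, top_k=2)
print(ranking, mean_reciprocal_rank([ranking], [2]))
```

Only the shortlist is fingerprinted at query time, so the cost of exact matching depends on `top_k` rather than on corpus size. The sketch handles verbatim fragments; robustness to identifier renaming, as in the adapted queries above, would depend on the encoder and is not demonstrated here.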
