Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

2026-04-20 08:00 · 74 days ago

AI Summary

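Geometric stability offers a dual diagnostic for language model deployment. Supervised Shesha, which measures task-aligned representational stability, predicts linear steerability across 35-69 models with correlations of 0.89-0.97. Unsupervised stability fails at steerability prediction (ρ ≈ 0.10) but excels at drift detection: it captures nearly 2× more geometric change than CKA (up to 5.23× in Llama), gives earlier warning in 73% of models, and has a 6× lower false alarm rate than Procrustes. The two are suited to pre-deployment controllability assessment and post-deployment monitoring, respectively.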

Original text · untranslated

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62-0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
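To make "consistency of a representation's pairwise distance structure" concrete, here is a minimal sketch of one way an unsupervised stability score could be computed. It is not the paper's Shesha implementation: the split-half scheme over feature dimensions, the Euclidean distance, the Spearman comparison, and every function name (`pairwise_distances`, `split_half_stability`, and so on) are assumptions chosen for illustration.

```python
import math
import random


def pairwise_distances(points):
    """Upper-triangle Euclidean distances between all pairs of rows."""
    dists = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dists.append(math.dist(points[i], points[j]))
    return dists


def ranks(values):
    """Average ranks (tied values share the mean rank)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return 0.0  # a constant input has no defined rank correlation
    return cov / math.sqrt(va * vb)


def split_half_stability(embeddings, n_splits=20, seed=0):
    """Hypothetical stability score (an assumption, not the paper's definition).

    Randomly split the feature dimensions into two halves, compute the
    pairwise distances among the same samples in each half, and average
    the Spearman correlation between the two distance sets over splits.
    """
    rng = random.Random(seed)
    dims = list(range(len(embeddings[0])))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(dims)
        half = len(dims) // 2
        left, right = dims[:half], dims[half:]
        d_left = pairwise_distances([[row[d] for d in left] for row in embeddings])
        d_right = pairwise_distances([[row[d] for d in right] for row in embeddings])
        scores.append(spearman(d_left, d_right))
    return sum(scores) / len(scores)
```

Under this sketch, embeddings whose dimensions redundantly encode the same sample-to-sample geometry score close to 1, while embeddings of independent noise score close to 0. A supervised, task-aligned variant of the kind the abstract describes would additionally need labels, which this example does not attempt to model.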

HuggingFace Daily Papers (community trending papers)

Read the original · arxiv.org
