Mechanistic Anomaly Detection Research Update

2024-08-06

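AI summary

This is an interim progress report on a mechanistic anomaly detection research project, summarizing work currently under way in the area. It focuses on the technical approach of identifying anomalous behavior by understanding a model's internal mechanisms, but it does not yet disclose specific technical breakthroughs, experimental data, or performance metrics. A later, complete version is expected to provide more detailed methodology and empirical results.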

Results
  Online detectors
    Aggregated AUROC by online score and features: all datasets
    Aggregated AUROC by online score and features: by dataset
    Layerwise AUROC by online score and features: by dataset
  Offline detectors
    Aggregated AUROC by offline score and features: all datasets
    Aggregated AUROC by offline score and features: by dataset
    Layerwise AUROC by offline score and features: by dataset
Adversarial image detection
Visualising features
  Population
    Activations: Layer 1, Layer 16, Layer 28
    Attention head mean ablations: Layer 1, Layer 16, Layer 28
    Probe shift: Layer 4, Layer 16, Layer 28
  Sentiment
    Activations: Layer 1, Layer 16, Layer 28
    Attention head mean ablations: Layer 1, Layer 16, Layer 28
    Probe shift: Layer 4, Layer 16, Layer 28
  Discovering functional elements of the network with edge attribution patching

EleutherAI: Blog

Original post: blog.eleuther.ai
