# PsychoSafe: Guiding Large Language Models to Generate Psychologically Informed Refusals

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-09 00:19
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmq7u3iet01pgslepe9rkhkzm
- Original link: https://arxiv.org/abs/2606.09697

## AI Summary

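PsychoSafe is a psychologically informed refusal framework that reframes LLM refusals as structured supportive communication grounded in evidence-based intervention strategies. The authors construct a corpus of 8019 prompt-response pairs covering five high-risk psychological domains and apply prompt engineering and parameter-efficient fine-tuning to Qwen 3.5 27B. On a 500-prompt validation set, PsychoSafe prompting improves refusal quality by 28.1% over a generic baseline, including gains of 46.8% in external resource referral and 34.8% in psychological grounding, without harming performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization.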

## Main Text

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.
