SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

2507.16856v1

English Title#

SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

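Abstract (translated from Chinese)#

As vision-language models (VLMs) are increasingly deployed in real-world applications, the subtle interplay between images and text introduces new safety risks. In particular, seemingly harmless inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite growing attention to multimodal safety, prior methods based on post hoc filtering or static refusal prompts have difficulty detecting such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA uses a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA adapts dynamically to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we show that SIA achieves substantial safety improvements and outperforms prior methods. Although SIA shows a slight drop in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.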

English Abstract#

As vision-language models (VLMs) are increasingly deployed in real-world applications, new safety risks arise from the subtle interplay between images and text. In particular, seemingly innocuous inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite increasing attention to multimodal safety, previous approaches based on post hoc filtering or static refusal prompts struggle to detect such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA employs a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA dynamically adapts to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we demonstrate that SIA achieves substantial safety improvements, outperforming prior methods. Although SIA shows a minor reduction in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.
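The three-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `VLM` callable interface, the `sia_pipeline` function, the `SIAResult` type, the prompt wording, and the placeholder few-shot example are all hypothetical. The abstract also does not say whether stage 2 sees the image or only the caption; the sketch assumes caption-only.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical model interface (an assumption, not from the paper):
# takes a text prompt and an optional image, returns generated text.
VLM = Callable[[str, Optional[bytes]], str]


@dataclass
class SIAResult:
    caption: str   # stage 1 output
    intent: str    # stage 2 output
    response: str  # stage 3 output


# Placeholder few-shot demonstration; the paper's actual examples are not
# given in the abstract, so this text is purely illustrative.
FEW_SHOT: Sequence[str] = (
    "Caption: A kitchen counter with several cleaning products.\n"
    "Question: Which of these can I mix together?\n"
    "Reasoning: The user may be asking about safe cleaning, but mixing "
    "chemicals can be dangerous, so the answer should stress safety.",
)


def sia_pipeline(
    vlm: VLM,
    image: bytes,
    question: str,
    few_shot: Sequence[str] = FEW_SHOT,
) -> SIAResult:
    # Stage 1: visual abstraction via captioning.
    caption = vlm("Describe the image in detail.", image).strip()

    # Stage 2: intent inference through few-shot chain-of-thought prompting.
    # Assumption: the model reasons over the caption and question only.
    intent_prompt = "\n\n".join(
        [
            *few_shot,
            f"Caption: {caption}\n"
            f"Question: {question}\n"
            "Reasoning: Let's think step by step about what the user intends.",
        ]
    )
    intent = vlm(intent_prompt, None).strip()

    # Stage 3: intent-conditioned response generation.
    response_prompt = (
        f"Inferred user intent: {intent}\n"
        f"Question: {question}\n"
        "Answer helpfully if the intent is benign; otherwise decline or "
        "redirect safely."
    )
    response = vlm(response_prompt, image).strip()

    return SIAResult(caption, intent, response)
```

Because the pipeline only issues prompts, any model that fits the assumed `VLM` signature could be plugged in without retraining, which matches the training-free framing in the abstract.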
