SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

2507.16856v1

Chinese Title

SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

English Title

SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

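Chinese Abstract

As vision-language models (VLMs) are increasingly deployed in real-world applications, the subtle interplay between images and text introduces new safety risks. In particular, seemingly harmless inputs can combine to reveal harmful intent and lead to unsafe model responses. Although attention to multimodal safety keeps growing, earlier approaches based on post hoc filtering or static refusal prompts have difficulty detecting such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA uses a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA adapts dynamically to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we show that SIA achieves substantial safety improvements and outperforms prior methods. Although SIA's general reasoning accuracy on MMStar drops slightly, the corresponding safety gains highlight the value of intent-aware reasoning for aligning VLMs with human-centric values.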

English Abstract

As vision-language models (VLMs) are increasingly deployed in real-world applications, new safety risks arise from the subtle interplay between images and text. In particular, seemingly innocuous inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite increasing attention to multimodal safety, previous approaches based on post hoc filtering or static refusal prompts struggle to detect such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA employs a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA dynamically adapts to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we demonstrate that SIA achieves substantial safety improvements, outperforming prior methods. Although SIA shows a minor reduction in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.
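The three-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `VLM` callable interface, the prompt wording, the single few-shot exemplar, and the `fake_vlm` stand-in are all assumptions made here for the example, and the abstract does not say whether the intent-inference stage also sees the image (this sketch passes only the caption).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model interface (an assumption, not a real library API):
# takes a text prompt plus optional image bytes and returns generated text.
VLM = Callable[[str, Optional[bytes]], str]

# Illustrative few-shot exemplar; the abstract does not list the real ones.
FEW_SHOT = (
    "Caption: A person standing at the edge of a high rooftop.\n"
    "Question: How do I get a closer look at the street below?\n"
    "Reasoning: The scene is a fall hazard, and the request encourages "
    "moving toward it.\n"
    "Intent: potentially harmful\n\n"
)


@dataclass
class SIAResult:
    caption: str
    intent: str
    response: str


def sia_pipeline(vlm: VLM, image: bytes, question: str) -> SIAResult:
    # Stage 1: visual abstraction via captioning.
    caption = vlm("Describe this image objectively and in detail.", image)

    # Stage 2: intent inference with few-shot chain-of-thought prompting.
    # Assumption: this stage reasons over the caption text only.
    intent_prompt = (
        f"{FEW_SHOT}Caption: {caption}\nQuestion: {question}\nReasoning:"
    )
    intent = vlm(intent_prompt, None)

    # Stage 3: intent-conditioned response refinement.
    answer_prompt = (
        f"Image caption: {caption}\n"
        f"Inferred intent: {intent}\n"
        f"User question: {question}\n"
        "Answer helpfully if the intent is benign; otherwise respond safely."
    )
    response = vlm(answer_prompt, image)
    return SIAResult(caption=caption, intent=intent, response=response)


def fake_vlm(prompt: str, image: Optional[bytes]) -> str:
    """Canned stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Describe"):
        return "Bottles of bleach and ammonia on a kitchen counter."
    if prompt.endswith("Reasoning:"):
        return "Mixing these releases toxic gas. Intent: potentially harmful"
    return "I can't help with combining these; store them separately."


result = sia_pipeline(fake_vlm, b"<image bytes>", "What happens if I mix these?")
print(result.intent)
print(result.response)
```

Because the pipeline only needs a callable, any model client could in principle be wrapped to match the assumed `VLM` signature; the framework being training-free is what makes this kind of prompt-only wrapper plausible.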

PDF Access

View Chinese PDF - 2507.16856v1
