Memory-Anchored Multimodal Reasoning for Explainable Video Forensics

2508.14581v1

English Title#

Memory-Anchored Multimodal Reasoning for Explainable Video Forensics

English Abstract#

We address multimodal deepfake detection, which requires both robustness and interpretability, by proposing FakeHunter, a unified framework that combines memory-guided retrieval, a structured Observation-Thought-Action reasoning loop, and adaptive forensic tool invocation. Visual representations from a Contrastive Language-Image Pretraining (CLIP) model and audio representations from a Contrastive Language-Audio Pretraining (CLAP) model retrieve semantically aligned authentic exemplars from a large-scale memory, providing contextual anchors that guide iterative localization and explanation of suspected manipulations. Under low internal confidence, the framework selectively triggers fine-grained analyses, such as spatial region zoom and mel spectrogram inspection, to gather discriminative evidence instead of relying on opaque marginal scores. We also release X-AVFake, a comprehensive audio-visual forgery benchmark with fine-grained annotations of manipulation type, affected region or entity, reasoning category, and explanatory justification, designed to stress contextual grounding and explanation fidelity. Extensive experiments show that FakeHunter surpasses strong multimodal baselines, and ablation studies confirm that both contextual retrieval and selective tool activation are indispensable for improved robustness and explanatory precision.
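
The retrieval and confidence-gated tool use described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. Plain float lists stand in for CLIP and CLAP embeddings, and the names `retrieve_anchors`, `run_loop`, `judge`, the tool stubs, and the `0.8` threshold are all assumptions made for illustration. The sketch shows only the control flow: retrieve the most similar authentic exemplars, form an initial verdict, and call a forensic tool only while confidence stays below the threshold.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Verdict = Tuple[str, float]  # (label, confidence in [0, 1])


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_anchors(
    query: Vector, memory: Dict[str, Vector], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k authentic exemplars most similar to the query."""
    scored = [(name, cosine(query, emb)) for name, emb in memory.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def run_loop(
    query: Vector,
    memory: Dict[str, Vector],
    judge: Callable[[Vector, List[Tuple[str, float]]], Verdict],
    tools: Dict[str, Callable[[Vector], Verdict]],
    threshold: float = 0.8,  # hypothetical cut-off, not from the paper
    k: int = 2,
) -> Tuple[str, float, List[str]]:
    """Toy Observation-Thought-Action loop with confidence-gated tools."""
    trace: List[str] = []
    anchors = retrieve_anchors(query, memory, k)
    trace.append("observation: anchors=" + ", ".join(n for n, _ in anchors))
    label, confidence = judge(query, anchors)
    trace.append(f"thought: {label} (confidence {confidence:.2f})")
    for tool_name, tool in tools.items():
        if confidence >= threshold:
            break  # confident enough, no further tool calls
        trace.append(f"action: {tool_name}")
        label, confidence = tool(query)
        trace.append(f"thought: {label} (confidence {confidence:.2f})")
    return label, confidence, trace


if __name__ == "__main__":
    # Hypothetical memory of authentic exemplars and stand-in components.
    memory = {
        "real_interview": [0.9, 0.1, 0.0],
        "real_concert": [0.1, 0.9, 0.2],
        "real_news": [0.8, 0.2, 0.1],
    }
    tools = {
        "zoom_region": lambda q: ("fake", 0.9),      # stub for spatial zoom
        "mel_spectrogram": lambda q: ("fake", 0.95), # stub for audio check
    }
    judge = lambda q, anchors: ("real", 0.4)         # stub for the model
    label, confidence, trace = run_loop([1.0, 0.0, 0.0], memory, judge, tools)
    for line in trace:
        print(line)
```

In this sketch the second tool is never called once the first raises confidence above the threshold, which mirrors the selective activation the abstract describes. A real system would replace the stubs with model calls and actual image and audio analysis.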

