Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

2504.03140v2

English Title

Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

English Abstract

Recent advances in diffusion models have demonstrated remarkable capabilities in video generation. However, computational intensity remains a significant challenge for practical applications. While feature caching has been proposed to reduce the computational burden of diffusion models, existing methods typically overlook the heterogeneous significance of individual blocks, resulting in suboptimal reuse and degraded output quality. We address this gap by introducing ProfilingDiT, a novel adaptive caching strategy that explicitly disentangles foreground-focused and background-focused blocks. Through a systematic analysis of attention distributions in diffusion models, we reveal two key observations: 1) Most layers exhibit a consistent preference for either foreground or background regions. 2) Predicted noise shows low inter-step similarity initially, which stabilizes as denoising progresses. These findings inspire us to formulate a selective caching strategy that preserves full computation for dynamic foreground elements while efficiently caching static background features. Extensive experiments demonstrate that our framework achieves significant acceleration (e.g., a 2.01× speedup for Wan2.1) while maintaining visual fidelity across comprehensive quality metrics, establishing a viable method for efficient video generation.
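
To make the selective caching idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each block already carries a foreground/background label (here the hypothetical `is_background` flag, imagined as the result of an offline attention-profiling pass), that caching starts only after a few warm-up steps because early predictions change quickly, and that cached background features are refreshed at a fixed interval. The names `Block`, `denoise`, `warmup_steps`, and `refresh_interval`, and the choice to cache each block's residual, are illustrative assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Block:
    """Toy stand-in for one transformer block (hypothetical, not the paper's code)."""
    name: str
    fn: Callable[[Vector], Vector]
    is_background: bool  # assumed label from an offline attention-profiling pass
    calls: int = 0       # number of full computations, to make the savings visible

    def __call__(self, x: Vector) -> Vector:
        self.calls += 1
        return self.fn(x)


def denoise(blocks: List[Block], x: Vector, num_steps: int,
            warmup_steps: int, refresh_interval: int) -> Vector:
    """Run a toy denoising loop that reuses cached residuals of background blocks."""
    cache: Dict[str, Vector] = {}
    for step in range(num_steps):
        # Assumption: no reuse during warm-up, when inter-step similarity is low.
        past_warmup = step >= warmup_steps
        # Assumption: background caches are refreshed every `refresh_interval` steps.
        refresh_due = past_warmup and (step - warmup_steps) % refresh_interval == 0
        for block in blocks:
            can_reuse = (block.is_background and past_warmup
                         and not refresh_due and block.name in cache)
            if can_reuse:
                residual = cache[block.name]  # skip the block, reuse its last residual
            else:
                out = block(x)                # full computation
                residual = [o - v for o, v in zip(out, x)]
                if block.is_background:
                    cache[block.name] = residual
            x = [v + r for v, r in zip(x, residual)]
    return x


# Foreground block is always computed; background block is mostly reused.
fg = Block("fg", lambda x: [0.9 * v for v in x], is_background=False)
bg = Block("bg", lambda x: [v + 0.1 for v in x], is_background=True)
result = denoise([fg, bg], [1.0, 2.0], num_steps=10, warmup_steps=4, refresh_interval=3)
print(fg.calls, bg.calls)  # 10 6
```

In this toy run the foreground block is evaluated at all 10 steps, while the background block is evaluated only at steps 0 to 3 (warm-up) and at steps 4 and 7 (refreshes). The actual schedule and the 2.01× speedup reported for Wan2.1 depend on details of ProfilingDiT that the abstract does not give.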
