Title#
Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation
Abstract#
Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.
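The key-frame selection idea in the abstract — scoring each frame by balancing segmentation confidence against vision-text alignment, then using the best frame as the reference — can be sketched roughly as below. This is a minimal illustration under assumptions: the function name, per-frame score inputs, and the linear weighting `alpha` are all hypothetical and not the authors' actual implementation.

```python
# Hypothetical sketch of key-frame selection: combine per-frame segmentation
# confidence and vision-text alignment into one score, then pick the
# best-scoring frame as the reference. The linear weighting is an assumption
# for illustration, not FindTrack's actual selection rule.

def select_key_frame(seg_confidence, text_alignment, alpha=0.5):
    """Return the index of the frame with the highest combined score.

    seg_confidence: per-frame mask confidence scores (floats in [0, 1]).
    text_alignment: per-frame vision-text alignment scores (floats in [0, 1]).
    alpha: assumed trade-off weight between the two criteria.
    """
    scores = [
        alpha * c + (1 - alpha) * a
        for c, a in zip(seg_confidence, text_alignment)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# The selected frame would then seed a propagation module (e.g. a
# memory-based video segmenter) that tracks the mask across all frames.
key_idx = select_key_frame([0.7, 0.9, 0.8], [0.6, 0.5, 0.9])
```

A frame that is merely easy to segment but weakly matched to the text (or vice versa) scores lower than one that is strong on both, which is the balancing behavior the abstract describes.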