Title#
Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation
Abstract#
Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.
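As a rough illustration of the decoupled design described in the abstract, the sketch below separates target identification (key-frame selection plus a single reference mask) from mask propagation. The weighting `alpha`, the `identifier`/`propagator` interfaces, and the scoring calls are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_key_frame(seg_confidences, text_alignments, alpha=0.5):
    """Pick the key frame by balancing per-frame segmentation confidence
    with vision-text alignment, as described in the abstract.

    seg_confidences, text_alignments: sequences of length num_frames,
    each assumed normalized to [0, 1]. `alpha` is a hypothetical
    weighting hyper-parameter, not taken from the paper.
    """
    scores = alpha * np.asarray(seg_confidences) + (1 - alpha) * np.asarray(text_alignments)
    return int(np.argmax(scores))

def segment_video(frames, prompt, identifier, propagator):
    """Decoupled pipeline: identify the target once on the key frame,
    then propagate its mask across the whole video.

    `identifier` and `propagator` stand in for the referring-segmentation
    and mask-propagation models; their interfaces are assumptions.
    """
    # Step 1: per-frame identification scores from the referring model.
    seg_conf, txt_align = zip(*(identifier.score(f, prompt) for f in frames))
    key_idx = select_key_frame(seg_conf, txt_align)

    # Step 2: a single reference mask for the target on the key frame.
    ref_mask = identifier.segment(frames[key_idx], prompt)

    # Step 3: propagate the reference mask forward and backward in time.
    return propagator.propagate(frames, key_index=key_idx, reference_mask=ref_mask)
```

The point of the sketch is the control flow: the language prompt is consulted only when choosing and segmenting the key frame, while propagation is a purely visual tracking problem, which is how the abstract argues ambiguity in target association is reduced.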