ViewActive: Active viewpoint optimization from a single image

2409.09997v5

Title#

ViewActive: Active viewpoint optimization from a single image

Abstract#

When observing objects, humans benefit from their spatial visualization and mental rotation ability to envision potential optimal viewpoints based on the current observation. This capability is crucial for enabling robots to achieve efficient and robust scene perception during operation, as optimal viewpoints provide essential and informative features for accurately representing scenes in 2D images, thereby enhancing downstream tasks. To endow robots with this human-like active viewpoint optimization capability, we propose ViewActive, a modernized machine learning approach inspired by aspect graphs, which provides viewpoint optimization guidance based solely on the current 2D image input. Specifically, we introduce the 3D Viewpoint Quality Field (VQF), a compact and consistent representation of the viewpoint quality distribution similar to an aspect graph, composed of three general-purpose viewpoint quality metrics: self-occlusion ratio, occupancy-aware surface normal entropy, and visual entropy. We utilize pre-trained image encoders to extract robust visual and semantic features, which are then decoded into the 3D VQF, allowing our model to generalize effectively across diverse objects, including unseen categories. The lightweight ViewActive network (72 FPS on a single GPU) significantly enhances the performance of state-of-the-art object recognition pipelines and can be integrated into real-time motion planning for robotic applications. Our code and dataset are available here: https://github.com/jiayi-wu-umd/ViewActive.
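To make the idea of an entropy-based viewpoint quality metric more concrete, here is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define "visual entropy", so this sketch assumes it can be approximated by the Shannon entropy of a grayscale pixel-intensity histogram. The function names `shannon_entropy`, `visual_entropy`, and `best_view` are hypothetical and do not come from the ViewActive repository.

```python
import math
from collections import Counter
from typing import Dict, Sequence

# A grayscale image as rows of integer intensities (illustrative type alias).
Image = Sequence[Sequence[int]]


def shannon_entropy(values: Sequence[int]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    # H = sum over distinct values of -p * log2(p)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def visual_entropy(gray_image: Image) -> float:
    """Assumed proxy for visual entropy: entropy of the intensity histogram."""
    pixels = [p for row in gray_image for p in row]
    return shannon_entropy(pixels)


def best_view(views: Dict[str, Image]) -> str:
    """Return the name of the candidate view with the highest proxy score."""
    return max(views, key=lambda name: visual_entropy(views[name]))


if __name__ == "__main__":
    # Two toy candidate views of the same object (made-up data).
    candidates = {
        "flat": [[7, 7], [7, 7]],    # a single intensity everywhere
        "varied": [[0, 1], [2, 3]],  # four distinct intensities
    }
    for name, image in candidates.items():
        print(name, visual_entropy(image))
    print("selected:", best_view(candidates))
```

In this toy setup, a view with more varied intensities scores higher than a uniform one. According to the abstract, ViewActive instead predicts a full 3D Viewpoint Quality Field from a single image and combines three metrics, so a histogram score like this only illustrates the kind of quantity being ranked.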
