HOLO: Holistic Lightweight Optimization for Scene Understanding with Auto-Annotation and Multimodal Learning
Abstract
Vision-language models (VLMs) have achieved remarkable success across various domains, yet their application to 3D scene understanding remains largely underexplored. Existing 3D VLMs predominantly focus on object-level tasks and emphasize instance-centric representations within a scene, lacking holistic scene-level descriptions. In this work, we propose an automated annotation framework that leverages multi-view images to partition 3D scenes into localized point cloud sub-regions, which are then enriched with precise semantic information, all without any manual intervention. We process ScanNet v2 and ScanNet++ with this framework to construct SceneCap, a large-scale dataset designed for scene-level description. To demonstrate the benefits of our framework for scene understanding, we introduce the Natural Interactive Universal Multimodal Observer (NIUMO-LLM), a lightweight yet high-performing model that adapts vision-language capabilities to comprehensive 3D scene understanding through training on SceneCap. We further demonstrate that NIUMO-LLM achieves state-of-the-art (SOTA) performance on both scene-description benchmarks and object-level tasks, while requiring only 12 hours of training on a single NVIDIA A800 GPU. This efficiency significantly reduces computational demands and lowers the barrier to MLLM-related research. For review purposes, we anonymously share representative samples at https://randomname432.github.io/HOLO/.