OpenSU3D: Open World 3D Scene Understanding using Foundation Models

Rafay Mohiuddin1*, Sai Manoj Prakhya2, Fiona Collins1, Ziyuan Liu2, Andre Borrmann1
1Technical University of Munich, 2Huawei Intelligent Cloud Technologies Lab

OpenSU3D constructs open-set 3D scene representations that facilitate a variety of open world scene understanding tasks.

Abstract

This study presents a novel, scalable approach for constructing open set, instance-level 3D scene representations, advancing open world understanding of 3D environments. Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature vector learning, limiting their efficacy with complex queries. Our method overcomes these limitations by incrementally building instance-level 3D scene representations using 2D foundation models, efficiently aggregating instance-level details such as masks, feature vectors, names, and captions. We introduce fusion schemes for feature vectors to enhance contextual integration and performance on complex queries. Additionally, we explore large language models for robust automatic annotation and spatial reasoning tasks. Evaluated on scenes from the ScanNet [1] and Replica [2] datasets, our method demonstrates zero-shot generalization capabilities and exceeds current state-of-the-art methods in open world 3D scene understanding.
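To illustrate how an instance-level representation can serve open vocabulary queries, here is a minimal sketch that ranks stored instances by cosine similarity to a query vector. It is an illustration under stated assumptions, not the paper's implementation: the record layout (`name`, `feature`), the helper names `cosine` and `rank_instances`, and the toy 2D vectors are all hypothetical. In practice the query vector would come from a text encoder, and the instance features from the fusion schemes mentioned above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_instances(query_vec, instances, top_k=3):
    """Return the top_k (score, name) pairs, highest similarity first.

    `instances` is a list of dicts with hypothetical keys "name" and "feature";
    one feature vector is stored per instance rather than per 3D point.
    """
    scored = [(cosine(query_vec, inst["feature"]), inst["name"]) for inst in instances]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy scene with made-up 2D features (real features would be high-dimensional).
scene = [
    {"name": "chair", "feature": [1.0, 0.0]},
    {"name": "lamp", "feature": [0.0, 1.0]},
    {"name": "sofa", "feature": [1.0, 1.0]},
]
best_matches = rank_instances([2.0, 0.1], scene, top_k=2)
```

Because the scene stores one vector per instance, the cost of a query grows with the number of instances rather than the number of points, which is the scalability argument the abstract makes against per-point features.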

Video

Comparison

Open Vocabulary Queries

Instance Queries

Affordance Queries

Property Queries

Relative Queries

Feature Fusion Schemes

Instance Segmentation

Automatic Annotation

Spatial Reasoning

Related Links

Our work emerged alongside a lot of other great research on 3D scene understanding, and the fast pace of AI research makes it hard to keep up with every new study in this field. Several concurrent methods, such as OpenIns3D, Segment3D, SayPlan, and LangSplat, explore open world 3D scene understanding.

Among them, OVSG and ConceptGraph align closely with our incremental, scalable instance-based representation approach.

Our approach relies only on geometric principles for merging 3D masks. While others build scene graphs for spatial reasoning, we leverage large language models' innate reasoning abilities through tailored prompts.
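As a rough illustration of merging 3D masks on geometric grounds alone, the sketch below voxelizes each incoming mask and merges it into an existing instance when their voxel overlap is high enough. The specific criterion used here (overlap over the smaller mask, with a fixed threshold) is an assumption chosen for simplicity and is not taken from the paper; the function names, `voxel_size`, and `threshold` are likewise hypothetical.

```python
import math

def voxelize(points, voxel_size=0.05):
    """Map an iterable of (x, y, z) points to a set of integer voxel indices."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}

def overlap_ratio(a, b):
    """Shared voxels divided by the size of the smaller set (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def merge_mask(instances, new_points, voxel_size=0.05, threshold=0.5):
    """Merge a new 3D mask into the best-overlapping instance, or start a new one.

    `instances` is a list of voxel sets and is modified in place.
    Returns the index of the instance that now contains the mask.
    """
    new_vox = voxelize(new_points, voxel_size)
    best_idx, best_ratio = -1, 0.0
    for i, vox in enumerate(instances):
        ratio = overlap_ratio(vox, new_vox)
        if ratio > best_ratio:
            best_idx, best_ratio = i, ratio
    if best_idx >= 0 and best_ratio >= threshold:
        instances[best_idx] |= new_vox  # geometric merge: union of occupied voxels
        return best_idx
    instances.append(new_vox)  # no sufficient overlap: treat as a new instance
    return len(instances) - 1

# Masks arrive one frame at a time, so the scene is built incrementally.
instances = []
first = merge_mask(instances, [(0.5, 0.5, 0.5), (1.5, 0.5, 0.5)], voxel_size=1.0)
second = merge_mask(instances, [(1.2, 0.1, 0.9), (2.5, 0.5, 0.5)], voxel_size=1.0)
third = merge_mask(instances, [(10.5, 10.5, 10.5)], voxel_size=1.0)
```

Only occupancy is compared, so no feature similarity or learned model is involved in the merge decision; masks that overlap an existing instance extend it, and isolated masks start a new one.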

BibTeX

@article{rafay2021opensu3d,
  author    = {Rafay Mohiuddin and Sai Manoj Prakhya and Fiona Collins and Ziyuan Liu and André Borrmann},
  title     = {OpenSU3D: Open World 3D Scene Understanding using Foundation Models},
  journal   = {arXiv preprint 2407.14279},
  year      = {2024},
}
  This work has been submitted to the IEEE for possible publication.
  Copyright may be transferred without notice, after which this version may no longer be accessible.