Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

¹National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI)   ²Department of Computer Science and Technology, Tsinghua University

TL;DR: We propose Semantic Gaussians, a versatile framework for open-vocabulary scene understanding on off-the-shelf 3D Gaussian Splatting scenes.



Figure 1. Overview of our Semantic Gaussians. We inject semantic features into off-the-shelf 3D Gaussian Splatting scenes by either projecting semantic features from pre-trained 2D encoders, directly predicting pointwise embeddings with a 3D semantic network, or fusing the two. The newly added semantic components of 3D Gaussians open up diverse applications centered around open-vocabulary scene understanding.
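Conceptually, the projection step assigns each 3D Gaussian the 2D feature at the pixel its center projects to. Below is a minimal PyTorch sketch of this idea under simplifying assumptions (a pinhole camera, a single view, nearest-pixel lookup); the function and argument names are illustrative, not the actual API of our codebase. In practice, features would be fused across all views in which a Gaussian is visible.

import torch

def project_features(means3d, feat_map, K, w2c):
    """Assign each 3D Gaussian the 2D feature at its projected pixel.

    means3d:  (N, 3) Gaussian centers in world coordinates.
    feat_map: (C, H, W) dense 2D semantic features from a pre-trained encoder.
    K:        (3, 3) camera intrinsics.
    w2c:      (4, 4) world-to-camera extrinsics.
    Returns:  (N, C) per-Gaussian features and an (N,) visibility mask.
    """
    C, H, W = feat_map.shape
    # Transform Gaussian centers into the camera frame (homogeneous coords).
    ones = torch.ones(means3d.shape[0], 1, device=means3d.device)
    cam = (w2c @ torch.cat([means3d, ones], dim=1).T).T[:, :3]
    in_front = cam[:, 2] > 1e-6
    # Pinhole projection to pixel coordinates.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    u, v = uv[:, 0].round().long(), uv[:, 1].round().long()
    visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Gather the 2D feature at each visible Gaussian's pixel.
    feats = torch.zeros(means3d.shape[0], C, device=means3d.device)
    feats[visible] = feat_map[:, v[visible], u[visible]].T
    return feats, visible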

Abstract

Open-vocabulary 3D scene understanding presents a significant challenge in computer vision, with wide-ranging applications in embodied agents and augmented reality systems. Existing methods adopt neural rendering as the 3D representation and jointly optimize color and semantic features to achieve rendering and scene understanding simultaneously. In this paper, we introduce Semantic Gaussians, a novel open-vocabulary scene understanding approach based on 3D Gaussian Splatting. Our key idea is to distill knowledge from 2D pre-trained models into 3D Gaussians. Unlike existing methods, we design a versatile projection approach that maps various 2D semantic features from pre-trained image encoders into a novel semantic component of 3D Gaussians; the projection is based on spatial relationships and requires no additional training. We further build a 3D semantic network that directly predicts the semantic component from raw 3D Gaussians for fast inference. Quantitative results on ScanNet semantic segmentation and LERF object localization demonstrate the superior performance of our method. Additionally, we explore several applications of Semantic Gaussians, including object part segmentation, instance segmentation, scene editing, and spatiotemporal segmentation, achieving better qualitative results than 2D and 3D baselines and highlighting the framework's versatility and effectiveness in supporting diverse downstream tasks.
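The 3D semantic network mentioned above is trained by distillation: its per-Gaussian predictions s3D are supervised by the projected 2D features s2D. A cosine-similarity loss is a common choice for this kind of feature distillation; the sketch below assumes it, and the exact objective and network architecture are not spelled out here, so treat the details as illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(s3d_pred, s2d_target, visible):
    """Align predicted 3D features with projected 2D features.

    s3d_pred:   (N, C) features predicted by the 3D semantic network.
    s2d_target: (N, C) features projected from pre-trained 2D encoders.
    visible:    (N,) mask of Gaussians that received a projected 2D feature.
    """
    pred = F.normalize(s3d_pred[visible], dim=-1)
    target = F.normalize(s2d_target[visible], dim=-1)
    # 1 - cosine similarity, averaged over supervised Gaussians.
    return (1.0 - (pred * target).sum(dim=-1)).mean()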


Figure 2. An illustration of the pipeline of Semantic Gaussians. Upper left: our projection framework maps various pre-trained 2D features to the semantic component s2D of 3D Gaussians. Bottom left: we additionally introduce a 3D semantic network that directly predicts the semantic component s3D from raw 3D Gaussians; it is supervised by the projected s2D. Right: given an open-vocabulary text query, we compare its embedding against the semantic components (s2D, s3D, or their fusion) of the 3D Gaussians. The matched Gaussians are splatted to render the 2D mask corresponding to the query.
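At query time, the comparison on the right of Figure 2 amounts to embedding the text prompt and scoring it against per-Gaussian features. The sketch below assumes the features live in a CLIP embedding space and uses the open_clip library with a similarity threshold; the checkpoint name and threshold value are illustrative assumptions, not fixed choices of our method.

import torch
import torch.nn.functional as F
import open_clip

def match_gaussians(text, gaussian_feats, threshold=0.25):
    """Return a boolean mask over Gaussians matching an open-vocabulary query.

    text:           free-form query string, e.g. "a wooden chair".
    gaussian_feats: (N, C) semantic components (s2D, s3D, or their fusion),
                    assumed to lie in the CLIP text-image embedding space.
    """
    # Illustrative CLIP checkpoint; any text encoder matching the feature
    # space of gaussian_feats would do.
    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-16", pretrained="laion2b_s34b_b88k")
    tokenizer = open_clip.get_tokenizer("ViT-B-16")
    with torch.no_grad():
        txt = model.encode_text(tokenizer([text]))  # (1, C)
    # Cosine similarity between the query and every Gaussian's feature.
    sim = F.cosine_similarity(gaussian_feats, txt, dim=-1)  # (N,)
    return sim > threshold  # matched Gaussians are splatted into a 2D mask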

Semantic Segmentation Results

Qualitative results of semantic segmentation on the ScanNet-20 dataset.

Part Segmentation Results

Qualitative results of part segmentation on the MVImgNet dataset.

Spatiotemporal Tracking Results

A demo of spatiotemporal tracking of human body parts and a basketball on the CMU Panoptic dataset.

Language-Guided Editing Results

Language-guided editing results on the room scene of the Mip-NeRF 360 dataset.

BibTeX

@misc{guo2024semantic,
      title={Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting},
      author={Jun Guo and Xiaojian Ma and Yue Fan and Huaping Liu and Qing Li},
      year={2024},
      eprint={2403.15624},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}