LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning

Junwoon Lee    Yulun Tian   
Robotics Department, University of Michigan
Under Review

We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings through model-specific decoders, LatentAM adopts an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism over a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity than state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets.

Proposed method overview
  • Model-agnostic, pretraining-free, real-time latent feature mapping: LatentAM represents each 3D Gaussian with a compact query vector and reconstructs high-dimensional VLM embeddings via Gaussian memory attention over a learnable dictionary, enabling plug-and-play use with different VLMs (e.g., CLIP / DINOv3 / LSeg).
  • Online dictionary learning: The dictionary is initialized and updated online via streaming K-means over incoming embeddings while retaining previously learned atoms, then trained with trust-region regularization to mitigate overfitting and catastrophic forgetting over long trajectories.
  • Scalable local-global map management: LatentAM scales to large environments using voxel hashing and bounded GPU memory by optimizing only an active local map on the GPU while storing/indexing the global map on the CPU, achieving near-real-time performance and long-term, large-scale mapping.
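The two learned components above can be illustrated concretely. Below is a minimal numpy sketch of the core idea: each Gaussian stores only a compact query vector, and a softmax attention over a small learnable dictionary (with matching keys) reconstructs the high-dimensional VLM embedding; a trust-region penalty keeps the dictionary close to its previous state during online updates. All function and variable names here are illustrative, not the released implementation.

```python
import numpy as np

def gaussian_memory_attention(queries, keys, dictionary):
    """Reconstruct high-dimensional VLM embeddings from compact per-Gaussian queries.

    queries:    (N, d_q)  compact query vectors, one per Gaussian
    keys:       (K, d_q)  learnable keys, one per dictionary atom
    dictionary: (K, d_f)  learnable atoms living in the VLM feature space
    Returns:    (N, d_f)  approximate VLM embeddings
    """
    logits = queries @ keys.T / np.sqrt(queries.shape[1])  # scaled dot-product scores
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)          # softmax over dictionary atoms
    return weights @ dictionary                            # convex combination of atoms

def trust_region_penalty(dictionary, dictionary_prev, lam=1e-2):
    """Quadratic penalty on dictionary drift, added to the online training loss
    to reduce overfitting / catastrophic forgetting over long trajectories."""
    return lam * np.sum((dictionary - dictionary_prev) ** 2)
```

Because only the dictionary lives in the VLM feature space, swapping the VLM (e.g., LSeg 512-dim vs. DINOv3 1024-dim) only changes the atom dimensionality, not the per-Gaussian storage.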

Online Mapping

TUM - freiburg1_desk: raw RGB, online rendered feature, and open-vocabulary queries (book, keyboard, game pad, telephone).

TUM - freiburg3_long_office_household: raw RGB, rendered feature, and open-vocabulary queries (bottle, book, keyboard, teddy bear, chair).

Replica - Room0: raw RGB, online rendered feature, and open-vocabulary queries (sofa, table, vase).

Replica - Room2: raw RGB, online rendered feature, and open-vocabulary queries (chair, table, vase).

Open-vocabulary 3D segmentation

Qualitative comparison (online): RGB input, Feature-3DGS (online), M3 (online), and LatentAM (ours).

*For Feature-3DGS and M3, we run them online using the same keyframes selected by the proposed method.
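The open-vocabulary queries shown above amount to comparing rendered per-pixel features against a text embedding from the same VLM. A minimal sketch of this standard querying step (function names are illustrative, and the threshold is an assumed hyperparameter, not a value from the paper):

```python
import numpy as np

def open_vocab_mask(feature_map, text_embedding, threshold=0.5):
    """Segment a rendered feature image with an open-vocabulary text query.

    feature_map:    (H, W, d) per-pixel features rendered from the latent Gaussian map
    text_embedding: (d,)      embedding of the query text from the same VLM
    Returns: (sim, mask) where sim is (H, W) cosine similarity and mask = sim > threshold
    """
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    sim = f @ t                      # (H, W) cosine similarity per pixel
    return sim, sim > threshold
```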

Model-agnostic 3D Latent Gaussian

Rendered latent features on the TUM3 and custom datasets with three VLM backbones: LSeg (512-dim), CLIP (768-dim), and DINOv3 (1024-dim).
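Supporting backbones of different dimensionality requires no pretraining: the dictionary can simply be re-seeded in the new feature space from streaming observations. A minimal sketch of a mini-batch streaming K-means update of the kind that could seed such atoms (all names are illustrative, and this is a generic streaming K-means step, not the paper's exact initializer):

```python
import numpy as np

def streaming_kmeans_step(centroids, counts, batch):
    """One mini-batch step of streaming K-means over incoming VLM embeddings.

    centroids: (K, d_f) current dictionary atoms
    counts:    (K,)     number of embeddings assigned to each atom so far
    batch:     (B, d_f) new embeddings from the current frame
    Each sample nudges its nearest atom with a per-atom decaying learning rate.
    """
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (B, K)
    nearest = d2.argmin(axis=1)
    for i, c in enumerate(nearest):
        counts[c] += 1
        lr = 1.0 / counts[c]                       # decaying per-atom step size
        centroids[c] += lr * (batch[i] - centroids[c])
    return centroids, counts
```

Changing the VLM (512-, 768-, or 1024-dim features) only changes `d_f`; the per-Gaussian query vectors are unaffected.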

BibTeX

@article{lee2026latentam,
  title     = {LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning},
  author    = {Lee, Junwoon and Tian, Yulun},
  journal   = {arXiv preprint arXiv:2602.12314},
  year      = {2026}
}
Under construction. More demos will be added soon.
Code will be released after acceptance.