LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning

Junwoon Lee    Yulun Tian   
Robotics Department, University of Michigan
Under Review

We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings through model-specific decoders, LatentAM adopts an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism over a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity than state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets.

Proposed method overview
  • Model-agnostic, pretraining-free, real-time latent feature mapping: LatentAM represents each 3D Gaussian with a compact query vector and reconstructs high-dimensional VLM embeddings via Gaussian memory attention over a learnable dictionary, enabling plug-and-play use with different VLMs (e.g., CLIP / DINOv3 / LSeg).
  • Online dictionary learning: The dictionary is initialized and updated online via streaming K-means over incoming embeddings while retaining previously learned atoms, then trained with trust-region regularization to mitigate overfitting and catastrophic forgetting over long trajectories.
  • Scalable local-global map management: LatentAM scales to large environments using voxel hashing and bounded GPU memory by optimizing only an active local map on the GPU while storing/indexing the global map on the CPU, achieving near-real-time performance and long-term, large-scale mapping.
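The two learned components above can be illustrated concretely. Below is a minimal numpy sketch of the core idea: each Gaussian stores only a compact query vector, and a softmax attention over a small learnable dictionary (with matching keys) reconstructs the high-dimensional VLM embedding; a trust-region penalty keeps the dictionary close to its previous state during online updates. All function and variable names here are illustrative, not the released implementation.

```python
import numpy as np

def gaussian_memory_attention(queries, keys, dictionary):
    """Reconstruct high-dimensional VLM embeddings from compact per-Gaussian queries.

    queries:    (N, d_q)  compact query vectors, one per Gaussian
    keys:       (K, d_q)  learnable keys, one per dictionary atom
    dictionary: (K, d_f)  learnable atoms living in the VLM feature space
    Returns:    (N, d_f)  approximate VLM embeddings
    """
    logits = queries @ keys.T / np.sqrt(queries.shape[1])  # scaled dot-product scores
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)          # softmax over dictionary atoms
    return weights @ dictionary                            # convex combination of atoms

def trust_region_penalty(dictionary, dictionary_prev, lam=1e-2):
    """Quadratic penalty on dictionary drift, added to the online training loss
    to reduce overfitting / catastrophic forgetting over long trajectories."""
    return lam * np.sum((dictionary - dictionary_prev) ** 2)
```

Because only the dictionary lives in the VLM feature space, swapping the VLM (e.g., LSeg 512-dim vs. DINOv3 1024-dim) only changes the atom dimensionality, not the per-Gaussian storage.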

Online Mapping

TUM - freiburg1_desk: raw RGB, online rendered feature, and open-vocabulary queries (book, keyboard, game pad, telephone).

TUM - freiburg3_long_office_household: raw RGB, rendered feature, and open-vocabulary queries (bottle, book, keyboard, teddy bear, chair).

Replica - Room0: raw RGB, online rendered feature, and open-vocabulary queries (sofa, table, vase).

Replica - Room2: raw RGB, online rendered feature, and open-vocabulary queries (chair, table, vase).

Open-vocabulary 3D segmentation

Qualitative comparison (online): RGB input, Feature-3DGS (online), M3 (online), and LatentAM (ours).

*For Feature-3DGS and M3, we run them online using the same keyframes selected by the proposed method.
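The open-vocabulary queries shown above amount to comparing rendered per-pixel features against a text embedding from the same VLM. A minimal sketch of this standard querying step (function names are illustrative, and the threshold is an assumed hyperparameter, not a value from the paper):

```python
import numpy as np

def open_vocab_mask(feature_map, text_embedding, threshold=0.5):
    """Segment a rendered feature image with an open-vocabulary text query.

    feature_map:    (H, W, d) per-pixel features rendered from the latent Gaussian map
    text_embedding: (d,)      embedding of the query text from the same VLM
    Returns: (sim, mask) where sim is (H, W) cosine similarity and mask = sim > threshold
    """
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    sim = f @ t                      # (H, W) cosine similarity per pixel
    return sim, sim > threshold
```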

Model-agnostic 3D Latent Gaussian

Rendered latent features on the TUM3 and custom datasets with three VLM backbones: LSeg (512-dim), CLIP (768-dim), and DINOv3 (1024-dim).
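Supporting backbones of different dimensionality requires no pretraining: the dictionary can simply be re-seeded in the new feature space from streaming observations. A minimal sketch of a mini-batch streaming K-means update of the kind that could seed such atoms (all names are illustrative, and this is a generic streaming K-means step, not the paper's exact initializer):

```python
import numpy as np

def streaming_kmeans_step(centroids, counts, batch):
    """One mini-batch step of streaming K-means over incoming VLM embeddings.

    centroids: (K, d_f) current dictionary atoms
    counts:    (K,)     number of embeddings assigned to each atom so far
    batch:     (B, d_f) new embeddings from the current frame
    Each sample nudges its nearest atom with a per-atom decaying learning rate.
    """
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (B, K)
    nearest = d2.argmin(axis=1)
    for i, c in enumerate(nearest):
        counts[c] += 1
        lr = 1.0 / counts[c]                       # decaying per-atom step size
        centroids[c] += lr * (batch[i] - centroids[c])
    return centroids, counts
```

Changing the VLM (512-, 768-, or 1024-dim features) only changes `d_f`; the per-Gaussian query vectors are unaffected.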

BibTeX

@article{lee2026latentam,
  title     = {LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning},
  author    = {Lee, Junwoon and Tian, Yulun},
  journal   = {arXiv preprint arXiv:2602.12314},
  year      = {2026}
}
Under construction. More demos will be added soon.
Code will be released after acceptance.