MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding

CVPR 2024


Abstract

3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle to interpret complex linguistic queries, particularly descriptions that involve multiple anchors or are view-dependent. In response, we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique, enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore, MiKASA improves the explainability of decision-making, facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets, excelling by a large margin in categories that require viewpoint-dependent descriptions.
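The multi-key-anchor idea can be pictured as describing each candidate object relative to several anchor objects mentioned in the query, rather than in absolute scene coordinates. The snippet below is a minimal sketch of that intuition only; the function name and the plain coordinate differences are illustrative assumptions, not the paper's actual, more involved technique.

import torch

def multi_anchor_relative_coords(obj_centers, anchor_idx):
    """Illustrative only: re-express every candidate object's position
    relative to several anchor objects.

    obj_centers: (N, 3) tensor of object centre coordinates in the scene.
    anchor_idx:  indices of the anchor objects named in the description.
    Returns an (N, K, 3) tensor of offsets of each object w.r.t. each anchor.
    """
    anchors = obj_centers[anchor_idx]                        # (K, 3)
    return obj_centers.unsqueeze(1) - anchors.unsqueeze(0)   # (N, K, 3)

if __name__ == "__main__":
    centers = torch.tensor([[0.0, 0.0, 0.0],
                            [1.0, 2.0, 0.0],
                            [3.0, 1.0, 0.5]])
    rel = multi_anchor_relative_coords(centers, anchor_idx=[1, 2])
    print(rel.shape)  # torch.Size([3, 2, 3]): each object relative to both anchors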


Method



Architecture of our 3D visual grounding model, which includes four main modules: a text encoder (BERT), a vision module with a scene-aware object encoder, a spatial module that fuses spatial and textual data, and a multi-layered fusion module. The fusion module combines text, spatial, and object features, employing a dual-scoring system for enhanced object category identification and spatial-language assessment.
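The module-level data flow in the figure can be summarized in code. The following PyTorch sketch only illustrates that flow under assumed feature dimensions; every layer is a simple placeholder (the actual model uses BERT as the text encoder and considerably richer spatial and fusion modules), and none of the names below come from the released implementation.

import torch
import torch.nn as nn

class MiKASASketch(nn.Module):
    """Illustrative skeleton of the four-module pipeline; only the data
    flow mirrors the figure, all layers are placeholders."""

    def __init__(self, d_model=256):
        super().__init__()
        # Text encoder (BERT in the paper; a projection of precomputed
        # 768-d token features stands in for it here).
        self.text_proj = nn.Linear(768, d_model)
        # Scene-aware object encoder: self-attention over per-object features.
        self.obj_proj = nn.Linear(1024, d_model)
        self.scene_attn = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Spatial module: fuses box geometry with the textual features.
        self.spatial_proj = nn.Linear(6, d_model)  # box centre (3) + size (3)
        self.spatial_fuse = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Multi-layered fusion module feeding the dual-scoring heads.
        self.fusion = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.category_head = nn.Linear(d_model, 1)  # object-category score
        self.spatial_head = nn.Linear(d_model, 1)   # spatial-language score

    def forward(self, text_feats, obj_feats, boxes):
        # text_feats: (B, T, 768), obj_feats: (B, N, 1024), boxes: (B, N, 6)
        text = self.text_proj(text_feats)
        objs = self.scene_attn(self.obj_proj(obj_feats))             # scene-aware object features
        spatial, _ = self.spatial_fuse(self.spatial_proj(boxes), text, text)
        fused = self.fusion(objs + spatial, text)                    # fuse text, spatial, object features
        cat_score = self.category_head(fused).squeeze(-1)            # (B, N)
        spa_score = self.spatial_head(fused).squeeze(-1)             # (B, N)
        return cat_score, spa_score                                  # dual-scoring output

if __name__ == "__main__":
    model = MiKASASketch()
    cat, spa = model(torch.randn(2, 24, 768), torch.randn(2, 52, 1024), torch.randn(2, 52, 6))
    print(cat.shape, spa.shape)  # two (2, 52) score maps over candidate objects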



Example



Visual representation of the model's decision-making process in diverse situations. Rows, from top to bottom, depict: (1) choices determined by the category score, (2) choices determined by the spatial score, (3) our model's final selection after combining both scores, and (4) the ground truth. Columns, from left to right, show different scenarios. The green bounding box marks the chosen object, and the red bounding boxes mark the unchosen distractors.
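Keeping the category and spatial scores separate until the end is what makes this kind of error diagnosis possible: when the final prediction is wrong, one can check which of the two scores it agrees with. The sketch below shows one plausible way to combine and inspect the two scores; the additive log-softmax fusion and the function name are assumptions for illustration, not necessarily the exact rule used in the paper.

import torch

def combine_and_diagnose(cat_score, spa_score):
    """Combine per-object category and spatial scores and report which
    score each decision agrees with (illustrative fusion rule only)."""
    cat_log = torch.log_softmax(cat_score, dim=-1)
    spa_log = torch.log_softmax(spa_score, dim=-1)
    final = cat_log + spa_log                   # joint score per candidate object

    pred = final.argmax(dim=-1)                 # the model's final selection
    by_category = cat_log.argmax(dim=-1)        # what the category score alone would pick
    by_spatial = spa_log.argmax(dim=-1)         # what the spatial score alone would pick
    return pred, by_category, by_spatial

if __name__ == "__main__":
    cat = torch.tensor([[2.0, 0.5, 0.1, 1.8]])  # four candidate objects, one query
    spa = torch.tensor([[0.2, 0.3, 0.1, 2.5]])
    pred, by_cat, by_spa = combine_and_diagnose(cat, spa)
    # If the prediction is wrong, comparing it with the per-score choices tells us
    # whether category recognition or spatial reasoning was the failure point.
    print(pred.item(), by_cat.item(), by_spa.item())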



Results



Our results compared with existing works.


BibTeX

@inproceedings{chang2024mikasa,
  title={MiKASA: Multi-Key-Anchor \& Scene-Aware Transformer for 3D Visual Grounding},
  author={Chang, Chun-Peng and Wang, Shaoxiang and Pagani, Alain and Stricker, Didier},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={14131--14140},
  year={2024}
}

Acknowledgements

This research has been partially funded by the EU project FLUENTLY (GA: Nr 101058680) and the BMBF project SocialWear (01IW20002). This project is built upon the following repositories: ReferIt3D and MVT. We thank the authors for their excellent work!