Revisiting Retentive Networks for Fast Range-View 3D LiDAR Semantic Segmentation
Abstract
LiDAR semantic segmentation is a crucial task in autonomous driving and robotics, where real-time performance is essential for online decision-making. Recent trends exploit range images and Vision Transformers, which rely on the self-attention mechanism. However, these approaches often lack explicit spatial priors and involve a large number of parameters. To tackle these limitations, we propose a novel method that adapts the Retentive Network architecture from the Natural Language Processing (NLP) field, leveraging its efficient sequence modeling capabilities to operate directly on the range-view representation. Our approach incorporates a circular retention (CiR) mechanism that explicitly captures spatial relationships and the continuous circular structure of the range image, while modeling long-range dependencies and preserving the receptive field. In addition, we introduce a new set of range-view augmentations, adapted from 3D techniques, to improve generalization and mitigate class imbalance. Extensive experiments on three large-scale datasets, namely SemanticKITTI, PandaSet, and SemanticPOSS, demonstrate that our method achieves state-of-the-art performance among range-view approaches on two out of three datasets, while satisfying real-time constraints. The code of our method is available at [REMOVED DUE TO ANONYMOUS SUBMISSION].