3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse history BEV information. Our approach achieves a new state-of-the-art of 56.9% NDS on the nuScenes `test` set, 9.0 points higher than the previous best art and on par with the performance of LiDAR-based baselines.
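As a rough illustration of how these pieces fit together, below is a minimal PyTorch sketch of one encoder layer operating on grid-shaped BEV queries. It is not the released implementation: standard multi-head attention stands in for the deformable attention used in the paper, and the class name, grid size, and dimensions are assumptions chosen for readability.

```python
# Minimal sketch (not the official implementation) of one BEVFormer encoder
# layer: grid-shaped BEV queries attend to the previous frame's BEV features
# (temporal self-attention) and then to multi-camera image features (spatial
# cross-attention). Standard multi-head attention stands in for the deformable
# attention used in the paper; all names and sizes are illustrative.
import torch
import torch.nn as nn


class BEVFormerLayerSketch(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, bev_h=200, bev_w=200):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # Learnable grid-shaped BEV queries, one per BEV cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4), nn.ReLU(), nn.Linear(embed_dim * 4, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, img_feats, prev_bev=None):
        # img_feats: (B, N_cam * H * W, C) flattened multi-camera features.
        b = img_feats.size(0)
        bev = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        # Temporal self-attention: fuse history BEV (fall back to self-attention
        # over the current queries when no history is available).
        history = prev_bev if prev_bev is not None else bev
        bev = self.norm1(bev + self.temporal_attn(bev, history, history)[0])
        # Spatial cross-attention: each BEV query aggregates image features.
        bev = self.norm2(bev + self.spatial_attn(bev, img_feats, img_feats)[0])
        return self.norm3(bev + self.ffn(bev))
```

In the paper, several such layers are stacked, and the history BEV is first aligned to the current frame according to ego motion before temporal self-attention fuses it.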
Features
- Cutting-edge Baseline for Camera-based Detection
- In this work, the authors present a new framework termed BEVFormer
- BEVFormer exploits both spatial and temporal information
- The proposed approach achieves a new state-of-the-art of 56.9% NDS on the nuScenes test set
- To aggregate spatial information, the authors design spatial cross-attention, in which each BEV query extracts spatial features from regions of interest across camera views (see the projection sketch after this list)
- On par with the performance of LiDAR-based baselines
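Since the feature list references the spatial cross-attention design, a small sketch of the underlying geometry may help: each BEV cell is lifted to a few 3D reference points and projected into every camera, and only cameras whose projection lands inside the image contribute features for that query. Everything below (function name, point-cloud range, tensor shapes) is a hypothetical illustration, not the repository's API.

```python
# Illustrative sketch of the geometry behind spatial cross-attention: each BEV
# grid cell is lifted to a few 3D points (a "pillar") and projected into every
# camera with its lidar-to-image matrix, so a query only samples features from
# views where the projection is valid. Names and ranges are assumptions.
import torch


def project_bev_points(lidar2img, bev_h=200, bev_w=200, num_z=4,
                       pc_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0)):
    """Return pixel coords (N_cam, bev_h*bev_w*num_z, 2) and a validity mask."""
    xs = torch.linspace(pc_range[0], pc_range[3], bev_w)
    ys = torch.linspace(pc_range[1], pc_range[4], bev_h)
    zs = torch.linspace(pc_range[2], pc_range[5], num_z)
    grid = torch.stack(torch.meshgrid(xs, ys, zs, indexing="ij"), dim=-1)  # (W, H, Z, 3)
    pts = grid.reshape(-1, 3)
    pts_h = torch.cat([pts, torch.ones(pts.size(0), 1)], dim=-1)  # homogeneous coords
    # lidar2img: (N_cam, 4, 4) projection matrices, one per camera.
    cam = torch.einsum("nij,pj->npi", lidar2img, pts_h)
    depth = cam[..., 2:3].clamp(min=1e-5)
    uv = cam[..., :2] / depth        # pixel coordinates in each camera
    valid = cam[..., 2] > 0          # point lies in front of the camera
    return uv, valid
```

Restricting each BEV query to the camera views it actually projects into is what keeps the cross-attention sparse and affordable for large BEV grids.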