BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection
Single-frame data contains finite information, which limits the performance of existing vision-based multi-camera 3D object detection paradigms. To fundamentally push the performance boundary in this area, BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space. We upgrade the framework with only a few modifications, just for fusing the feature from the previous frame with the corresponding one in the current frame. In this way, with a negligible extra computing budget, we enable the algorithm to access temporal cues by querying and comparing the two candidate features. Beyond this, we also simplify the velocity learning task by removing the factors of ego-motion and time, which equips BEVDet4D with robust generalization performance and reduces the velocity error by 52.8%, becoming comparable in this aspect with methods that rely on LiDAR or radar. On the challenging nuScenes benchmark, we report a new record of 51.5% NDS with the high-performance configuration dubbed BEVDet4D-Base, which surpasses the previous leading method BEVDet by +4.3% NDS.
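The core temporal-fusion idea described above can be sketched in a few lines: warp the previous frame's BEV feature map into the current ego frame to compensate for ego-motion, then concatenate it with the current feature along the channel axis. The sketch below is a simplified illustration under assumed conventions (integer-cell shift instead of the full ego-motion transform with bilinear interpolation; function names `align_prev_bev` and `fuse_bev` are hypothetical, not from the paper's code).

```python
import numpy as np

def align_prev_bev(prev_bev: np.ndarray, shift: tuple) -> np.ndarray:
    """Shift a (C, H, W) BEV feature map by an integer (dy, dx) number of
    grid cells, zero-padding cells that fall outside the grid.

    This is a toy stand-in for the spatial alignment step: the actual
    method applies the ego-motion transform with interpolation."""
    c, h, w = prev_bev.shape
    dy, dx = shift
    out = np.zeros_like(prev_bev)
    # Destination and source windows for the shifted copy.
    ys_dst = slice(max(dy, 0), h + min(dy, 0))
    xs_dst = slice(max(dx, 0), w + min(dx, 0))
    ys_src = slice(max(-dy, 0), h + min(-dy, 0))
    xs_src = slice(max(-dx, 0), w + min(-dx, 0))
    out[:, ys_dst, xs_dst] = prev_bev[:, ys_src, xs_src]
    return out

def fuse_bev(prev_bev: np.ndarray, curr_bev: np.ndarray,
             ego_shift: tuple) -> np.ndarray:
    """Align the previous BEV feature to the current ego frame, then
    concatenate the two maps along the channel dimension, doubling C."""
    aligned = align_prev_bev(prev_bev, ego_shift)
    return np.concatenate([aligned, curr_bev], axis=0)
```

With this alignment, a static object occupies the same BEV cell in both feature maps, so the subsequent detection head can infer velocity from the residual displacement alone, independent of ego-motion.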