Learning Multi-frame Visual Representation for Joint Detection and Tracking of Small Objects
Deep convolutional and recurrent neural networks have delivered significant advances in object detection and tracking. However, current models handle detection and tracking with separate networks, and deep-learning-based joint detection and tracking has not yet been explored despite its potential benefits to both tasks. In this study, we present an integrated neural model called the Recurrent Correlational Network for joint detection and tracking, where both tasks are performed over a multi-frame representation learned through a single, end-to-end trainable network. Detection benefits from the tracker's stabilized trajectories, and tracking benefits from the richer representation learned by training the detector. We show that recently developed convolutional long short-term memory (ConvLSTM) networks can learn a multi-frame, multi-task representation that serves both tasks. In experiments, we tackled the detection of small flying objects, such as birds and unmanned aerial vehicles, which are challenging for single-frame detectors. The proposed model consistently improved detection performance compared with deep single-frame detectors and currently used motion-based detectors.
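To make the idea of a shared multi-frame representation concrete, below is a minimal PyTorch sketch, not the authors' implementation, of how a ConvLSTM could aggregate features across frames and feed both a detection head and a correlation-based tracking head. All class names, layer sizes, and tensor shapes (`ConvLSTMCell`, `JointDetectTrackSketch`, `feat_ch`, the template patch) are illustrative assumptions rather than details taken from the paper.

```python
# A hedged sketch of joint detection and tracking over ConvLSTM features.
# All module names, channel counts, and shapes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Single ConvLSTM cell: gates are computed with convolutions instead of
    fully connected layers, so the spatial layout of features is preserved."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c


class JointDetectTrackSketch(nn.Module):
    """Shared ConvLSTM features feed (a) a per-pixel detection score map and
    (b) a cross-correlation response against a target template for tracking."""

    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.convlstm = ConvLSTMCell(feat_ch, feat_ch)
        self.det_head = nn.Conv2d(feat_ch, 1, 1)  # objectness score map

    def forward(self, frames, template):
        # frames: list of (B, C, H, W) tensors; template: (B, feat_ch, h, w)
        B, _, H, W = frames[0].shape
        h = frames[0].new_zeros(B, self.convlstm.hid_ch, H, W)
        c = torch.zeros_like(h)
        for x in frames:  # accumulate temporal context frame by frame
            h, c = self.convlstm(self.encoder(x), (h, c))
        det_map = self.det_head(h)  # detection branch on the shared features
        # tracking branch: per-sample cross-correlation of template with features
        resp = torch.stack(
            [F.conv2d(h[b:b + 1], template[b:b + 1]) for b in range(B)]).squeeze(1)
        return det_map, resp


# Example usage with random tensors (shapes are illustrative only):
model = JointDetectTrackSketch()
frames = [torch.randn(2, 3, 64, 64) for _ in range(5)]
template = torch.randn(2, 32, 15, 15)
det_map, resp = model(frames, template)
```

The point of the sketch is only that one recurrent feature map can be read by two heads at once, which is the joint-learning property the abstract attributes to the Recurrent Correlational Network; the actual architecture, losses, and training procedure are described in the full paper.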