Exploiting Spatial Sparsity for Event Cameras with Visual Transformers
Event cameras report local changes of brightness through an asynchronous stream of output events. Events are spatially sparse at pixel locations with little brightness variation. We propose using a visual transformer (ViT) architecture to leverage its ability to process a variable-length input. The input to the ViT consists of events that are accumulated into time bins and spatially separated into non-overlapping sub-regions called patches. Patches are selected when the number of nonzero pixel locations within a sub-region is above a threshold. We show that by fine-tuning a ViT model on the selected active patches, we can reduce the average number of patches fed into the backbone during the inference by at least 50 the classification accuracy on the N-Caltech101 dataset. This reduction translates into a decrease of 51 an increase of 46
READ FULL TEXT