SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners
Self-supervised Masked Autoencoders (MAE) are emerging as a new pre-training paradigm in computer vision. MAE learns semantics implicitly by reconstructing local patches and requires thousands of pre-training epochs to achieve favorable performance. This paper incorporates explicit supervision, i.e., golden labels, into the MAE framework. Unlike standard supervised pre-training, where all image patches are used, the proposed Supervised MAE (SupMAE) exploits only a visible subset of image patches for classification. SupMAE is efficient and can achieve performance comparable to MAE using only 30% of the compute when evaluated on ImageNet with the ViT-B/16 model. Detailed ablation studies are conducted to verify the proposed components.
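The idea of classifying from only a visible subset of patches can be illustrated with a toy sketch. This is not the authors' implementation: the 75% mask ratio, the mean-pooling step, the linear head, and all function names (`random_visible_subset`, `mean_pool`, `linear_logits`) are assumptions chosen for illustration, and random feature vectors stand in for a real ViT encoder.

```python
import random

def random_visible_subset(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches that stay visible (hypothetical helper)."""
    num_keep = round(num_patches * (1.0 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))

def mean_pool(features):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def linear_logits(pooled, weight, bias):
    """Apply a linear classification head: one weight row per class."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]

rng = random.Random(0)

# Assumed toy sizes: 196 patches (a 14 x 14 grid), 8-dim features, 5 classes.
num_patches, feat_dim, num_classes = 196, 8, 5
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
                  for _ in range(num_patches)]

# Assumed mask ratio of 0.75: only a quarter of the patches are processed.
visible_idx = random_visible_subset(num_patches, mask_ratio=0.75, rng=rng)
visible = [patch_features[i] for i in visible_idx]

# Classification uses the visible subset only, never the masked patches.
weight = [[rng.gauss(0.0, 0.1) for _ in range(feat_dim)]
          for _ in range(num_classes)]
bias = [0.0] * num_classes
logits = linear_logits(mean_pool(visible), weight, bias)
```

Under these assumptions only 49 of the 196 patches pass through the classifier, which gives an intuition for where compute savings could come from relative to supervised pre-training on all patches.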