Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular for improved prediction and efficient use of limited resources. Fine-tuning requires identification of best models to transfer-learn from and quantifying transferability prevents expensive re-training on all of the candidate models/tasks pairs. We show that the statistical problems with covariance estimation drive the poor performance of H-score [Bao et al., 2019] – a common baseline for newer metrics – and propose shrinkage-based estimator. This results in up to 80 H-score correlation performance, making it competitive with the state-of-the-art LogME measure by You et al. [2021]. Our shrinkage-based H-score is 3-55 times faster to compute compared to LogME. Additionally, we look into a less common setting of target (as opposed to source) task selection. We identify previously overlooked problems in such settings with different number of labels, class-imbalance ratios etc. for some recent metrics e.g., LEEP [Nguyen et al., 2020] that resulted in them being misrepresented as leading measures. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings. We also outline the difficulties of comparing feature-dependent metrics, both supervised (e.g. H-score) and unsupervised measures (e.g., Maximum Mean Discrepancy [Long et al., 2015]), across source models/layers with different feature embedding dimension. We show that dimensionality reduction methods allow for meaningful comparison across models and improved performance of some of these measures. We investigate performance of 14 different supervised and unsupervised metrics and demonstrate that even unsupervised metrics can identify the leading models for domain adaptation. We support our findings with 65,000 (fine-tuning trials) experiments.
READ FULL TEXT