It is notoriously difficult to train Transformers on small datasets;
typ...
Neural network weights are typically initialized at random from univaria...
Although convolutional networks have been the dominant architecture for
...
Recent work has highlighted several advantages of enforcing orthogonalit...