Cascades are a classical strategy to enable inference cost to vary adapt...
Prompt-tuning is an emerging strategy to adapt large language models (LL...
The impressive generalization performance of modern neural networks is
a...
Despite the popularity and efficacy of knowledge distillation, there is
...
Large neural models (such as Transformers) achieve state-of-the-art
perf...
Large language models (LLMs) have led to a series of breakthroughs in na...
This paper studies the curious phenomenon for machine learning models wi...
Many modern high-performing machine learning models such as GPT-3 primar...
We revisit the problem of learning mixtures of spherical Gaussians. Give...
Long-tail learning is the problem of learning under skewed label
distrib...
In classical federated learning, the clients contribute to the overall
t...
Scaling neural networks to "large" sizes, with billions of parameters, h...
Negative sampling schemes enable efficient training given a large number...
Distillation is the technique of training a "student" model based on exa...
Standard training techniques for neural networks involve multiple source...
Large Transformer models have achieved impressive performance in many na...
Transformer networks use pairwise attention to compute contextual embedd...
Knowledge distillation is a technique for improving the performance of a...
Modern retrieval problems are characterised by training sets with potent...
We consider learning a multi-class classification model in the federated...
Recently, there has been a surge of interest in representation learning ...
In this paper, we present distributed generalized clustering algorithms ...
Attention based Transformer architecture has enabled significant advance...
Many performance critical systems today must rely on performance
enhance...
Despite the widespread adoption of Transformer models for NLP tasks, the...
The computational cost of training with softmax cross entropy loss grows...
In the problem of structured signal recovery from high-dimensional linea...
This paper considers the problem of implementing large-scale gradient de...
Rectified linear units, or ReLUs, have become the preferred activation
f...
We study the problem of recovering a structured signal x_0 from
high-dim...
This paper addresses the problem of constructing MDS codes that enable e...