RetVec: Resilient and Efficient Text Vectorizer

02/18/2023
by Elie Bursztein et al.

This paper describes RetVec, a resilient multilingual embedding scheme designed for neural-based text processing, including small-text classification and large-language models. RetVec combines a novel character encoding with an optional small model to embed words into a 256-dimensional vector space. These embeddings enable the training of competitive multilingual text models that are resilient to typos and adversarial attacks. In this paper, we evaluate and compare RetVec against state-of-the-art tokenizers and word embeddings on common model architectures. These comparisons demonstrate that RetVec leads to competitive models that are significantly more resilient to text perturbations across a variety of common tasks. RetVec is available under the Apache 2 license at <https://github.com/[anonymized]>.
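To illustrate the general idea of a character-level word encoding, here is a minimal sketch, not the paper's exact scheme: each character is represented by the binary expansion of its Unicode codepoint, giving every word a fixed-size matrix that a small projection model could then map into a 256-dimensional vector. The constants `MAX_CHARS` and `BITS` below are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of a character-level binary word encoding.
# Assumption: per-word character budget and bit width are arbitrary choices
# here; the paper's actual encoding may differ.

MAX_CHARS = 16   # assumed maximum characters kept per word
BITS = 24        # 24 bits cover every Unicode codepoint (max 0x10FFFF)

def encode_word(word: str) -> list[list[float]]:
    """Return a MAX_CHARS x BITS binary matrix for `word` (LSB first per row)."""
    rows = []
    for ch in word[:MAX_CHARS]:
        cp = ord(ch)
        rows.append([float((cp >> b) & 1) for b in range(BITS)])
    # Pad short words with zero rows so every word has the same shape.
    while len(rows) < MAX_CHARS:
        rows.append([0.0] * BITS)
    return rows

matrix = encode_word("RetVec")
```

A useful property of encodings in this family is locality: a single-character typo changes only one row of the matrix, which is one intuition for why character-based embeddings can be more robust to perturbations than subword tokenizers, where a typo can change the entire token sequence.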
