Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
Jina Embeddings is a set of high-performance sentence embedding models that translate textual inputs into numerical representations capturing the semantics of the text. The models excel in applications such as dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, gives in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). To increase the models' awareness of negation, we constructed a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
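As a rough illustration of how sentence embeddings serve semantic textual similarity, dense retrieval, and triplet-style evaluation, the sketch below compares vectors by cosine similarity. It does not call any Jina model: the three-dimensional vectors are made up for illustration, and the helper names (`cosine_similarity`, `rank_by_similarity`, `triplet_is_satisfied`) are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Dense retrieval in miniature: document indices, most similar first."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def triplet_is_satisfied(anchor, positive, negative):
    """True if the anchor is closer to the positive than to the negative."""
    return cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)


# Toy vectors (assumed, not real model output). In a negation triplet the
# negative would be the embedding of the negated statement.
anchor = [0.9, 0.1, 0.2]
positive = [0.8, 0.2, 0.1]
negative = [0.1, 0.9, 0.3]

print(triplet_is_satisfied(anchor, positive, negative))
print(rank_by_similarity(anchor, [negative, positive]))
```

In practice the vectors would come from an embedding model, and a negation-aware model would be expected to satisfy such triplets when the negative is the negated form of the anchor.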