Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes

12/11/2019
by Masaki Oguni, et al.

On websites that collect user-generated recipes, users often post recipes in which a major component, such as the cooking instructions, is very similar to that of other recipes. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the Word Mover's Distance, which calculates the distance between texts based on word embeddings, to character 3-gram embeddings. Using a corpus of over 1.21 million recipes, we learned word embeddings and character 3-gram embeddings with a Skip-Gram model with negative sampling and with fastText, and used them to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. The results showed that the proposed method detected near-duplicate recipes that the comparison method missed.
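
As a rough illustration of the idea of a mover's distance over character 3-grams, the following Python sketch splits two texts into overlapping character 3-grams and compares them. It is not the authors' implementation. Two things are assumptions made only for this example: `toy_embedding` is a hypothetical stand-in that produces deterministic pseudo-random vectors instead of embeddings learned with Skip-Gram or fastText, and the distance is the simpler "relaxed" variant of the Word Mover's Distance (each 3-gram sends all of its weight to its nearest counterpart) rather than the exact optimal-transport solution.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def toy_embedding(gram, dim=8):
    """Hypothetical stand-in for a learned embedding.

    Seeds a generator with the n-gram itself, so the same n-gram
    always maps to the same pseudo-random vector.
    """
    rng = random.Random(gram)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def normalized_bag(grams):
    """Map each distinct n-gram to its relative frequency (weights sum to 1)."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def one_sided_cost(src, dst, embed):
    """Move every source n-gram's full weight to its nearest target n-gram."""
    return sum(
        w * min(math.dist(embed(g), embed(h)) for h in dst)
        for g, w in src.items()
    )


def relaxed_ngram_movers_distance(a, b, n=3, embed=toy_embedding):
    """Relaxed mover's distance between two texts over character n-grams.

    Takes the larger of the two one-sided costs, which makes the
    result symmetric in a and b.
    """
    bag_a = normalized_bag(char_ngrams(a, n))
    bag_b = normalized_bag(char_ngrams(b, n))
    if not bag_a or not bag_b:
        raise ValueError("both texts must contain at least n characters")
    return max(one_sided_cost(bag_a, bag_b, embed),
               one_sided_cost(bag_b, bag_a, embed))


if __name__ == "__main__":
    # Two made-up instruction fragments, used only for illustration.
    print(char_ngrams("玉ねぎを炒める"))
    print(relaxed_ngram_movers_distance("玉ねぎを炒める", "玉ねぎをよく炒める"))
```

Working at the character level means no word segmentation step is needed before comparing Japanese texts. With real learned embeddings in place of `toy_embedding`, 3-grams that occur in similar contexts would be close in the embedding space, so texts sharing few exact 3-grams could still receive a small distance.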
