Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes
On websites that collect user-generated recipes, users often post recipes in which a major component, such as the cooking instructions, is very similar to that of another recipe. We refer to such recipes as "near-duplicate recipes". In this study, we propose a method that extends the Word Mover's Distance, which computes the distance between texts from word embeddings, to character 3-gram embeddings. Using a corpus of over 1.21 million recipes, we learned word embeddings and character 3-gram embeddings with a Skip-Gram model with negative sampling and with fastText, and used them to extract candidate pairs of near-duplicate recipes. We then annotated these candidates and evaluated the proposed method against a comparison method. The results show that the proposed method detected near-duplicate recipes that the comparison method missed.
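As a rough illustration of the idea, and not the authors' implementation, the sketch below splits each text into overlapping character 3-grams, weights them by normalized frequency, and compares two texts with the relaxed form of the mover's distance, which is a lower bound on the exact optimal-transport value rather than the exact distance itself. All function names are assumptions made for this example, and `toy_embedding` is a hypothetical hash-based stand-in for the embeddings that the study learns from the recipe corpus.

```python
import hashlib
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def toy_embedding(gram, dim=8):
    """Hypothetical stand-in for a learned embedding (assumption):
    the first `dim` bytes of an MD5 digest, scaled to [0, 1]."""
    digest = hashlib.md5(gram.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def nbow(grams):
    """Normalized bag-of-n-grams weights, which sum to 1."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, embed):
    """Move all the weight of each source n-gram to its nearest target n-gram."""
    return sum(
        w * min(euclidean(embed(g), embed(h)) for h in dst)
        for g, w in src.items()
    )


def relaxed_mover_distance(text_a, text_b, n=3, embed=toy_embedding):
    """Relaxed mover's distance over character n-grams (a lower bound
    on the exact transport cost): the larger of the two one-sided costs."""
    a = nbow(char_ngrams(text_a, n))
    b = nbow(char_ngrams(text_b, n))
    if not a or not b:
        raise ValueError("both texts must contain at least n characters")
    return max(one_sided_cost(a, b, embed), one_sided_cost(b, a, embed))


if __name__ == "__main__":
    # Two short, made-up instruction fragments that differ in one word.
    print(char_ngrams("玉ねぎを切る"))
    print(relaxed_mover_distance("玉ねぎを薄く切る", "玉ねぎを細かく切る"))
```

With learned embeddings in place of `toy_embedding`, 3-grams that occur in similar contexts would lie close together, so texts that differ only in small surface variations would receive a small distance. Character n-grams also avoid the word segmentation step that word-level methods require for Japanese text.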