Sanskrit Sandhi Splitting using seq2(seq)^2
In Sanskrit, small words (morphemes) are combined through a morphophonological process called Sandhi to form compound words. Sandhi splitting is the process of splitting a given compound word into its constituent morphemes. Although rules governing the splitting of words exist, it is highly challenging to identify the location of the split(s) in a compound word, as the same compound word might be broken down in multiple ways to provide syntactically correct splits. Existing systems incorporate these pre-defined splitting rules, but have low accuracy since they don't address the fundamental problem of identifying the split location. In this work, we propose a novel Double Decoder RNN (DD-RNN) architecture which i) predicts the location of the split(s) with an accuracy of 95% and ii) predicts the constituent words (i.e., learns the Sandhi splitting rules) with an accuracy of 79.5%. To the best of our knowledge, deep learning techniques have never been applied to the Sandhi splitting problem before. We further demonstrate that our model significantly outperforms the previous state of the art.