Root-aligned SMILES for Molecular Retrosynthesis Prediction
Retrosynthesis prediction is a fundamental problem in organic synthesis, where the task is to discover precursor molecules that can be used to synthesize a target molecule. A popular paradigm of existing computational retrosynthesis methods formulate retrosynthesis prediction as a sequence-to-sequence translation problem, where the typical SMILES representations are adopted for both reactants and products. However, the general-purpose SMILES neglects the characteristics of retrosynthesis that 1) the search space of the reactants is quite huge, and 2) the molecular graph topology is largely unaltered from products to reactants, resulting in the suboptimal performance of SMILES if straightforwardly applied. In this article, we propose the root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES, to narrow the string representation discrepancy for more efficient retrosynthesis. As the minimum edit distance between the input and the output is significantly decreased with the proposed R-SMILES, the computational model is largely relieved from learning the complex syntax and dedicated to learning the chemical knowledge for retrosynthesis. We compare the proposed R-SMILES with various state-of-the-art baselines on different benchmarks and show that it significantly outperforms them all, demonstrating the superiority of the proposed method.
READ FULL TEXT