SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors

06/20/2019

∙

In materials science and related fields, small datasets (≪1000 samples) are common. Our ability to characterize compounds is therefore highly dependent on our theoretical and empirical knowledge, as our aptitude to develop efficient human-engineered descriptors, when it comes to use conventional machine learning algorithms to infer their physicochemical properties. Additionally, deep learning techniques are often neglected in this case due to the common acceptance that a lot of data samples are needed. In this article, we tackle the data scarcity of molecular compounds paired to their physicochemical properties, and the difficulty to develop novel task-specific descriptors, by proposing the SMILES-X. The SMILES-X is an autonomous pipeline for the characterization of molecular compounds based on a {Embed-Encode-Attend-Predict} neural architecture processing textual data, a data-specific Bayesian optimization of its hyper-parameters, and an augmentation of small datasets naturally coming from the non-canonical SMILES format. The SMILES-X shows new state-of-the-art results in the inference of aqueous solubility (RMSE_test≃ 0.57 ± 0.07 mols/L), hydration free energy (RMSE_test≃ 0.81 ± 0.22 kcal/mol, ∼ 24.5% better than from molecular dynamics simulations), and octanol/water distribution coefficient (RMSE_test≃ 0.59 ± 0.02 for LogD at pH 7.4) of molecular compounds. The SMILES-X is intended to become an important asset in the toolkit of materials scientists and chemists for autonomously characterizing molecular compounds, and for improving a task-specific knowledge through hereby proposed interpretations of the outcomes. The source code for the SMILES-X is available at https://github.com/GLambard/SMILES-Xgithub.com/GLambard/SMILES-X.

READ FULL TEXT

SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors

Molecular geometric deep learning

One Transformer Can Understand Both 2D 3D Molecular Data

FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction

Independent Vector Analysis for Data Fusion Prior to Molecular Property Prediction with Machine Learning

Bayesian Modeling of Spatial Molecular Profiling Data via Gaussian Process

Shift-Robust Molecular Relational Learning with Causal Substructure

SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors

Related Research

Molecular geometric deep learning

One Transformer Can Understand Both 2D 3D Molecular Data

FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction

Independent Vector Analysis for Data Fusion Prior to Molecular Property Prediction with Machine Learning

Bayesian Modeling of Spatial Molecular Profiling Data via Gaussian Process

Shift-Robust Molecular Relational Learning with Causal Substructure