Project Details

Exploring Chemical Compound Space with Machine Learning

Subject Area: Theoretical Computer Science; Theoretical Chemistry: Electronic Structure, Dynamics, Simulation
Term: 2014 to 2017
Project identifier: Deutsche Forschungsgemeinschaft (DFG), project number 253375148

Final Report Year: 2019

Final Report Abstract

The general objective of this project was to enable rational exploration of chemical compound space (CCS) by developing efficient and accurate machine learning (ML) models. The first important step was to assess the capabilities and limitations of ML techniques for the accurate prediction of molecular energies in CCS. The ML models were trained on a large set of reference energies computed with hybrid density-functional theory (DFT) including van der Waals interactions, as well as with the quantum-chemical "gold standard" CCSD(T) method, which provides the most accurate reference that is still computationally feasible. As originally planned, the project developed in four intertwined directions: generation of a large set of reference molecular energies; development of a ladder of physical models (descriptors) for organic molecules, from classical charge repulsion to approximate electronic models, for use as input to the ML model; application and analysis of efficient ML models (kernel-based learning such as support vector machines and Gaussian processes, neural networks, etc.); and physical analysis (exploration) of the chemical compound space using optimal ML models. This analysis was carried out both from the point of view of computational complexity (dimensionality, sparsity, etc.) and in terms of the underlying chemistry (for example, one question was whether classes of molecules can be identified in CCS). Our work has led to a number of developments in data-driven representations of physical systems, advances in incorporating prior knowledge of the application domain, a hierarchy of molecular and material descriptors, and a consolidated understanding of the demands placed on statistical models in atomistic simulations. A novel and challenging aspect was that we allowed variations both in chemical composition and in configurational degrees of freedom (bonding and geometry). This required extending traditional ML models and developing novel, scalable, and physically meaningful representations, which in turn called for a joint effort between physics, chemistry, and computer science. In addition to modeling atomic interactions, we also aimed to foster the understanding of ML-based potentials by developing interpretable models. Our analysis revealed that the most effective statistical inference methods are able to recover and exercise chemical concepts in a fully data-driven way.
