CINE study makes machine learning more accurate in predicting material properties
Machine learning programs differ from conventional software in their ability to learn from experience, that is, from interaction with a set of data. The greater the experience, the better these programs, or models, should perform the task for which they were created. However, they do not always work perfectly. Errors do happen, and being able to detect and resolve them is essential.
A team of researchers from CINE analyzed the errors of a machine learning model created to predict physicochemical properties of a group of materials. “The results we presented may make the use of machine learning methods in Materials Science more assertive and less costly”, says Luis Cesar de Azevedo, one of the authors of the article reporting the study.
In fact, there is growing interest in using machine learning tools to find materials or molecules with desired properties, so that they can efficiently fulfill certain functions in devices or systems. In CINE’s Computational Materials Science and Chemistry (CMSC) program, machine learning studies address the need to develop or find efficient materials for energy generation and storage.
To explore the practically infinite set of possible molecules, experimental methods are impractical and traditional computational methods are not enough, as they are relatively time-consuming and therefore expensive. To give an idea, while simulating a single molecule with a conventional method such as Density Functional Theory can take a few days, analyzing tens of thousands of compounds with machine learning can take a few seconds.
To do this, it is necessary to develop an algorithm (a set of computational instructions) and to use a database previously built by the scientific community with experimental or theoretical methods, such as the one used in the CINE study, which gathers data on more than 133 thousand molecules. The algorithm is then trained by interacting with the data and recognizing patterns. The result is a model able to predict the properties of materials and molecules that were not included in the initial database.
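The train-then-predict workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the database is synthetic, the three descriptors and the property formula are made up, and the nearest-neighbour model is a stand-in chosen for brevity, not the algorithm used in the CINE study.

```python
import math
import random

def make_toy_database(n, seed=0):
    """Synthetic stand-in for a molecular database: (descriptors, property) pairs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]  # hypothetical descriptors
        # Made-up "property" with a little noise; not real chemistry.
        y = 2.0 * x[0] - x[1] ** 2 + 0.5 * x[2] + rng.gauss(0.0, 0.05)
        data.append((x, y))
    return data

def knn_predict(train, x, k=5):
    """Predict a property as the mean over the k most similar training molecules."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

database = make_toy_database(500)
train, held_out = database[:400], database[400:]

# Mean absolute error on molecules the model never saw during training.
mae = sum(abs(knn_predict(train, x) - y) for x, y in held_out) / len(held_out)
print(f"held-out MAE: {mae:.3f}")
```

Holding out part of the database, as done here, is what allows the prediction error to be measured on molecules that were not part of the training data.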
“Although there are models with a high average accuracy in some domains, these models can make discrepant errors (outliers) for some molecules”, explains Luis Cesar, who is a member of the CMSC at CINE. “This work demonstrated that a detailed view of the error, decomposing it into systematic (bias) and random (variance) errors, can show specific characteristics of the prediction performance”, he adds. The work also identified that most of these inaccuracies happen with planar molecules (those that have wider angles and greater distance between their atoms). Fortunately, the article showed that it is possible to reduce errors by using a combination of machine learning models (an ensemble) to predict material properties. Furthermore, according to the authors, data and descriptors (the computational values used to describe the molecules in the database) need to be selected more carefully when preparing to train the algorithm.
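To make the idea of decomposing the error more concrete, the sketch below estimates bias and variance by training many models on bootstrap resamples of a toy data set, then compares single models with their averaged (ensemble) prediction. It is a minimal illustration, not the paper's procedure: the data, the single descriptor, the target function and the nearest-neighbour model are all assumptions made for the example.

```python
import math
import random
from statistics import fmean, pvariance

rng = random.Random(0)

def true_property(x):
    # Made-up target standing in for a molecular property.
    return math.sin(3.0 * x)

def sample_dataset(n):
    # Noisy toy "database" with a single hypothetical descriptor x.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_property(x) + rng.gauss(0.0, 0.1)) for x in xs]

def knn_predict(train, x, k=3):
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return fmean(y for _, y in nearest)

train = sample_dataset(200)
test = sample_dataset(50)

# Train 30 models, each on a bootstrap resample of the training data.
n_models = 30
predictions = []  # predictions[m][i]: model m on test molecule i
for _ in range(n_models):
    resample = rng.choices(train, k=len(train))
    predictions.append([knn_predict(resample, x) for x, _ in test])

bias_sq, variance, mse = [], [], []
for i, (_, y) in enumerate(test):
    preds = [p[i] for p in predictions]
    mean_pred = fmean(preds)              # also the ensemble prediction
    # Systematic part, measured against the noisy reference value,
    # so it also absorbs the noise in y.
    bias_sq.append((mean_pred - y) ** 2)
    variance.append(pvariance(preds))     # random part: spread across models
    mse.append(fmean((p - y) ** 2 for p in preds))

print(f"mean squared error of single models: {fmean(mse):.4f}")
print(f"  bias^2   : {fmean(bias_sq):.4f}")
print(f"  variance : {fmean(variance):.4f}")
print(f"ensemble squared error: {fmean(bias_sq):.4f}")
```

For each test point, the average squared error of the individual models equals the squared bias plus the variance. Averaging the models removes the variance term, so in this setup the ensemble's squared error can never exceed the average error of its members.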
This study is part of the Master’s in Computer Science that Luis Cesar is pursuing at UFABC under the guidance of Professor Ronaldo C. Prati (UFABC). The work had the collaboration of other members of CINE: professors Juarez L. F. Da Silva (IQSC-USP) and Marcos G. Quiles (UNIFESP), and doctoral candidate Gabriel A. Pinheiro (UNIFESP).
Paper reference: Systematic Investigation of Error Distribution in Machine Learning Algorithms Applied to the Quantum-Chemistry QM9 Data Set Using the Bias and Variance Decomposition. Luis Cesar de Azevedo, Gabriel A. Pinheiro, Marcos G. Quiles, Juarez L. F. Da Silva, and Ronaldo C. Prati. J. Chem. Inf. Model. 2021. https://doi.org/10.1021/acs.jcim.1c00503.
Authors of the paper who are members of CINE: Luis Cesar de Azevedo (master’s student at UFABC), Gabriel A. Pinheiro (doctoral student at UNIFESP), Marcos G. Quiles (professor at UNIFESP), Juarez L. F. Da Silva (professor at IQSC-USP) and Ronaldo C. Prati (professor at UFABC).