Comprehensive, explainable, ML-based molecular toxicity and protein-ligand binding predictions

Sharma, Bhanushee
Issue Date
Electronic thesis
Chemical engineering
Machine learning (ML) and computer-aided drug design are widely regarded as ways to accelerate and focus the time-consuming and costly drug discovery process. In this work, we used ML tools to improve early-stage drug discovery by creating a comprehensive and explainable framework for predicting the clinical toxicity of molecules and characterizing the structural basis of protein-ligand interactions. Explainable machine learning for molecular toxicity prediction is a promising approach for efficient drug development and chemical safety. A predictive ML model of toxicity can reduce experimental cost and time while mitigating ethical concerns by significantly reducing animal and clinical testing. Herein, we used a deep learning framework for simultaneously modeling in vitro, in vivo, and clinical toxicity data. Two different molecular input representations were used: Morgan fingerprints and pre-trained SMILES embeddings. A multi-task deep learning model accurately predicted toxicity for all endpoints, including clinical, as indicated by the area under the receiver operating characteristic curve and balanced accuracy. In particular, pre-trained molecular SMILES embeddings as input to the multi-task model improved clinical toxicity predictions compared to existing models in the MoleculeNet benchmark. Additionally, our multi-task approach is comprehensive in the sense that it is comparable to state-of-the-art approaches for specific endpoints in in vitro, in vivo, and clinical platforms. Through both the multi-task model and transfer learning, we were able to show that clinical toxicity predictions have minimal need for in vivo data. To provide confidence and explain the model's predictions, we adapted a post-hoc contrastive explanation method that returns pertinent positive and negative features, which correspond well to known mutagenic and reactive toxicophores, such as unsubstituted bonded heteroatoms, aromatic amines, and Michael acceptors.
Furthermore, toxicophore recovery by pertinent feature analysis is higher for the in vitro (53%) and in vivo (56%) endpoints than for the clinical (8%) endpoints, uncovering a preference in known toxicophore data towards in vitro and in vivo experimental data. To our knowledge, this is the first contrastive explanation, using both present and absent substructures, for predictions of clinical and in vivo molecular toxicity. Another nontrivial aspect of the drug discovery process is understanding and characterizing the structural basis of protein-ligand interactions, which is crucial for developing de novo therapeutics for a protein target. Traditionally, structure-based methods have made tremendous progress over the years, focusing on docking of ligand-protein complexes and adopting classical force-field, empirical, or knowledge-based approaches. However, these methods rely on the availability of the 3D structure of the given target, and are time-consuming and computationally expensive. Machine learning methods have more recently been applied to predict binding affinity by using existing experimental or computational data. Though these "black-box" models often produce high-accuracy predictions, they do not provide a human-understandable structural reason for a given prediction. In this work, we created a novel framework to explain the predicted binding of a molecule by a black-box deep neural network model. The framework builds upon the contrastive explanations method (CEM) and provides explanations that go beyond mere correlations by identifying minimally sufficient substructures (pertinent positives) that recover the prediction of the black box, as well as minimal additions (pertinent negatives) that would be necessary to alter its decision.
We applied our framework to explain the binding of computationally generated drug-like small molecules, designed by a deep learning model, to three SARS-CoV-2 therapeutic targets: the non-structural protein 9 replicase (NSP9), the main protease (MPRO), and the receptor binding domain of the spike protein (chimeric RBD). To the best of our knowledge, this application is novel. Pertinent substructures obtained from 1D molecular representations were in agreement with 2D and 3D docking interactions and known pharmacophores. We believe this approach will provide confidence and molecular understanding to high-performing ML models predicting protein-ligand interactions, while indicating pertinent substructures to add or avoid in designing ligands. Thus, in this work we have leveraged ML-based approaches to help accelerate different aspects of the drug discovery process. We have provided improved and explainable predictions of the clinical toxicity of molecules, and a computationally inexpensive, explainable approach to predicting the structural basis of protein-ligand binding.
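The pertinent positive and pertinent negative idea described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: CEM proper solves a regularized optimization problem against a trained neural network, whereas the code below swaps in a brute-force search over feature subsets against a toy linear classifier. The feature names, weights, threshold, and function names are all hypothetical and chosen only to make the two concepts concrete.

```python
from itertools import combinations

# Hypothetical weights for a toy linear "black box"; these values are
# invented for illustration and do not come from the thesis.
TOY_WEIGHTS = {
    "aromatic_amine": 2.0,
    "michael_acceptor": 1.5,
    "nitro_group": 1.0,
    "hydroxyl": -0.5,
    "carboxylic_acid": -1.0,
}
THRESHOLD = 1.75  # hypothetical decision threshold


def predict_toxic(features):
    """Toy classifier: True if the summed weights of present features exceed the threshold."""
    return sum(TOY_WEIGHTS[f] for f in features) > THRESHOLD


def pertinent_positive(present, predict):
    """Return a smallest subset of the present features that alone reproduces the prediction."""
    target = predict(present)
    for size in range(len(present) + 1):
        for subset in combinations(sorted(present), size):
            if predict(frozenset(subset)) == target:
                return frozenset(subset)
    return frozenset(present)


def pertinent_negative(present, vocabulary, predict):
    """Return a smallest set of absent features whose addition flips the prediction, or None."""
    target = predict(present)
    absent = sorted(set(vocabulary) - set(present))
    for size in range(1, len(absent) + 1):
        for extra in combinations(absent, size):
            if predict(frozenset(present) | frozenset(extra)) != target:
                return frozenset(extra)
    return None


if __name__ == "__main__":
    molecule = frozenset({"aromatic_amine", "hydroxyl", "nitro_group"})
    print("predicted toxic:", predict_toxic(molecule))
    print("pertinent positive:", sorted(pertinent_positive(molecule, predict_toxic)))
    print("pertinent negative:", sorted(pertinent_negative(molecule, TOY_WEIGHTS, predict_toxic)))
```

On this toy input, the aromatic amine alone is enough to keep the "toxic" prediction (the pertinent positive), and adding a carboxylic acid is the smallest change that flips it (the pertinent negative). Exhaustive enumeration is only feasible for a handful of features, which is one reason an optimization-based method is needed for real molecular representations.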
August 2022
School of Engineering
Rensselaer Polytechnic Institute, Troy, NY