Dataset profile
The two models available at OCHEM predict the melting point (MP) of organic chemical compounds. The MP is an important physico-chemical property that is frequently used in drug discovery to estimate the aqueous solubility of chemical compounds. The complexity of predicting this property is connected to the purity of compounds, the existence of polymorphic forms, degradation of compounds before melting, etc. All these factors influence the quality of models for this property. The data for MP were collected in the OCHEM database and were also provided by the ChemExper database (OCHEM dataset) and Enamine Ltd (Enamine dataset). The majority of the data were organic chemistry compounds. The models were validated using 277 compounds compiled by [Bergstrom et al Molecular descriptors influencing melting point and their role in classification of solid drugs. J. Chem. Inf. Comput. Sci. 2003; 43 (4) 1177-85] as well as data from Open Notebook.
The logP/logS model predicts the octanol/water partition coefficient (logP) and solubility in water (logS). Both parameters are important for drug discovery. The model is a further development of the ALOGPS 2.1 program [Tetko, I. V.; Tanchuk, V. Y. Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program, J. Chem. Inf. Comput. Sci., 2002, 42, 1136-45 and Tetko, et al Estimation of aqueous solubility of chemical compounds using E-state indices, J. Chem. Inf. Comput. Sci., 2001, 41, 1488-93], which is available at the Virtual Computational Chemistry Laboratory (VCCLAB) site. This program was assessed in several benchmarking studies and was top-ranked for prediction of in-house Pfizer and Nycomed datasets [Mannhold et al, Calculation of molecular lipophilicity: State-of-the-art and comparison of log P methods on more than 96,000 compounds. J Pharm Sci. 2009 Mar;98(3):861-93. doi: 10.1002/jps.21494.].
The data for logP and logS were taken from these two previous publications and merged with those collected at the OCHEM web site. The training sets included 16647 and 6778 compounds for the logP and logS properties, respectively. Outliers were filtered from the data using the automatic p-value based filtering feature of OCHEM (article in preparation). Considering the high inter-dependency of both properties, they were modeled simultaneously, using the multi-learning feature of OCHEM [Varnek et al, Inductive transfer of knowledge: application of multi-task learning and feature net approaches to model tissue-air partition coefficients. J Chem Inf Model. 2009 Jan;49(1):133-44. doi: 10.1021/ci8002914], to increase the applicability domain of the models.
Data preprocessing
All chemical structures were processed using the OCHEM cleaning and standardization protocols. Special care was taken to eliminate salts, mixtures and inorganic compounds, which could dramatically change the MP of molecules. Outliers were detected and eliminated based on a p-value (article in preparation).
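The OCHEM cleaning protocol itself is not described here, so the following is only a rough sketch of one way to drop salts and mixtures. It assumes structures are stored as SMILES strings and relies on the SMILES convention that `.` separates disconnected components. The function name and the example records are hypothetical, and inorganic compounds would need a separate check.

```python
# Minimal sketch, NOT the OCHEM protocol: flag multi-component records
# (salts, mixtures) by the "." separator used in SMILES strings.

def is_multicomponent(smiles: str) -> bool:
    """Return True if the SMILES string contains a component separator."""
    return "." in smiles


# Hypothetical (SMILES, melting point in degrees C) records for illustration.
records = [
    ("CCO", -114.1),
    ("[Na+].[Cl-]", 801.0),   # salt: two disconnected components
    ("c1ccccc1", 5.5),
]

# Keep only single-component structures.
kept = [(smi, mp) for smi, mp in records if not is_multicomponent(smi)]
print(kept)
```

A production pipeline would more likely use a cheminformatics toolkit to split and inspect fragments; the string test above only illustrates the idea.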
Descriptors
Models were built using 11 individual descriptor packages available in OCHEM. A simple average of all 10 models was used to develop the consensus model. This model, however, requires rather long calculations, especially if the descriptors have not been previously cached. There is also a 2D model, "Melting Point best (Estate)", which was built using EState descriptors (electrotopological EState indices) calculated with the program developed by Dr. Tanchuk, which was also used to develop the ALOGPS 2.1 model. The other sub-models are not shown, to avoid cluttering the web browser with too many of them (they can be accessed using the public IDs indicated in the profile of the consensus model as https://ochem.eu/model/MODEID).
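The consensus step described above, a simple average of the individual model predictions, can be sketched as follows. The function name and the input layout (one list of predictions per model, aligned by compound) are assumptions made for illustration, and the numbers are made up.

```python
from statistics import fmean


def consensus_prediction(model_predictions):
    """Average aligned per-compound predictions from several models.

    model_predictions: list of lists, one inner list per model,
    all of the same length and in the same compound order (assumed layout).
    """
    n_compounds = len(model_predictions[0])
    if any(len(preds) != n_compounds for preds in model_predictions):
        raise ValueError("all models must predict the same compounds")
    # zip(*...) regroups the values so each tuple holds one compound's predictions.
    return [fmean(per_compound) for per_compound in zip(*model_predictions)]


# Three hypothetical sub-models predicting MP (degrees C) for two compounds.
predictions = [
    [100.0, 200.0],
    [110.0, 190.0],
    [120.0, 210.0],
]
print(consensus_prediction(predictions))
```

An unweighted mean treats every sub-model equally, which matches the "simple average" wording; weighted schemes are possible but are not described in this profile.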
Validation
The models were built using 5-fold cross-validation and were also validated by prediction of subsets (e.g., a model developed using the OCHEM subset was used to predict the Enamine, Bergstrom and Bradley sets, etc.).
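As an illustration of the 5-fold cross-validation protocol, the sketch below splits record indices into five folds and yields each fold once as the held-out set. The function name, the shuffling seed and the example size are assumptions; how OCHEM assigns records to folds is not stated in this profile.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 23 records.
for train, test in k_fold_indices(23, k=5):
    print(len(train), len(test))
```

Each record is predicted exactly once by a model that did not see it during training, so the pooled held-out predictions can be used to compute the cross-validated statistics.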
Statistical parameters
Prediction accuracy
The basic prediction accuracy parameters according to the 5-fold cross-validation procedure (N=25547) are:
Property / model | # records | RMSE | MAE | R2 | r2 (Coefficient of determination) |
---|---|---|---|---|---|
logP | 16912 | 0.42 | 0.30 | 0.9578 | 0.9578 |
logS | 8102 | 0.70 | 0.52 | 0.9075 | 0.9075 |
Consensus model (MP, °C) | | 37.1 | 27.6 | | |
Melting Point best (Estate) (MP, °C) | | 39.6 | 29.1 | | |
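For readers who want to reproduce statistics of this kind from their own predictions, here is a minimal sketch of RMSE, MAE and the coefficient of determination using their standard definitions. The function names and the example values are illustrative only, and the exact formulas OCHEM uses for its two R2 columns are not specified in this profile.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def coefficient_of_determination(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured and predicted melting points (degrees C).
measured = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 205.0]
print(rmse(measured, predicted), mae(measured, predicted),
      coefficient_of_determination(measured, predicted))
```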
Applicability domain
The prediction accuracy is estimated using ASNN-STD. This distance to model was shown to provide the best assessment of the accuracy of prediction, as described in [Tetko et al, Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection, J Chem Inf Model. 2008 Sep;48(9):1733-46. doi: 10.1021/ci800151m]. The prediction error for the drug-like subset (molecules with a melting point in the [50,250]°C interval) is less than 33°C for the consensus model.
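A distance to model of the STD type can be illustrated as the spread of predictions across the members of an ensemble: the more the members disagree on a compound, the less reliable its prediction is assumed to be. The sketch below assumes that interpretation and uses the sample standard deviation. The function name and the data are hypothetical, and the actual ASNN-STD computation in OCHEM may differ in detail.

```python
from statistics import stdev


def ensemble_std(member_predictions):
    """Per-compound standard deviation across ensemble member predictions.

    member_predictions: list of lists, one inner list per ensemble member,
    aligned by compound (assumed layout). Needs at least two members.
    """
    return [stdev(per_compound) for per_compound in zip(*member_predictions)]


# Three hypothetical ensemble members, two compounds (MP in degrees C).
members = [
    [100.0, 200.0],
    [110.0, 200.0],
    [120.0, 200.0],
]
distances = ensemble_std(members)
print(distances)  # a larger value suggests a less reliable prediction
```

In practice, such distances are calibrated against observed errors so that each distance range can be mapped to an expected accuracy.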
Reference
The full details of the study are published in "How accurately can we predict the melting points of drug-like compounds?" [Tetko IV, Sushko Y, Novotarskyi S, Patiny L, Kondratov I, Petrenko AE, Charochkina L, Asiri AM. J Chem Inf Model. 2014 Dec 22;54(12):3320-9. doi: 10.1021/ci5005288].
Availability
All data can be publicly downloaded at http://ochem.eu/article/55638.