Modeling framework

An essential part of the OCHEM platform is the modeling framework. Its main purpose is to provide facilities for developing predictive computational models of the physicochemical and biological properties of compounds. The framework is integrated with the database of experimental data and covers all the steps required to build a computational model: data preparation, calculation and filtering of molecular descriptors, application of machine learning methods, and analysis of model performance. This section gives an overview of these features and of the steps required to build a computational model in OCHEM.

The OCHEM modeling framework supports the full cycle of QSAR model development, which includes:

Management of datasets with experimental data.
Users can create and manage reusable datasets referred to as baskets.
Calculation of molecular descriptors.
OCHEM supports more than 20 types of state-of-the-art molecular descriptors from different third-party vendors.
Running a machine learning method
Validation of the model using a proper protocol
Calculation of model statistics
Application of the model to new compounds
Recalculation of the model based on new experimental evidence
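
The validation step in this cycle can be illustrated with a minimal sketch of N-fold cross-validation, one of the protocols OCHEM supports. This is not OCHEM code: the function names (`n_fold_splits`, `cross_validated_predictions`, `fit_mean`, `predict_mean`) are hypothetical, and the "model" is a trivial mean predictor chosen only to keep the example self-contained. The point it shows is that every compound is predicted by a model that never saw it during training.

```python
import random


def n_fold_splits(n_items, n_folds, seed=0):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    if not 2 <= n_folds <= n_items:
        raise ValueError("n_folds must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold k takes every n_folds-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        validation = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, validation


def cross_validated_predictions(x, y, fit, predict, n_folds=5, seed=0):
    """Return one out-of-fold prediction per compound."""
    predictions = [None] * len(x)
    for train, validation in n_fold_splits(len(x), n_folds, seed):
        # The model is trained only on compounds outside the current fold.
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in validation:
            predictions[i] = predict(model, x[i])
    return predictions


# Placeholder "learning method" (an assumption for illustration):
# it ignores the descriptors and predicts the training-set mean.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_item):
    return model
```

Any learning method can be substituted by passing a different `fit`/`predict` pair. Statistics computed from the out-of-fold predictions then estimate performance on compounds outside the training set.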

In brief, the main features of the OCHEM modeling framework are:

Support of regression and classification models
Calculation of various molecular descriptors ranging from molecular fragments to quantum chemical descriptors. Both whole-molecule and per-atom descriptors are supported.
Tracking of each compound from the training and validation sets
Basic and detailed model statistics and evaluation of model performance on training and validation sets
Assessment of the applicability domain of models and of their prediction accuracy
Pre-filtering of descriptors: manual (external) selection, de-correlation filter, Unsupervised Forward Selection (UFS)
Various machine learning methods including both linear and non-linear approaches
N-fold cross-validation and bagging validation of models
Multi-learning: models can predict several properties simultaneously
Combining data measured under different conditions and data reported in different measurement units
Distribution of calculations to an internal cluster of Linux and Mac computers
Scalability and extensibility to new descriptors and machine learning methods
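
As an illustration of descriptor pre-filtering, the following is a minimal sketch of a de-correlation filter. It is written under stated assumptions and does not reproduce OCHEM's implementation: the greedy keep-first strategy, the 0.95 threshold, and the names `pearson` and `decorrelate` are all hypothetical choices made for the example. A descriptor is kept only if its absolute Pearson correlation with every previously kept descriptor does not exceed the threshold.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / math.sqrt(var_a * var_b)


def decorrelate(descriptors, threshold=0.95):
    """Greedy filter over a dict {descriptor name: list of values}.

    Returns the names of the kept descriptors in input order.
    Constant descriptors are not removed here; that would be a separate filter.
    """
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table for four compounds.
table = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],  # exactly 2 * a, so it is redundant with "a"
    "c": [4, 1, 3, 2],
}
kept_names = decorrelate(table)
```

Because the filter is greedy, the result depends on the order of the descriptors: of two highly correlated descriptors, the one listed first is retained.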

The steps of a typical QSAR study in OCHEM and the corresponding features are summarized in the following figure: