Quantitative structure–retention relationship model for predicting retention indices of constituents of essential oils of Thymus vulgaris (Lamiaceae)

In this paper, a quantitative structure–retention relationship (QSRR) model was developed for predicting the retention indices (log RI) of 36 constituents of essential oils. First, the chemical structure of each compound was sketched using HyperChem software. Then, molecular descriptors covering different information of molecular structures were calculated by Dragon software. The results illustrated that linear techniques, such as multiple linear regression (MLR), combined with a successful variable selection procedure are capable of generating an efficient QSRR model for predicting the retention indices of different compounds. This model, with high statistical significance (R2 = 0.9781, QLOO = 0.9691, Qext = 0.9546, QL(5)O = 0.9667, F = 245.27), could be used adequately for the prediction and description of the retention indices of other essential oil compounds. The reliability of the proposed model was further illustrated using various evaluation techniques: leave-5-out cross-validation, bootstrap, randomization test and validation through the test set.


INTRODUCTION
Essential oils, a new approach to prevent the proliferation of microorganism or protection of food from oxidation, are ubiquitously used as antibacterial, 1-3 antifungal 3,4 and antioxidant agents. 5They are also used to control human diseases of microbial origin and to cure diseases such as atherosclerosis and cancer 6 and are widely used in the food industry, medicine, and the fragrance industry.However, essential oils may exert toxic effects, such as human carcinogenicity, reproductive and developmental toxicity, neurotoxicity, and acute toxicity.The constituents present in essential oils have the potential to PREDICTION OF THE RETENTION INDICES 407 Quantitative and qualitative characterizations of the essential oils and volatiles from different parts of Artemisia tschernieviana plant were the main objectives of the works reported by Zanousi et al. 17 Additionally, a straightforward model was developed to model the RIs of diverse natural compounds present in the obtained essential oils and volatile fractions based on a simple stepwise multiple linear regression (SW-MLR) approach.
A robust screening approach and a sparse QSRR model for predicting the RIs of 169 constituents of essential oils, obtained from Conforti et al., 18 was reported by Al-Fakih et al. 19 The proposed method consists of two basic steps.The first step is dimension reduction, in which data are reduced from high dimensional space to a lower d-dimensional descriptor space.The second step is prediction, in which the response variable is predicted using a sparse method on the screened descriptors.
In the particular case of essential oils, despite the vast literature on the identification and characterization of novel constituent compounds, and apart from the study presented by Marrero-Ponce et al. 20 involving a database of 791 essential oil components with corresponding gas chromatographic retention properties, which seems to be an exception, the majority of QSRR models have generally been built on data sets of sizes ranging from 25 to 169 compounds, with most of these belonging to a single chemical series.
The aim of the present study was to develop a QSRR model for RI prediction of 36 essential oil components, representing chemical variation of leaf essential oil, at different stages of Thymus vulgaris (Lamiaceae) growth of Iranian plants.

Experimental data
The essential oil composition was analyzed using a Shimadzu QP 5050 GC/MS with DB-5 capillary column. 21n-Alkanes mixtures were analyzed under the GC/MS temperature condition program to calculate the Kovats RIs of the thirty-six identified components in the oil and are listed in Table I.The data are presented as the logarithm of RI to reduce the range of variation.

Descriptors generation
The chemical structure of each compound was sketched on a PC using the HyperChem program 22 and pre-optimized using the MM + molecular mechanics method (Polack-Ribiere algorithm).The final geometries of the minimum energy conformation were obtained by the semi empirical PM3 method at the restricted Hartree-Fock level with no configuration interaction applying a gradient norm limit of 0.01 kcal* Ǻ -1 mol -1 as a stopping criterion.
The resulting geometries were used as the input for the generation of 74 three-dimensional-geometrical descriptors using the Dragon software (version 6). 23These are the molecular descriptors defined in several different ways but always derived from the three--dimension structure of the molecules.

Training set and test set
The present challenge in the process of the development of a QSAR model lies in the development of a model with the capability to accurately predict the activity of new chemicals.
The most effective method of validating a regression model with respect to its prediction performance is to collect fresh data and directly compare the model predictions against them.When this is not possible, a reasonable procedure is to split the available data into two parts: a ________________________________________________________________________________________________________________________ Available on line at www.shd.org.rs/JSCS/(CC) 2019 SCS.

409
training set from which the model is built and an external set on which to evaluate its predictive power, the later should contain between 15 and 40 % of the compounds in the full data set. 24Several procedures can be adopted for the selection of the training and test sets.In this work, the Kennard and Stone algorithm (CADEX) 25 was used to arbitrarily split the whole data into 27 samples training set and 9 samples testing set (Table I).

MLR Modeling
The general purpose of multiple regressions is to quantify the relationship between several independent or predictor variables and a dependent variable.A set of coefficients defines the simple linear combination of independent variables (molecular descriptors) that best describes the retention indices.The value of the retention indices for each essential oil component would then be calculated as a composite of each molecular descriptor weighted by the respective coefficients.A multi-linear regression model with k regressors could be represented as: The term linear is used because Eq. ( 1) is a linear function of the unknown parameters β j , j = 0,1,…,k, called the regression coefficients.The parameter β j represents the expected change in the response y per unit change in x j when all of the remaining regressor variables x i (i ≠ j) are held constant.
It is more convenient to deal with multiple regression models if they are expressed in matrix notation.This allows a very compact display of the model data and results.In matrix notation, the model given by Eq. ( 1) is: In general, y is an n×1 vector of the observations, X is an n × p matrix of the levels of the regressor variables, β is a p×1 vector of the regression coefficients, and ε is an n×1 vector of random errors.When X is of full rank, the least-squares solution is b = (X T X) -1 X T y, where b is the estimator for the regression coefficients in b and X T is a transpose of X.
The vector of the fitted values ŷ i corresponding to the observed values y i is: The n×n matrix H = X(X T X) -1 X T is usually called the "hat" matrix because it maps the vector of the observed values into a vector of fitted values.The diagonal elements are often called the leverages, since examination of: indicates via h ii how heavily y i contributes to ŷ i .

Chemometric methods
Once the molecular descriptors are generated, multiple linear analysis regression and variable subset selection were performed by the software MobyDigs 26 using the ordinary least square (OLS) regression method.
From a statistical viewpoint, the ratio of the number of samples (n) to the number of descriptors (m) should not be too low.Usually, it is recommended that n/m ≥ 5. 27 First, models with 1-2 variables were developed by the all-subset-method procedure in order to explore all the low dimension combinations.The number of descriptors was sub-sequently increased one by one, thereby forming new models.The best models were selected at each rank, and the final model must be chosen from among them.
Population of 100 regression models were ranked according to their decreasing internal predictive performance, verified by Q 2 , and the optimal model was then selected.The goodness-of-fit of the calculated model was assessed by means of the multiple determination coefficient, R 2 , and the standard deviation error in calculation (SDEC): Cross-validation techniques allow an assessment of internal predictivity (Q 2 LMO cross--validation; bootstrap) in addition to the robustness of model (Q 2 LOO cross-validation).
Cross-validation methods consist in leaving out a given number of compounds from the training set and rebuilding the model, which is then used to predict the compounds left out.This procedure is repeated for all compounds of the training set, thereby obtaining a prediction for each one.If each compound is taken away one at a time, the cross-validation procedure is called the leave-one-out technique (LOO technique), otherwise it is a leave-many-out technique (LMO technique).An LOO or LMO correlation coefficient, generally indicated with Q 2 , is computed by evaluating the accuracy of these "test" compounds prediction.
The "hat" of the variable y, as is the usual statistical notation, indicates that it is a predicted value of the studied property and the sub index "i/i" indicates that the predicted values come from models built without the predicted compound.TSS is the total sum of squares.
The predictive residual sum of squares (PRESS) measures the dispersion of the predicted values.It is used to define Q 2 and the standard deviation error in prediction (SDEP): A value Q 2 > 0.5 is generally regarded as a good result and Q 2 > 0.9 as excellent. 28,29However, studies 30,31 have indicated that while Q 2 is a necessary condition for high predictive power of a model, it is not sufficient.To avoid overestimating the predictive power of the model, the LMO procedure (repeated 5000 times, with 5 objects left out at each step) was also performed ( 2 L(5)O

Q
).In the bootstrap validation technique, K n-dimensional groups are generated by a randomly repeated selection of n objects from the original data set.The model obtained on the first selected objects is used to predict the values for the excluded sample, and then Q 2 is calculated for each model.The bootstrapping was repeated 8000 times for each validated model. 32By using the selected model, the values of the response for the test objects were calculated and the quality of these predictions is defined in terms of Q where n ext and n tr are the number of objects in the external set (or left out by bootstrap) and the number of training set objects, respectively.Another useful parameter is the external standard deviation error of prediction (SDEP ext ), defined as: where the sum runs over the test set objects (n ext ).

Applicability domain analysis
The applicability domain (AD) 29,33 is a theoretical region in space defined by the descriptors of the model and the modeled response, for which a given QSRR should make reliable predictions.In this work, the structural AD was verified by the leverage approach.The leverage, h ii , 32 is defined as follows: (10)   where x i is the descriptor row vector of the i th compound, x T i is the transpose of x i , X is the model matrix derived from the calibration set descriptor values.
The warning leverage, h* is, generally, fixed at 3(m+1)/n, where n is the total number of samples in the training set and m is the number of descriptors involved in the correlation.

RESULTS AND DISCUSSION
A major step in the construction of QSRR model is finding a set of molecular descriptors that represent the variation in the structural properties of the molecules.The modeling and prediction of the physicochemical properties of organic compounds is an important objective in many scientific fields 13 .MLR is one of the most modeling methods in QSRR.MLR method provides equation linking the structural features to the (log RI) of the compounds.
.9592; SDEC = 0.011; SDEP = 0.013; SDEP ext = 0.016; s = 0.013 log unit; F = 245.27(p = 0.000) From the above equation, it could be concluded that the most significant descriptors (Table S-I of the Supplementary material) according to the MLR method are molecular multiple path count of order 02 (piPC02), average Randic-like index from the Burden matrix weighted by polarizability (ChiA_B(p)), spectral moment of order 2 from the Burden matrix weighted by I-state (SM2_B(s)) and 3D-MoRSE -signal 15/unweighted (Mor15u).A detailed description of the linear model based on compounds in the training set is summarized in Table II.The multi-collinearities between the above four descriptors were detected by their variation inflation factors (VIF), which can be calculated as follows: where r is the correlation coefficient of the multiple regression between the variables in the model.If VIF equals 1, then no inter-correlation exists for each variable; if VIF falls into the range of 1-5, the related model is acceptable and if VIF is larger than 10, the related model is unstable and a recheck is necessary.
The corresponding VIF values of the four descriptors are given in Table II.As can be seen from this table, the variables had VIF values of less than 5, indicating that the obtained model has statistical significance. 8he results for the randomized models can be compared with the real starting one only by representation in a plot of the statistical coefficients R 2 and Q 2 .This is depicted in Fig. 1.The statistics for the modified log RI vectors are clearly lower than the real QSRR model.This ensures that a real structureproperty relationship has been found.
The value of R 2 = 0.9781 attests the good fitting performances of the model.In general, the larger the magnitude of the F ratio, the better the model predicts the property values in the training set.The large F ratio of 245.27 indicates that the model performs excellently in predicting the log RI values.The model is robust, the difference between R 2 and Q 2 is small (< 0.5 %).
Plots of cross-validation of log RI values vs. experimental log RI values are shown in Fig. 2. Obviously, there is a close agreement between the experimental and predicted log RI values and the data present very low scattering around a straight line with respective slope and intercept values close to one and zero, respectively.SDEP is similar to SDEC and hence, this model has internal predictivity not very dissimilar from the fitting power.The model demonstrates a very good stability in internal validation (the difference between Q 2 and Q 2 L(5)O is 0.0024 %), while bootstrapping confirms the internal predictivity and stability of the model.
The information obtained by Q 2 ext is somewhat optimistic.In fact, with small data sets (20-40 compounds), completely new compound external predictivity could only be verified a posteriori, case-by-case.
The AD of the linear model was analyzed in a Williams plot (Fig. 3), the plot of standardized residuals (e istd ) vs. leverage values.Table I (column 7) shows samples with absolute standardized residual values smaller than 3 standard deviation units, and so, no Y-outlier was detected.

CONCLUSIONS
A QSRR study was performed for predicting the gas chromatographic RI of 36 essential oil compounds on a DB-5 stationary phase with low polarity, in which case the chromatographic retention is governed by the geometry and the size of the solute molecules. 34Hence, only geometrical descriptors were considered as explicative variables.
The optimal model is a multivariate linear model, which has four variables, selected using the MLR technique.A detailed validation procedure demonstrated the accuracy and robustness of the proposed model not only by calculating its fitness on the training set, but also by testing its predictive ability.The QSRR model with simply calculated molecular descriptors could be employed to estimate the retention index for new compounds.

Fig. 3 .
Fig. 3.The Williams plot of the data.

TABLE I
a Test compounds 2ext , which is defined as: ________________________________________________________________________________________________________________________ Available on line at www.shd.org.rs/JSCS/(CC) 2019 SCS.

TABLE II .
Selected descriptors of multiple linear regression