BACKGROUND |
This section intends to introduce, in a concise way, the basic concepts required to apply GOLPE in real 3D-QSAR studies. Moreover we will introduce some new techniques included in this version that are not yet described elsewhere.
Once the User has red this section he should have no problem to follow the rest of the manual and to understand the meaning of the terms used in GOLPE. However, most of the techniques are not described in deep and we encourage the users to read some good texts. In particular, most users will find useful the bibliography section.
[Principal Component Analysis (PCA)][Partial Least Squares (PLS)][Variables Selection][Grouping of variables][Multiblock data][Bibliography]
1. Principal Components Analysis (PCA).
1.1. Theory.
Usually, in 3D-QSAR studies, the data file contains less than one hundred of objects and several thousands of X-variables. In fact, there are so many X-variables that by looking at them directly, no one can discover patterns, trends, clusters, etc... in the objects. The Principal Components Analysis (PCA) is a technique extremely useful to "summarize" all the information contained in the X-matrix and put it in a form understandable by human beings.
The PCA works by decomposing the X-matrix as the product of two smaller matrices:
X = TP' + E
The information not contained in these matrices remains as "unexplained X-variance" in a residual matrix (E) which has exactly the same dimensionality as the original X-matrix.
The PCs, among many others, have two interesting properties:
In PCA, the User can decide how many PCs should be extracted (the number of significant components, i.e. the dimensionality of the model). Each new PC extracted increases further the amount of information (variance) explained by the model. However, usually the first four of five PCs explain more than 90% of the X-variance. There is not a simple nor unique criterion to decide how many PC to extract and two kinds of considerations should be taken into account. From a theoretical point of view, it is possible to use cross-validation techniques to decide the number of PCs to include. From a practical point of view it does not matter to extract a large number of PCs if the User has no way to interpret the results.
1.2. Uses of PCA
PCA, in the context of 3D-QSAR is extremely useful for understanding the distribution of the objects, and knowing how different are one from another and why they are different. Moreover the PCA can be used to highlight the grid variables (locations around the molecules) which contain similar information, or the grid variables which contains completely independent information.
The best way to extract information from the PCA is to plot the matrices obtained:
2d and 3d scores plot
These plots represent the relative position of the objects in the space (two-dimensional or three-dimensional) of the principal components. They are useful to identify clusters of objects and single objects that behave in a peculiar way (outliers).
Moreover, the position of the objects in the plots may serve to interpret the PCs. The first PCs try to explain the maximum amount of variation and therefore, when there are clusters of objects, to distinguish among them. In this context, the PC can be interpreted as a compendium of the distinctive features of the objects in these clusters.
2d and 3d loading plots
These plots represents the original variables in the space (two-dimensional or three-dimensional) of the principal components.
Remember that the PC are obtained as linear combinations of the original X-variables. The loading of a single variable indicates how much this variable participates in defining the PC (the squares of the loadings indicate their percentage in the PC). Variables contributing very little to the PCs have small loading values and are plotted around the center of the plot. On the other hand, the variables which contribute most are plotted around the borders of the plot.
Loading plots allow to test the "homogeneity" of the contributions of the X-variables to the model. When there are more than one single field group of variables, it is useful to plot the loadings highlighting the variables belonging to one of them, in order to understand how each field contributes to the whole description of the objects.
Grid-plot of loadings
In the field of 3D-QSAR, the X-variables are obtained from the grid analysis of a set of molecules and therefore each variables defines a precise spatial position (grid location). Plotting the loadings in these positions allows the User to identify the areas in the space that contribute most to a certain PC. When the meaning of the PCs is understood, the grid plot highlights the areas around the molecules associated to this meaning. For instance, if a PC is known to distinguish between two kind of ligands, the grid-plot of the loadings of this PC shows the areas that contribute to the differences between the ligands.
2. Partial Least Squares (PLS)
2.1. Regression models.
In the previous section we have described PCA, a technique of multivariate analysis that deals only with the X-variables. The Partial Least Squares (PLS) analysis is a regression technique and its goal is to explain one or more dependent variables (Y's) in terms of a number of explanatory variables (predictors, X's).
Y = f(X) + E
It is possible to build many different models that fulfill the equation. Different methods produce models that "fit" the Y's more or less accurately. Among them, the best one will be able to calculate Y values that correspond to the experimental ones, even for molecules not included in building the model. These models are "predictive" and can be used to calculate reliable estimations of Y values for new molecules, prior to their availability.
It is important to notice that the Y's variables, like any other experimental variable, contain error. The models will try to fit the Y as much as possible and, if we try to improve too much the fitting, the model will explain also the noise!. This phenomenon, called "overfitting", is very dangerous, because overfitted models seem to be very good, but they often prove to be useless to predict the Y's of objects not included in the available data set (training set).
The predictivity of a model is attributed to the existence of a "true" relationship between the measured X properties (usually field interaction energies) and the Y property measurements (usually the activity of a drug). Even if the 3D-QSAR models are rough and are not able to explain to a full extent the activity values, they are extremely valuable to identify the areas around the molecules that contribute most to the activity. These areas can be regarded as areas important in the ligand-receptor interaction, and therefore they can give hints about how they interact and suggest new compounds.
The most sophisticated PLS 3D-QSAR models are subject to the same limitations of any regression model, and the aphorism "No regression model is better than the series it was obtained from" is always applicable. It is not possible that a model can provide information about the influence on the activity of areas which were not different enough in the series. Unfortunately no design method for 3D-QSAR have been reported so far and a good series design in this area continue to be a challenge. To a certain extent, the grouping of variables included in GOLPE may reveal the lack of design in the series (see section 5.4). Nevertheless the User should be aware of the lack of series design and understand the limitations of the resulting models.
2.2. PLS modeling.
As previously stated, in 3D-QSAR the X-matrix usually contains less than one hundred of objects and several thousands of variables. In this situations, the classical regression technique, Multiple Linear Regression (MLR) is completely useless. There are many reasons, but, among others:
In fact, the only regression method than can deal with the kind of X-matrices used in 3D-QSAR is Partial Least Squares (PLS). PLS works decomposing the X-matrix as the product of two smaller matrices, much like PCA does:
The main difference is that PCA obtains the PCs that represent at best the structure of the X-matrix and PLS obtains the LVs under two constrains:
The PLS algorithm originally proposed by Wold works as follows. For each component it is calculated:
w' = u'X / (u'u)
w = w / ||w|| (w is normalized to length 1)
t = Xw / (w'w)
c' = t'Y / (t't)
u = Yc / (c'c)
And these steps are repeated until t converges. Then it is calculated:
p' = t'X / (t't) (calculates the X-loadings)
E = X - tp' (updates the X-matrix)
F = Y - tc' (updates the Y-matrix)
Where (the lowercase letters represent the corresponding vectors from this matrices):
X X-matrix.
Y Y-matrix (can contain more than one single variables).
W Weight matrix which expresses the relationship between the X's and the Y's.
T X-score matrix, in which the objects are described in terms of the LVs.
C Y-weights, which describe the relationship between the X-scores and the Y-variables.
E The X-residuals. Variance that remains unexplained in the X-matrix.
F The Y-residuals. Variance that remains unexplained in the Y-matrix.
P X-loading matrix, the LVs described in terms of the original X-variables.
At the end, the PLS model satisfies:
X = TW' + E X's are decomposed in X-scores (T) and X-weights (W)
Y = UC' + F Y's are decomposed in Y-scores (U) and Y-weights (C)
U = T + H X-scores correlates with Y-scores (the inner relation)
Notice that the X-loadings (P) are calculated at the end of each iteration, to best approximate the X-matrix, and don't participate directly in the model building but to update the X-matrix.
The LVs share some important properties with the PCs:
As in the PCA, the User can select the number of LVs to maintain in the model, but in PLS, selecting the correct dimensionality is of critical importance. When too many LVs are included a serious overfit will result and the model will have little or no validity. To check how many LVs to include it is strictly necessary to test the predictivity of the model taking into account different number of components.
2.3. Tests of predictivity. Cross-validation.
The evaluation of the predictivity of the PLS models is important to :
The predictivity of a model is usually evaluated using cross-validation (CV). It works building reduced models (models for which some of the objects were removed) and using them to predict the Y-variables of the objects held out. Then the Y predicted is compared with the Y experimental, and so, for each model dimensionality the following indexes are computed:
SDEP Standard Deviation of Errors of Prediction.
Q^2 Predictive correlation coefficient
Y : Experimental value
Y' : Predicted value
: Average value
N : Number of objects
The CV technique is very valuable because it performs an "internal validation" of the model and obtains an estimation of the predictivity without the help of external data sets. This is particularly important in QSAR studies, where the number of objects available is usually small, and it is not affordable to remove objects from the learning data set.
One of the main inconveniences of CV is that there is not a general agreement on how to build the reduced groups and on the criterion to decide how many objects to keep. It is clear that the objects should be deleted once and only once over the model ensemble, but apart from this there are different approaches:
Leave One Out Models are built keeping one object at a time out of the analysis and repeating the procedure until all the objects are kept out once.
Leave Two Out Models are built keeping two objects at a time out of the analysis and repeating the procedure until all the objects are kept out once in all the combinations of two.
Fixed Groups The objects are assigned in a fixed way to N groups, each one containing an equal (or nearly equal) number of objects. Then models are built keeping one of this groups out of the analysis until all the objects were kept one once.
Random Groups The objects are assigned in a random way to N groups, each one containing an equal (or nearly equal) number of objects. Then models are built keeping one of this groups out of the analysis until all the objects were kept one once. The formation of the groups and the validation is repeated M times.
In order to choose the most appropriate CV method we have to consider the peculiarities of the QSAR data sets, in which the objects are always clustered. In this circumstance, the LOO or LTO methods have no chance to remove the structure of the data; most of the information contained in the object or couple of objects removed is anyway inside the model, kept in others objects of the same cluster, thus leading to overoptimistic results if this is also true that the SDEP is obtained in a reproducible way. On the other hand, the random group approach can produce a much better, but more conservative, estimation of the real predictivity. In other words: the uncertainly of future predictions is numerically worse but much more reliable. The lower is the number of groups, the harder is the validation criterion.
In this latter approach, the number of groups should be fixed in such a way that there are real chance that complete clusters are removed from the analysis in the reduced models. Also, the procedure should be repeated many times, in order to obtain stable results. For this procedure it can be computed the Standard Deviation of SDEP, which gives an estimate of the dispersion of the SDEP values obtained from different runs.
There are also some more details of the computation that may affect the results of CV. When some objects are removed from the data set to build the reduced models, there are two possibilities: to recalculate the average and weights of each variable for the new, reduced data set or to use the original averages and weights. The first approach is more time consuming, but the calculations are more accurate. This method may introduce bias and the reduced model may require one more LV, but only if the groups are formed only once. When the groups are formed in a random way the problem is removed.
Another way to evaluate the predictivity of the model is to use an external prediction set. In this approach the objects in the original data set are split up into two groups from the very beginning of the analysis. The first one, the learning set, will be used to build the PLS model. The other, the prediction set, will be used to compare their experimental Y-values with the predictions made by the PLS model. There is no doubt that this technique is more realistic to test the predictivity. However it can be argued that the results depend critically upon how many and which objects are assigned to each group. Also, the data sets in QSAR, often contains too few objects and it is not possible to remove objects from the analysis without a loss of information.
2.4. Uses of PLS.
As for PCA, the best way to examine the information from PLS is to plot the matrices obtained. To fully understand and diagnose a PLS model one should look carefully all the available plots.
2d T-U scores plot
This plot represent objects in the space of X-scores (T) against the Y-scores (U). From this plot the User can have a clear idea of the correlation between the X's and the Y's obtained in the model for each one of the LV. The plot of the first LV is by far the most informative and contains the main relationship between activities and structural descriptors. This information is completely lost is one looks only to the "predicted vs experimental" plot.
The plot can be useful to identify influential objects or clusters of objects (outliers). Usually these objects don't correlate in the first component. Then, the second or third LV is devoted to fit them, and they appear in the T-U scores plot for this LV as some (few) objects completely distinct from the rest of the objects. Accordingly, the T-U score plots for non-significant components show this behavior.
2d and 3d loading plots
These plots represents original variables in the space (two-dimensional or three-dimensional) of the latent variables (P).
Remember that the LVs are obtained as linear combinations of the original X-variables. The loading of a single variable indicates how much of this variable is included in the LV.
Variables contributing very little to the LVs have small loading values and are plotted around the center of the plot. On the other hand, the variables which contribute most are plotted at the borders of the plot.
Loading plots allow to test the "homogeneity" of the contributions of the X-variables to the model. When there are more than one single field group of variables, it is useful to plot the loadings highlighting the variables in one of them, to understand how each field contributes to the whole PLS model.
2d and 3d weight plots
These plots represent original variables in the space (two-dimensional or three-dimensional) of the weights (W).
The weights (W) represent the coefficients that multiply the X's to best fit the Y's. Therefore, we can say that the loadings represent better the first constraint used to build the PLS model (the representation of the X-matrix) while the weights represent better the second constraint used to build the PLS model (the fitting of the Y's). Variables with high weights are important for the fitting of the Y's while variables with low weights (those in the center of the plot) are not so important. When there are more than one single field group of variables, it is useful to plot the weights highlighting the variables in one of them, to understand how each field contributes to the whole PLS model, from the point of view of the fitting.
Grid-plot of loadings and weights.
In the field of 3D-QSAR, the X-variables are obtained from the grid analysis of a set of molecules and therefore each variables is obtained in a definite spatial location.
The grid-plot of loadings (P) allows to identify the areas in the space that contribute most to a certain LVs.
The grid-plot of the weights (W) allows to identify the areas in the space that contribute most to explain the Y's. The interpretation of the signs in this plot is tricky, because the values of field energies are negative for favorable interactions and positive for repulsive interactions. Therefore:
3.1. Concepts.
The conceptual model underlying 3D-QSAR is that the large number of variables included in the X-matrix somewhat captures the dominating effects due to changes in structure over the actual molecules. However, we should be aware that a large number of variables in the X-matrix have no relationship with the activity and introduce only noise in the description of the molecules.
It should be considered that any X-variable, even if it does not contribute to explain the Y-variables, certainly contribute to the structure of the X-matrix. As the solution provided by PLS has the constrain of explaining the structure of the X-matrix, this structure only makes more difficult to find a solution satisfying both constraints. It has been reported that PLS is unable to obtain a good model when a single explicative variable is hidden in the middle of many others, even if this variable is highly correlated to the Y. One of the possible solutions to this problem is to remove from the data set all the noisy variables, but it is not so simple to define a criterion nor a methodology to distinguish noise from information. The GOLPE procedure was developed just to handle this problem.
A second reason for selecting variables is to simplify the PLS models, in order to make their interpretation simpler. In particular, for 3D-QSAR problems a great improvement is obtained just because the models are interpreted in grid plots, that may appear very confuse when the model include the contributions of many variables. Models obtained after variable selection produce much more intelligible grid plots and the User can focus its attention of the most important areas.
3.2. The GOLPE procedure
In few words, the GOLPE (Generating Optimal Linear PLS Estimations) procedure is a method for detecting variables increasing the predictivity of PLS models. The models obtained by using only variables selected by GOLPE are more predictive than the PLS on all variables.
The GOLPE procedure involve the following steps:
First step. Initial PLS model.
The first step is to build a PLS model using all the variables and to select the dimensionality that produces the best predictivity. When the X-matrix contains a large number of variables (several thousands), it may be necessary to apply a preselection to remove the variables which contain little or redundant information for the model.
In GOLPE this preselection is done choosing variables according to their positions in the loading space, following a D-optimal design criterion. As stated previously, the variables placed in the center of the plots of PLS loadings or weights have a very small contribution to the model. The D-optimal algorithm works in this space selecting the variables with higher spread and less correlation, and therefore, containing more complementary information.
The number of variables to be selected is defined by the User. We recommend to proceed stepwise, keeping half of the variables in each step and building new PLS models with the reduced data sets. The preselection can be repeated until some significant change on the PLS model is observed, thus discarding this last preselection, where we would risk to remove information and not only redundancy.
However, in 3D-QSAR studies the preselection might not be required, because after removing variables which vary very little and variables at few levels with skewed distributions, there may be only a few hundred of active variables left, and it is possible to go directly to the second step.
Second step, evaluation of the variables.
In the second step GOLPE builds a large number of "reduced models" similar to the complete model but removing some variables. The predictivity of each model is evaluated using CV and, from these values, GOLPE relates the predictivity of the model with the presence or absence of each X-variable.
In this step, one of the main problem, is to find the most efficient way to evaluate the individual effect of each variable in the predictivity of the models. The strategy used by GOLPE is to make a "design matrix" following a Fractional Factorial Design (FFD) scheme. In this matrix, each experience is a "variable combination" from which a reduced PLS model is built, and the measured response is the predictivity for this model calculated by CV. The matrix contains as many columns as many variables there are in the X-matrix (p), which can take a value of (+), expressing that in this experience the corresponding X-variable is used or a value of (-), expressing that in this variable combination (experience) the corresponding X-variable is not used.
x_{1} | x_{2} | x_{3} | x_{4} | x_{5} | x_{6} | x_{7} | x_{8} | x_{9} | ... | x_{p} | |
model 1 | - | - | - | - | + | - | - | - | - | + | |
model 2 | + | - | - | - | - | + | + | + | - | - | |
model 3 | + | + | - | - | + | - | - | + | + | + | |
... | |||||||||||
model N-1 | + | + | + | - | - | + | + | + | + | + | |
Model N | + | + | + | + | + | + | + | + | + | + |
Once the N models were obtained and the SDEP for each one is calculated, it is possible to calculate how the presence or absence of a given variable affects the SDEP value (the effect of this variable), and therefore the predictivity
where:
E : The effect of a variable .
SDEP+ : The average SDEP for all the models that include the variable.
SDEP- : The average SDEP for all the models that don't include the variable.
N : The total number of models.
Unfortunately, in FFD, the main effects of the main variables are confounded with some of their interaction terms, and we have the risk of selecting as relevant a variable that is not. In order to minimize this possibility, GOLPE introduces in the design matrix some "dummy" or "ghost" variables: variables that appear in the design matrix but that have no equivalent in the X-matrix. The effect of these variables, is then compared with the effect of "real variables", on the basis of a Student-t-tailoring at the 95% confidence level. The comparison is performed as follows:
The average effect of the dummies (V) is computed:
where:
E_{D }: Effect of the dummy variable D.
ND : Number of dummy variables included in the design matrix.
t_{critical} : T critical for the 95% of confidence.
The effect of each variables (E) is compared with the average effect of the dummies (V). From this comparison the variables are labeled as fixed, excluded or uncertain:
E < V : This variable is "uncertain".
E > V and E < 0 : This variable decreases SDEP (increases predictivity). Fixed.
E < V and E > 0 : This variable increases SDEP, (decreases predictivity). Excluded.
The fixed variables have a positive effect on the predictivity. The excluded variables have a negative effect on the predictivity. The effect of the uncertain, either positive or negative, can't be distinguished from the spurious effects which appear for the dummies. The number of dummies included in the matrix is also important. In general, the larger is the number of dummies, the better the decision about the relevance of the variables, but in our experience, a ratio 4:1 between the real and the dummies variables is a good compromise.
Please note that the dummies are necessary just because of the confoundings present in the FFD. In full factorial designs, the specific effect of a single variable can be evaluated unambiguously, however, at the expense of an enormous increase in the computation time. A cheaper alternative consists in "folding-over" the design, that means repeating all the combinations in the design matrix after switching the signs. This new design has the main effects unconfounded with two-variables interaction effects and, therefore, the individual effects of the variables on the predictivity can be evaluated in a more reliable fashion.
x_{1} | x_{2} | x_{3} | x_{4} | x_{5} | x_{6} | x_{7} | x_{8} | x_{9} | ... | x_{p} | |
model 1 | - | - | - | - | + | - | - | - | - | + | |
model 2 | + | - | - | - | - | + | + | + | - | - | |
model 3 | + | + | - | - | + | - | - | + | + | + | |
... | |||||||||||
model N-1 | + | + | + | - | - | + | + | + | + | + | |
Model N | + | + | + | + | + | + | + | + | + | + | |
model N+1 | + | + | + | + | - | + | + | + | + | - | |
model N+2 | - | + | + | + | + | - | - | - | + | + | |
model N+3 | - | - | + | + | - | + | + | - | - | - | |
... | |||||||||||
model 2N-1 | - | - | - | + | + | - | - | - | - | - |
Third step . Remove undesirable variables.
The last step consist in removing from the data set the variables that don't improve the predictivity of the model. It is clear that all the variables labeled as "excluded variables" should be removed from the analysis, while those labeled as "fixed" will be kept. In the method it is possible to choose between removing or maintaining the uncertain variables, but we suggest to remove them.
However, in our experience, when the elimination of variables is forced too much it might appear overfitting in the model. Also, from a theoretical point of view, force the variable selection too much can break the PLS formalism, by destroying completely the structure of the X-variables. The User should compare the risks of keeping in the model:
Our advice to be conservative and to apply always a "mild" variable selection.
The whole GOLPE procedure can be repeated several times. In each step, a first PLS model is obtained and variable selection is applied to the data. In the next step, the reduced data file is used to obtain a new PLS model and to select variables again. Usually, after two selection of variables, the number of variables remains more or less unchanged. Also, the considerations stated above can be applied here: never force the variable selection too much.
Additional remarks:
GOLPE is a procedure which works evaluating the predictivity of the models using CV and, therefore, it is extremely sensitive to a misuse of the CV. In particular, one should make sure that the data set don't contains outliers nor dangerous clustering. If not, there is the risk to select the variables that better predict the outliers, obtaining a model of no general validity.
The variable selection can increase the predictivity of a model, but when the whole model has very little or no predictivity at all GOLPE cannot improve their quality, but it would derive chance correlations. Our advice is to never apply variable selection to models with low or negative Q^{2}. In any case, the User should be aware that under these conditions, there is a high probability of obtaining spurious results, and he should try at his own risk.
4.1. Concepts.
In 3D-QSAR problems, the change in the values of the grid-fields variables are produced by some structural variation in the molecules of the series. It is obvious that structural differences between different molecules of the series will not be reflected in a single variable, but in "groups" of several variables. It comes from this reasoning , that the structural changes that should be described are better represented by these groups of variables that by individual variables.
GOLPE offers the possibility to generate groups of neighbor variables in the 3D-space which represent the same structural information. These groups can be used in the variable selection procedure, in such a way that the "groups" replace the role of the individual variables in each step of the procedure, and therefore:
There are other reasons supporting the use of groups of variables:
4.2. The SRD grouping algorithm.
There are many ways in which the X-variables can be grouped. As mentioned before, the groups we conceived should include X-variables:
With the aim of obtaining such grouping of variables, we have developed an algorithm that works as follows:
Notice that the step 1 is performed on the chemometrical space of PLS weights or loadings. It is performed on the X-matrix as a whole and variables from any field are selected. Steps 2 and 3 are performed in the real 3D space around the molecules and are repeated separately for every field.
Step 1.
The User must select the number of seeds to extract from the X-matrix. They are then extracted from a space of the PLS partial weight or loadings of the dimensionality given by the User, following a D-optimal design criterion.
The seeds obtained by this criterion will correspond with variables important for the model and as independent as possible from each other. Let us suppose that we will extract the variables from a space of weights of dimensionality 2. If we plot all the variables in these 2d chemometric space, the D-optimal criteria will extract the variables farest from the center, showing higher weights, with the constraint of minimizing the correlation between them.
The dimensionality of the space should be carefully chosen, so the LV included represent the major effects in the model. Also, the User can choose to use the space of the loadings or to use the space of the weights. The results should not be too different but in our experience the space of the weights gives slightly better results.
Step 2.
This step is repeated for every field present.
The algorithm proceeds in a very simple way. For every X-variable (i) in the field, it finds the nearest seed variable (j) and it compares the distance with the cutoff. If the variable and the cutoff are nearer than this limit, then the variable i is included in the group j, otherwise it is included in the group 0 (dummy).
At this stage, the "raw groups" built only on geometrical consideration are called "Voronoi polyhedra". The number of resulting polyhedra is the same as the number of seeds. Please notice that the number of seeds can be different in each field. In fact, the number of seeds express how important is the field in the model, and therefore, the more important fields will contain more seeds while the less important will contain less seeds.
The group 0 contains all the variables that are far away from any of the seeds and therefore it is supposed to include only unimportant variables. All the variables in this group will be ignored in the variable selection and will be marked as "excluded".
Step 3.
This step is repeated for every field present.
Once the polyhedra are built, the algorithm computes for each polyhedra i three new variables that summarize the information contained in all the variables included:
Please notice that P_{i}, P_{+i} and P_{-i} are variables, which take a different value for each object.
Then the algorithm compares one by one the spatial position of all the seeds. For the two nearest seeds it computes the correlation coefficient (Pearson's r) between the summary variables P_{i} and P_{j}, P_{+i} and P_{+j} and P_{-i} and P_{-j}. The groups are "collapsed" (put together in a single group) if and only if all these three conditions are fulfilled:
r(P_{i},P_{j}) > 0.80
r(P_{+i}, P_{+j}) > 0.50
r(P_{-i}, P_{-j}) > 0.50
These correlation coefficients are computed only considering actual values. The missing values are not taken into account when there are missing values for both groups. When, on the other hand, P, P_{-} or P_{+} contains a missing value for one group and a real value for other, the collapsing is aborted. As we can see, the criteria for collapsing is highly conservative and prefers not to collapse two groups which contains quite similar information when there is some chance that the groups may be slightly different.
When two groups are collapsed, a new group containing all the variables originally included in both groups is created. In order to compared its position in the 3D space with other groups it is defined a "pseudo-seed", the coordinates of which are set as the average of the original seeds coordinates, weighted by the number of variables included in each group.
The procedure continue searching the nearest seeds and collapsing them until the distance between them is higher than a certain cutoff defined by the User. If the cutoff is set to a high value all the groups are applicable for collapsing.
It happens often, when the collapsing cutoff distance is high, that groups far away from each other (even in opposite corners of the grid cage ) collapse. There is nothing wrong in this phenomenon: simply it reveals that the in the present series the information of these two areas is correlated (a change in the structure in areas A is always accompanied by a similar change in structure in the areas B). Actually, it is extremely useful to unveil internal correlations that otherwise produce hidden misleading effects in the data set.
4.3. Configurable parameters of the grouping algorithm.
The User can configure in the algorithm the following parameters:
The algorithm is extremely flexible and depending on the parameters provided the groups can have many completely different meanings. The strategy that we recommend gives the best results for all the data sets that we have tested, but the method is open to the User for experimenting different approaches.
In our experience the best approach is to extract a large number of seeds, and build many small groups around them. All them constitute an "informative layer" around the molecule, while the variables far away from this layer are included in the group 0 and then removed from the analysis. Then, all these groups should be collapsed using high values of the cutoff. We have observed that there is an "intrinsic limit" in the number of groups: even when very different numbers of seeds are extracted, they collapse in a nearly similar number of groups.
We recommend the User to visualize the resulting groups, using the appropriate grid plot provided in GOLPE. This plot is be useful to check the position, size, correlation, etc.. of the groups generated by the described algorithm.
4.4. Regions.
Regions represent a much simpler approach to group variables because they follow only a geometrical criteria. To build regions, the User defines the number of divisions for each axis and the program simply split the cage in small cubes. All the variables inside these cubes constitute a separate region.
This technique is included in GOLPE mainly for comparison purposes. The groups, as defined in the previous sections, should produce better results in most cases.
4.5. The q^{2}-GRS method.
GOLPE includes a implementation of the q2-GRS method. See reference: Chao, S. J.; Thropsha, A. Cross-Validated R^{2}-guided Region Selection for Comparative Molecular Field Analysis: A Simple Method To Achieve Consistent Results. J. Med. Chem. 1995, 38, 1060-1066, for a full description of the method.
There are small but significant differences in the way GOLPE performs q^{2}-GRS:
This technique is included in GOLPE mainly for comparison purposes. The groups, as defined in the previous sections, should produce much better results in most cases.
Block scaling
Often, the data is organized in blocks of variables that share some common characteristics: scale, kind of information, etc… Typical examples of this situation are the 3D QSAR analyses involving several molecular interaction fields. In CoMFA analysis, the blocks correspond with the steric and electronic interaction fields and in GRID/GOLPE the blocks correspond with the results of GRID analysis with different chemical probes.
If the data in all the X blocks has a similar scale, no particular pretreatment is required. However, the different blocks often contain data showing very different variance. For example, if a set of compounds is analyzed using two probes: NH+ probe and DRY probe, the second block will show much less variation because the electrostatic interactions represented by the charged probe are much stronger than the hydrophobic interactions represented in the second field.
On one hand, the data is expresed in consistent physicochemical units, representing the energy of interaction in Kcal mol-1, and this is a valid reason for not scaling the variables. On the other hand, we must remember that the PLS or PCA methods are sensitive to the scaling of the variables. In the above example, if no scale is applied, the whole second block of variables will be ignored just because the variables in these block take small values. In such situations, we recommend the use of Block Unscaled Weight (BUW). The effect of this scaling is to normalize the scaling of the blocks thus giving the same importance to each block. The results of the analysis should be interpreted bearing in mind that the effect of blocks representing small energies of interaction was overemphasized, in order to give them the opportunity to participate in the model.
Multiblock methods
In situation like the one reported above, where a series of compounds is described by several molecular interaction fields, we can distinguish different blocks in the X matrix. Each block can be considered to give a separate piece of information, corresponding to a different "point of view" of the ligand- receptor interaction. Depending on the characteristics of the compounds and the kind of probes/fields represented, the blocks can contain information more or less independent of each other. For example, the block representing the interaction field of a series of compounds with a hydroxylic OH probe and with a water molecule will be rather similar, while it will be different from the information given by a methyl probe.
If a PCA is carried out using all the blocks simultaneously, all the information is summarized in a few LV. The contribution of each block to each LV can be visualized in the loading plots, coloring the variables according to the block they belong. However, the PCA models provide a common description for each object in terms of a unique set of scores.
Sometimes, it is interesting to gain insight about how each block participates in the model. The main questions are:
These questions are particularly important in certain kinds of studies, like the studies of selectivity of interaction [7]. Indeed, in these studies, one of the main questions is to discover which chemical groups (represented by a certain probe) can produce a better discrimination in the interaction with different receptors.
In order to answer these questions is possible to use methods as Hierarchical and Consensus PCA. In principle, both methods have in common that the PCA model is built in two levels: a block and a superblock level. The first level represents a particular description of the series from the "point of view" of each block and the second level represents the integration of the information of the block level in a unique description. These different "local" and "global" series description are represented by the block scores for each block (t_{b}) and the super scores (t_{T}).
Apart from the general structure of the model, Hierarchical and Consensus methods yield different solutions. A detailed discussion of the differences is out of the scope of this manual and can be found elsewhere [8]. GOLPE includes an implementation of Consensus PCA (CPCA) because we believe that this method is best suited for the problems we deal with in QSAR and 3D-QSAR.
Consensus PCA
Consensus PCA [9] uses the same objective function than PCA. It explains at best the overall variance of the matrix, and yields an overall model nearly identical to the PCA model. The original algorithm is an adaptation of the NIPALS algorithm.
The algorithm starts with an initial guess of superscores t_{T}, which is used to compute the different block loadings p_{b}. Then, the block loading are used to compute the block scores t_{b}. All the block scores are collected in a matrix T. The relationship between the block scores T and the superscores is defined by a weight matrix (w_{T}) which indicates the participation of each block into the overall scores. The whole procedure is then iterated until convergence of the superscores.
choose start t_{T} | |
loop until convergence of t_{T} | |
normalize p_{b} | |
normalize w_{T} | |
end loop |
As it is usual in NIPALS, the algorithm works dimension by dimension, deflating the original matrix using the block scores and loading.
As it was first shown by Westerhuis et al. [8], the block scores and loading can be obtained from a regular PCA, because at convergence the PCA scores and the CPCA superscores are identical (t = t_{T}). If it is so, no iterative algorithm is needed and we can obtain the block loading as
and the block scores and weights as
In this case, the matrix is deflated using the PCA scores and loadings instead of the block scores and block loadings.
This non-iterative algorithm has some advantages in terms of convergence with respect to the above algorithm and, in addition, provides more reliable results in presence of missing values. Indeed, this is the algorithm used by GOLPE.
Since the superscores are extracted from a regular PCA, much of the CPCA model is exactly identical to the PCA:
GOLPE does not apply any scaling to the blocks. Different authors recomend to weight block by the square root of the number of variables into each block (m_{B}) in order to make the importance of the block independent of their size. This makes sense especially when data is autoscaled, because in this case the variance of the block is linked to the number of variables. However, for the data used in 3D QSAR, the variance explained by each block is not strictly linked to the number of variables and in this particular case we recomend to use BUW instead.
Another peculiarity of the GOLPE CPCA algorithm concerns the meaning of the percentage of variance explained for each block. Westerhuis [8] sugested to compute from the residuals, as:
However, the most interesting part of CPCA is, in our opinion, the lower level and the amount of block variance explained by the "local" model is more informative. Therefore in GOLPE we compute the variance as:
please notice that this method does not guarantee that the addition of a further LV produces an increase in the variance explained that for a particular block. The variance explained within the blocks is usually higher than the variance explained in the overall model for this block, because each block can have its own block scores, while in the overall model all the blocks share the same scores.
Interpretation
The results of the CPCA model are essentially similar to the PCA, but it allows to gain a valuable insight into the role of each block in the model.
1. Superscores
As stated above, the superscores are identical of the scores obtained in PCA.
2. Block scores
Each block gives a different picture of the series, from the point of view of the block variables. The block scores are different for each block and different from the PCA scores.
The block scores can be seen as the scores of a "local" PCA that explain the variance within the block, but they are not equivalent to the scores that would have been obtained from a separate PCA made in each block. From the above algorithm, we can see that the scores were obtained using the block loadings (p_{b}), which in turn were extracted from analysis of the overall model.
The block scores can be seen as the "opinion" of each block in the description of the series. This "opinion" is then elaborated at a higher level in order to obtain a "consensus" and therefore, some of the block description is lost and does not appear in the overall model.
3. Block loadings
Are quite similar to the PCA loadings. They differ mainly in the scale, since they are normalized to length 1 for each block, while in PCA it is the whole loading vector which is normalized to length 1.
4. Superweights
Express the relative importance of each block in the overall model, for each model dimension.
With respect to the above questions; the number and identity of the blocks more relevant for the overall model can be appreciated in the superweight plot. If some blocks appear together they are holding more or less the same information. Often this plot shows that one blocks participate in one LV and others blocks in a different LV. This is also important to interpret the meaning of the LV’s themselves. If the questions concern the ability of the blocks to distinguish between objects or cluster of objects, the most important plots are the block scores. When two objects/clusters appear close in the plot of block scores, the information in this block is not able to discriminate between them. If they are far from each other, the information in this block is able to discriminate them.
CPCA in selectivity studies
In studies of species selectivity like [7], the use of CPCA allows to work in a completely different metric. In the classic GRID/PCA method, the interactions of the probes with the receptors are considered objects. Assuming only two receptors, the PCA scores plot usually shows two clusters of objects, each one representing a family of interactions. Usually two LV are relevant: the first represents the discrimination between the two clusters and the second ranks the probes according to the strength of the interaction. The inspection of the loadings for the first LV (the one involved in the discrimination) highlights the regions of the receptor more important for designing selective compounds.
With the use of CPCA, the fields are added side by side. In this case, the score plot does not show two clusters of objects but only two objects. The information of the relevance of each probe for the discrimination is shown in the plot of superweights. As in the previous method, the loading can be inspected to find the regions more relevant for the discrimination, but additionally, the inspection of the block loadings allows identifying these regions independently for each probe of interest.
A drawback of the GRID/PCA method is that the interaction with different probes has different scales. Therefore, it is not easy to say if a probe is selected as important for the discrimination because the interactions are really more selective or just because the interactions are more intense. On the contrary, in CPCA is possible to apply BUW in order to give exactly the same initial importance to all the blocks of variables.
Active multiblock plots in GOLPE
GOLPE includes some active plots specifically designed for helping in the interpretation of CPCA models: Active CPCA, Active CPCA differential and Active CPCA blocks.
The firsts two plots are essentially active PCA plots. They can be obtained separatelly for the different X blocks of the matrix and show in the screen a 2D plot of block scores and a pseudo field Grid plot. In the plain version (Active CPCA), the User can click in any position of the 2D plot and the Grid plot will show a representation of the field which would be generated by a compound present at that position in the block scores plot. The differential version of this plot permits the User to draw a vector in the 2D plot. In this case, the Grid Plot will present the differences in the field which would exhibit two objects placed in the extremes of the vector in the scores plot. In the computations only the two firsts LV are taken into account.
The values of the pseudo-field are obtained simply as:
X_{b} = t_{b}.p_{b}
being t_{b} the value of the scores clicked by the User. Obviously, the values of X are obtained only for the block selected and correspond only to the variance explained by these two LV.
The Active CPCA blocks plot is a differential plot that shows in the screen a superscores 2D plot and a dynamic histogram. The User can click in two positions of the CPCA superscores plot (idem to the PCA scores plot) and the histogram will show the relative distance between the two clicked points in the different block scores plots. This plot is useful to understand the relevance of the different blocks in the discrimination between two objects or clusters of objects, and therefore is important in selectivity studies, mainly when more than two kinds of receptors are being studied.
The relative distances are computed as:
t^{D} is the vector obtained as t^{B}-t^{A}, being t^{B} and t^{A} the points clicked by the User
from t^{D}, a pseudofield X is computed
X = t^{D}.p
And X can be spliced in different blocks X_{1}…X_{m}
And from this formula we can obtain as
The length of the vector is computed and normalized so the largest vector takes a value of 100. Please notice that the vectors can be compared because the p_{b} are normalized to length 1, but still the values of t_{b} are dependent on the block scale. We recommend using this plot only on data with consistent block scaling.
Manuel Pastor, 1999