GRID/GOLPE analysis on a set of thrombin inhibitors
Build, validate and interpret 3D-QSAR models on a series of thrombin inhibitors taken from literature (ref 1,2,3).
|[MENU]||Means that you have to choose the menu option
identified by this label. Usually you have to click with the mouse or
type the label. The labels separated by the symbol >>> mean
that you have to "navigate" some submenus. For instance
"choose menu A, then a submenu appears where you have to choose
option B, then a submenu appears where you have to choose option C.
|[DIALOG]||Means that a dialog window is open for the user choice.
Select the option indicated.
|[BUTTON]||Press the button with this label
|[RETURN]||Press the <Enter> key
Compounds have a tripeptide-like structure, with a free ionizable amine in P1, usually linked to a cyclohexyl or an aniline, a proline in P2 and different substituents in P3.
The activity measured for all compounds is the thrombin constant of inhibition (Ki) and has been used in inverse logarithmic form (pKi) through the work.
This series contains 48 compounds, that can be classified into four families, according to the kind of substitution present at the P3 position. The compounds were colored and given names according to the chemical family to which they belong.
Click on any of the names above to see a table with the structures.
Conformation and alignment
The structure of the compounds was obtained by modelling. In order to obtain a sensible conformation and superimposition, the most active compound (BSN10) was docked into the active site, as found in the complex thrombin-PPACK for which a x-ray crystallographic structure is available. The rest of the compound were build using the structure of this compound as a template. The structures were further optimized using SYBYL force field and standard conditions.
The P1 position is well superimposed, but for the compounds having an aniline in this position. Also, the prolines in P2 are well superimposed. As you can see, most of the structural variability is present in position P3. Families BOC and BSN include substituents that protrude clearly from the common core. Compound EST02, which has a quite peculiar structure, is poorly superimposed.
As usual, the search for a suitable conformation and alignments for all the compounds is one of the most time consuming steps of the work. Indeed, this series is particularly difficult, due to the flexibility of the compounds.
Please notice that the obtention of a good conformation and alignments is a must. If the superimposition is wrong, no chemometrical method would ever be able to obtain a good model.
Once every compound was superimposed, the molecular interaction field around the compounds was measured using program GRID (ref 4). A cage of 23x28x26Å was obtained, yielding 16744 ligand-probe energy measurements for each compound.
In order to obtain a "realistic" model, the measurements should represent the same kind of ligand-receptor interactions present at the active site. In order to choose the best suited probe, a preliminary study was carried out, using five different probes:
|DRY||The hydrophobic Probe||white|
|O::||Carboxyl oxygen atom||red|
|OH||Phenolic hydroxyl group||blue|
|C3||Methyl CH3 group||green|
We will load into GOLPE a previously computed GRID kont file containing probe-ligand interaction for each compound in the series. Since we have used 5 probes we will obtain a matrix with 5x48=240 objects.
When the computation is finish, GOLPE will show the amount of variance explained by each PC. Please notice that the first two PC explain around half of the X variance in the dataset. Now we will look to the scores plot. In this plot we will see a simplified 2D picture of the data. Even if it uses only 2 variables, this picture contains most of the relevant information contained in the original matrix of 16744 variables. Objects here are molecules-probes pairs and have been colored according to the probe.
The plot shows three main clusters: from left to right they correspond to the DRY-ligand pairs (white), the NA+-ligand pairs (cyan) and the rest of the probes (red, green and blue).
In this plot, the relative power of the probes to distinguish among the object can be seen as the vertical spread of the clusters. The DRY probe (white points) gives nearly the same information for all the compounds, and all appear clustered tightly. On the other hand, the O:: and the OH probe seem to produce the best results.
The PCA is also a good tool to inspect the results of the alignments. The scores plots shows the different molecules in positions that reveals their similarities. If one of the compounds appears very different this can be the result of a mistake in the alignments.
Another way to see the same result is to make a tentative, rough PLS model.
Let we see a plot of recalculated versus experimental:
In this plot we can appreciate the different ability of the probes to describe properties relevant for explaining the activity. In accordance with the PCA scores plot we can see that in this dataset the probe DRY is not good, while the OH probe gives the best results.
Therefore, we decided to choose the phenolic hydroxyl probe (OH, blue) for the GRID/GOLPE analysis.
PCA and PLS modeling
Once we have decided the probe to use we will repeat the PCA analysis in a matrix containing only the OH probe.
Now we will have only OH pairs (48 objects).
In order to have a better understanding of the objects in the plots, we will start assigning colors to the objects according to the chemical families they belong and according to their activities.
Repeat the PCA analysis:
Please notice that in this plot, the different families of compounds appear in different clusters. Put the cursor inside the window and press the <N> key. The plot will show the name of the compounds.
The first PC is devoted to explain the differences between the benzylsulfonamide compounds and all the rest, while the second explains the difference of BOC compounds. Notice that the families exhibiting more important differences are explained in the first PC's.
Now, color the compounds according to their activities. Hold pressed the right mouse button and select symbol color>>>spectrum.
As you can see, the most active compound (BSN10) is in the middle of the plot. Apart from this no axis exhibit a clear relationship with the activity.
PCA can be also used to show structural features of the compounds that covariate in the series as a consequence of the lack of design. It is very important to be aware of this covariance in order to avoid misinterpretations of the models.
This will open two windows. The 2D plot represents the scores for the first two PC.
We will start importing a couple of structures represented in opposite sides of the plot, for instance BSN07 and BOC01. In the Desktop click the folder icon and then the folder tutor2. Select the files BSN07.kout and BOC01.kout and drop them into the graphic window of the Grid plot. The structures of the inhibitor should appear immediately in the screen.
Open Molecules>>>Molecules Manager, select one item in the list and then press Molecules color to color BSN07.kout of cyan (palette, color 10) and BOC01.kout of orange (palette, color 6).
To avoid repeating the same operation for each Grid plot, go to the menu Molecules>>>Molecules Manager in the Grid plot window and click on the button, Save As Template. The next Grid plot will load automatically the same structures and settings.
Make click on the left side of plot with the middle mouse button and then click again with the same mouse button on the right side and at the same height. A red arrow is draw between these two points in the 2D plot, while in the grid plot will appear a field which corresponds with the differences in field values between two real or hypothetical objects placed in those two positions of the scores plot.
Please notice: in GOLPE, yellow regions correspond with POSITIVE values, while blue regions correspond with NEGATIVE values. Since the surfaces can overlap the structures, it would be helpfull to make the surfaces semi-transparent (use Data>>>Data Levels, and then press Material) or represent the data with symbols (Data>>>Symbols).
When the arrow is parallel to an axis, highlights those structural features which make objects different in this PC. In our study, the first PC is devoted to explain the most different family of compounds; the BSN. If we study the field we will appreciate that this family is peculiar in many respects. Near P1 there is a blue cloud in the region that is occupied by amino cyclohexyl and not by aniline. No field is present in P2, since all compounds are quite similar in this region. In P3, there is a blue region indicative of the absence of the second ring, the yellow correspond with the presence of the sulfonamide moiety. Even over the phenyl ring there are small areas that correspond with the increased presence phenyl and Cl-phenyl rings in this series.
The important point is that all those structural features covariate. In other words: since the presence of the sulfonamide is nearly always present together with the absence of the second phenyl ring in P3. Any model will correlate both structural characteristic with the activity and it will be impossible to know if the increase in the activity is a consequence of one (the presence of the sulfonamide) or of the other (the absence of the second phenyl ring).
Please close the 2D plot, placing the cursor over the window and pressing the <X> key, and the Grid plot, pressing Alt-Q.
Before to proceed to make the PLS model, variables should be treated in order to remove "noise" at a maximum. In particular, very tiny values of interaction energy should be converted to zero, in order to remove spurious variation of the data. This operation is called "zeroing". Afterwards, variables showing very small variation through the series (variance, as it is called in statistical terms) should be removed. The philosophy in GOLPE is to take out initially only variables with very, very small variance. In this way we disagree with the standard way of working in CoMFA (ref 5) that suggest to remove much more variables from the very beginning (MINIMUM SIGMA). Indeed, if the variables are removed this way, we apply a variable selection procedure, and a very rough one since it is guided by how much the variables variate and by no other criterion.
A latter step in this treatment is to remove the variables which show a skewed distribution. Those variables take only two or three different values in all the series: one value is taken by only one or two of the objects and the other for all the rest. As a consequence, these variables were used by the model to explain perfectly those few objects, thus producing spurious results. In GOLPE, N-level variables can be easily inspected and removed.
As you can see, they correspond with positions occupied by unique compounds in the series. Therefore we will remove the 2-level variables.
The pretreatment in GOLPE is carried out using the Advanced Pretreatment Tool
Start by removing very small values.
NOTE: Use the mouse to move the slide, but to make small modifications use the arrow keys.
Next we will remove variables showing very small variance.
NOTE: Use the mouse to move the slide, but to make small modifications use the arrow keys.
The next step is to remove N-level variables:
Now we will save the pretreated datafile, and reload it. If you performed correctly the pretreatment you should have 4008 active vars. If it is not so, please DO NOT SAVE THE FILE:
And then we can proceed to obtain the first PLS model and validate it:
The validation will take a couple of minutes to be completed. The values of r2 and q2 are printed, but can also be inspected graphically:
It is remarkable that the first PC explains a lot of X variance (25%) but no much Y variance (30%). This is an unusual effect, typical in datasets which contains very different kinds of objects. This means that the first PC's will be devoted to explain inter-cluster differences. The same can also be seen in the PLS plots for the first PC. The PLS plots represent the relationship between the activity and the projection of the X, showing the X-Y correlation that controls the PLS model building for each dimension.
As you can see, the first PC do not produces a very good correlation with the activity, but discriminates between BOC and BSN families and all the others. The second PC completes the models and gives a better correlation
All in all, those plots show that the first PC is mainly devoted to explain the structure of the data and the inter-group differences. This fact is reflected in the modest percentage of Y explained by the first PC, when compared with the amount of X variance explained, and also in the regular growing of both r2 and q2 in the first PC. The values of q2 are not to good, but there are no negative values and there is a clear tendency to increase.
It is also important to see how the original variables contribute to the PC. This can be done by plotting the PLS weights:
IMPORTANT. Please do not remove the PLS-weight plot, it will be used latter on.
The weight plot reveals large clouds of variables in the center, no contributing at all to explain the activity. These correspond with variables at positions no relevant to explain the biological activity of the compounds or not showing enough variation to contribute to the model. In those situation, when the ratio variables/objects is large (ratio in the order of thousands or hundreds) the use of a suitable variable selection procedure is advisable. Variables selection can be used only in the presence of a model. If the preliminary model shows small r2 and/or q2 then the use of variables selection is discouraged and can lead to spurious results.
PLS variables selection
In this work we will use Fractional Factorial Design variables selection (FFD). In this method, the effect of introducing every variable into the model is estimated by analyzing the results, in terms of cross-validated SDEP, of a huge number of models in which some variables were keep or removed according to a FFD criterion. However, since variables neighbor in the space represent exactly the same information it is advantageous, in terms of speed of computations and of stability and interpretability of the results, to perform this evaluation on groups of variables, instead of on individual variables. Hence, we will carry out first a Smart Region Definition (SRD).
NOTE: Those options were chosen only for demo purposes. The SRD will work better with around 1000 seeds, a critical distance of 2.0 and the maximum collapsing distance. However this calculation would last for some hours.
Please wait a couple of minutes.
The resulting groups of variables can be inspected graphically.
Since it contains many variables, hold pressed the right mouse button and select "positive regions>>> >1". You will see that this region contains variables present in regions showing small variability and that can not be put in relation with the activity. Indeed, they correspond with the cloud of variables near the center of the weights plot.
You can visualize other groups. We suggest you to visualize the group 52 as an example of a group that has resulted after the collapsing of other two groups containing the same information. As a general rule, large groups correspond with areas containing few information, for instance, groups 76 and 63 (containing around 30 vars.) are placed near the P2 moiety.
In the following step, those groups will be used like if they were single variables in a FFD variables selection procedure. This way, methods like the D-optimal preselection of variable is no longer needed.
IMPORTANT, select Execution in a window. Press the OK button. The FFD variables selection can take a long time to compute. After a few seconds the window will show an estimated time of completion. Since this time is long, we will kill the process simply by closing the window.
Fortunately, we have pre-computed the variable selection and can proceed directly to remove the unselected variables.
This corresponds with the second consecutive FFD carried out in the data set. After the variables selection we can re-build the models and appreciate how they have improved with respect to the previous ones.
The plot of r2 and q2 shows that the selection of variables has improved significantly the model. Even if numerically the model with 5 PC shows the best "numerical" quality indexes, in the graphic it is evident that the improvement of both the fitting and the predictive abilities are modest and do not compensate to add another PC.
Lets see the effect of the selection of variables in the PLS weights:
Compare the new PLS weights plot with the one obtained before the variables selection. The global aspect of the plot is the same, but we can appreciate that the nucleus was less dense and some clusters of variables not contributing to explain the activity have been removed (see the cluster in the - - quadrant, near the origin). It is possible to identify those variables removed in this cluster using the clipboard.
In the old PLS partial weights plot select an area clicking with the middle mouse button. Each click will define the corner of a polygon that must enclose the area chosen. When the polygon was closed the variables inside will be highligthed in the grid plot.
As you can see, those variables are far from the molecule or represent information which is better represented by variables more close to the structure.
The overall quality of the model can be inspected using calculated vs. experimental plots and residual plots.
In the recalc vs. exper. plot we can appreciate that the fitting is acceptable. In the Recalc. Residuals plot we saw no pattern. In those plot it is always useful to color the points by chemical families to show more clearly patterns or tendencies.
Once we have a model with satisfactory predictive properties, we must interpret it in order to answer the key questions:
Those questions should be answered in plain chemical terms, and not in a mathematical or statistical way. This part of the work is the so called "interpretation" and demands from the chemist a certain knowledge of both the chemical and the chemiometric part of the research. In our experience, is a difficult part and indeed we are committed in the development of new tools for this step.
Interpret a 3D-QSAR model implies to look to different grid plots, in which the chemometrical information obtained by the model is represented in the real space around the molecules. In the following plots we have pre-loaded the structures of BSN09 (colored by atom type) and of BOC01 (in orange) for reference. Feel free to load any other structure if you wish, proceeding as described above.
The basic plot is the PLS pseudo-coefficients plot. In this plot we represent the relative importance of each position around the molecules for explaining the activity.
The values represented here summarize the properly weighted contributions of each PC, up to a certain level. However in this plot the signs of the coefficients can induce to errors: coefficients have opposite meaning depending on the fact that the compound produces positive or negative field values in this area:
|Blue regions||A favorable (negative) interaction INCREASES activity.|
|A unfavorable (positive) interaction
|Yellow regions||A favorable (negative) interaction DECREASES activity.|
|A unfavorable (positive) interaction
For example, identify in our plot the large positive (yellow) values placed over the sulfonamide moiety. Since the most active compounds have this peculiar substituent, positive coefficient actually overlap the subsistent. The compounds of BSN family, will place the substituent here thus producing positive field values that will be multiplied by the values of the coefficients.
A second graphic tool will helps us to understand the effect of those coefficients on the values of activity assigned by the model to a certain object: this is the activity contribution plot. The activity contribution plot is different for every object and results from multiplying the values of the coefficients by the actual values of the field for this object.
Here, it is evident that in this particular compound, the coefficients mentioned before contribute to increase the activity of the compound. Please notice that nearly every value is yellow (positive). Since it is the most active compound in the series, all its field values are recognized by the model as favorable for the activity. The active contribution plots are powerful tools for understanding the meaning of the different coefficients given by the model.
However, it is tedious to open different plots for every object. In order to simplify this operation we have developed the active PLS plots.
This will open two windows.
For instance, click near the object BSN09. You will see an activity contribution plot which shows the contribution to the activity given by different areas, according to the model (2PC). If you click now at the bottom of the plot (near BOC01) you can appreciate that the yellow areas at the top of the grid plot change from yellow to blue.
This region, which has negative coefficients, (see the coefficient plot) is responsible to assign higher activities to members of the BSN family, because the sulfonamide group produces negative interaction there and assign lower activity to member of the BOC family, because the BOC substituent, located inside the area, produces positive field interactions.
Particularly interesting are the values of the activity contributions given when we click on the upper right corner and on the lower left corner of the 2D plot. The first correspond with the activity contributions of a compound that the model predicts to be of maximum activity and the second with a compound of minimal activity. The information of those plots would be of great interest on the design of new compounds.
If we are looking the reasons why a compound or a family or compounds are more active than another, a even better tool is the differential active plot.
In this tool we see exactly the same kind of plots we saw before but instead to see the values of the active contributions in a certain position, we will see the differences in those activity contributions between two positions in the scores plot, marked by consecutive clicks with the middle mouse buttons.
A latter question is to know if actually the model is reproducing the receptor, in other words: if from the map of coefficients we can infer information about the structure of the receptor. In principle, the answer is no, since the model we get uses only the information provided by the series and therefore the model of the receptor will be extremely sensitive to lack of design, correlated structural features and alignments. To a certain extent, an mainly applied to areas where the structural variability is high, we can have some hints of the kind of L-R interaction present.
Adjust cutoffs using Data>>>Data Levels. As you can see, some of the coefficients actually map important residues, like for example the aspartic acid that makes interactions with the charged amine at P1 position. Other interaction might be seen as representing favorable van der Waals interactions while some others have no clear explanation, for example the cloud of negative (cyan) coefficients over the sulfonamide group.
WEB design by Manuel Pastor, 1999