TUTORIAL 1

GOLPE Analysis of a Set of GRID Fields

Objectives

The interaction field between a phenolic hydroxyl probe and a set of 47 compounds has been computed with GRID. We want to extract information from the GRID field and eventually obtain a 3D-QSAR model. The GOLPE program will be used.

 


Conventions

[MENU] Means that you have to choose the menu option identified by this label. Usually you have to click with the mouse or type the label. Labels separated by the symbol >>> mean that you have to "navigate" through submenus. For instance, A>>>B>>>C means "choose menu A, then choose option B in the submenu that appears, then choose option C in the next submenu that appears."

[DIALOG] Means that a dialog window is open for the user choice. Select the option indicated.

[BUTTON] Press the button with this label.

[RETURN] Press the <Enter> key.

 


Create a new data file

The starting material for this tutorial can be found on the CD-ROM, under the directory tutor1. Copy all the contents of this directory to a directory in your home directory:

 

 

Change to the tutorial1 directory and list its contents.

 

 

This directory includes the file grid.kont, which is the output of a GRID analysis of 47 Glycogen Phosphorylase inhibitors. The activity of the compounds was already entered in GRID, so it appears in the kont file.

 

Import the kont file into GOLPE

 

 

Note that, while importing data, GOLPE lists the names and the activities of the compounds in the series.

 

 

The series contains 47 compounds. The data file contains the interaction energies computed in a grid of 20x18x22 nodes = 7920 interaction energy measures for each compound. Only 7664 of these X variables show some variation; the others have constant values (the same value for each compound in the series).
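The node count above, and the idea of a variable that "shows some variation", can be checked with a short Python sketch. The helper name count_varying_columns and the toy matrix are illustrative assumptions and are not part of GOLPE; only the grid dimensions come from this tutorial.

```python
# Hypothetical helper, not a GOLPE function: counts the X variables (columns)
# whose value is not identical for every compound (row).
def count_varying_columns(x_matrix):
    n_columns = len(x_matrix[0])
    varying = 0
    for j in range(n_columns):
        column = [row[j] for row in x_matrix]
        if max(column) != min(column):
            varying += 1
    return varying


# Grid size taken from the tutorial: 20 x 18 x 22 nodes.
n_nodes = 20 * 18 * 22

# Toy data (made up): 3 compounds, 4 grid nodes; the second and fourth
# columns are constant across compounds.
toy_x = [
    [-1.2, 0.0, 0.5, 5.0],
    [-0.8, 0.0, 0.7, 5.0],
    [-1.0, 0.0, 0.1, 5.0],
]
n_varying = count_varying_columns(toy_x)
print(n_nodes, n_varying)
```

In the real data file the same count would be taken over 47 rows and 7920 columns.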

 


Pretreatment

Pretreat the data. Raw data is not suitable for chemometrical analysis; some kind of pretreatment is always required. We will use the Advanced Pretreatment Tool in GOLPE to diagnose and pretreat our data file.

 

 

A window appears, showing the main characteristics of the data. First we will convert very small X values to 0 (zeroing). This operation decreases the noise contained in the field energies.

 

NOTE: Use the mouse to move the slider, but use the arrow keys to make small modifications.

 

To help choose the parameter values, you can display X-profiles and histograms that show the data distribution. Advanced users are encouraged to explore the Histogram and Profile buttons.

The effect of the change is immediately reflected in the window. Notice that after zeroing, many variables have become "inactive", because their variance is now very small or they do not vary at all.
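As a rough illustration of the zeroing step, the following Python sketch sets values close to zero to exactly zero. The function name and the threshold of 0.05 are assumptions made for this example; the tutorial does not state the cutoff applied with the slider.

```python
# Minimal sketch of zeroing, assuming it means "replace every value whose
# absolute value is below a cutoff by 0". Not GOLPE code.
def zero_small_values(x_matrix, threshold):
    return [[0.0 if abs(v) < threshold else v for v in row] for row in x_matrix]


# Toy data (made up): 2 compounds, 3 grid nodes.
toy_x = [
    [-0.004, -1.3, 0.02],
    [0.003, -0.9, 0.6],
]
zeroed = zero_small_values(toy_x, threshold=0.05)
print(zeroed)
```

In this toy case the first column becomes constant after zeroing, which is how a variable ends up "inactive".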

 

Next we will remove variables with very small variance.

 

NOTE: Use the mouse to move the slider, but use the arrow keys to make small modifications.

 

To help choose the parameter values, you can display X-profiles and histograms that show the data distribution. Advanced users are encouraged to explore the Histogram and Profile buttons.

The effect of the change is immediately reflected in the window. Notice that after this change only 2111 (26%) of the variables are active and will be used for the analysis.
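The removal of low-variance variables can be pictured with the sketch below. It assumes the filter compares the standard deviation of each column with a minimum cutoff; the function names, the cutoff of 0.1, and the toy data are hypothetical.

```python
import math


# Sample standard deviation of one X variable across the compounds.
def standard_deviation(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


# Hypothetical filter, not GOLPE code: returns the indices of the columns
# whose standard deviation reaches the cutoff (the "active" variables).
def active_columns(x_matrix, min_sd):
    n_columns = len(x_matrix[0])
    keep = []
    for j in range(n_columns):
        column = [row[j] for row in x_matrix]
        if standard_deviation(column) >= min_sd:
            keep.append(j)
    return keep


# Toy data (made up): column 0 is constant, column 2 barely changes.
toy_x = [
    [0.0, -1.0, 0.10],
    [0.0, -2.0, 0.11],
    [0.0, -3.0, 0.10],
]
kept = active_columns(toy_x, min_sd=0.1)
print(kept)
```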

 

Next we will remove n-level variables.

 

 

The effect of the change is immediately reflected in the window. Notice that after this change only 2094 (26%) of the variables are active.
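The tutorial does not define n-level variables. The sketch below assumes they are variables that take only a few distinct values (here 2 to 4) across the series; this definition, the function names, and the toy data are assumptions made for illustration only.

```python
# Assumed definition (not taken from the tutorial): a column is "n-level"
# if it takes between 2 and max_levels distinct values across the compounds.
def is_n_level(column, max_levels=4):
    return 2 <= len(set(column)) <= max_levels


# Hypothetical filter: drops the n-level columns and reports which were kept.
def drop_n_level_columns(x_matrix, max_levels=4):
    n_columns = len(x_matrix[0])
    keep = [
        j for j in range(n_columns)
        if not is_n_level([row[j] for row in x_matrix], max_levels)
    ]
    reduced = [[row[j] for j in keep] for row in x_matrix]
    return reduced, keep


# Toy data (made up): 6 compounds. Column 0 is non-zero for one compound
# only (2 levels); column 1 takes a different value for every compound.
toy_x = [
    [0.0, -0.1],
    [0.0, -0.5],
    [0.0, -0.9],
    [0.0, -1.3],
    [0.0, -1.7],
    [-2.0, -2.1],
]
reduced, kept = drop_n_level_columns(toy_x)
print(kept)
```

A variable like column 0 describes a single compound rather than a trend in the series, which is the usual reason for excluding it.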

 

Now we will save the pretreated data in a different GOLPE file, so that we do not have to pretreat the data each time we want to work with this series. After that, leave the Advanced Pretreatment Tool.

 

 


Principal Component Analysis

Open the file with the pretreated data.

 

 

The data file contains the interaction energies computed in a grid of 20x18x22 nodes = 7920 interaction energies measured for each compound.

The series contains 47 compounds. Only 2094 X variables show some variation; the others have constant values (the same value for each compound in the series).

 

We will first perform a Principal Component Analysis, in order to understand better the types of compounds present in the series and to detect singular points.

 

 

Please wait a minute....

 

The percentage of X variance explained for each model dimensionality is presented. We will now look at some scores and loadings plots.
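To show what "percentage of X variance explained" means, here is a small pure-Python sketch that extracts principal components one at a time with the NIPALS iteration. The tutorial does not say which algorithm GOLPE uses, so NIPALS, the function names, and the toy data are all assumptions.

```python
import math


def center_columns(x):
    n, m = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in x]


def sum_of_squares(x):
    return sum(v * v for row in x for v in row)


# First principal component of x (scores t, unit-length loadings p) by NIPALS.
def nipals_component(x, n_iter=500, tol=1e-12):
    n, m = len(x), len(x[0])
    # Start from the column with the largest sum of squares.
    start = max(range(m), key=lambda j: sum(row[j] ** 2 for row in x))
    t = [row[start] for row in x]
    for _ in range(n_iter):
        tt = sum(v * v for v in t)
        p = [sum(x[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        t_new = [sum(x[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol:
            break
    return t, p


# Percentage of the centered X sum of squares removed by each component.
def explained_variance_percent(x, n_components):
    residual = center_columns(x)
    total = sum_of_squares(residual)
    percents = []
    for _ in range(n_components):
        before = sum_of_squares(residual)
        t, p = nipals_component(residual)
        residual = [
            [residual[i][j] - t[i] * p[j] for j in range(len(p))]
            for i in range(len(t))
        ]
        percents.append(100.0 * (before - sum_of_squares(residual)) / total)
    return percents


# Toy data (made up): 4 objects, 3 variables, two of them strongly correlated.
toy_x = [
    [2.0, 1.0, 0.1],
    [4.0, 2.1, -0.1],
    [6.0, 2.9, 0.2],
    [8.0, 4.0, -0.2],
]
percents = explained_variance_percent(toy_x, 2)
print(percents)
```

Because the first two toy variables are almost collinear, nearly all of the variance falls on the first component.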

 

 

Maximize a window by clicking on the big square icon in its upper right corner, or iconize it by clicking on the small square icon in its upper right corner. 2D plots show a pop-up menu when the right mouse button is pressed.

Grid plots are interactive and allow the user to move the plot in 3D. Grid plots show negative values in blue and positive values in yellow.

To help the interpretation, it is convenient to load some structures into the plot. In the Desktop, click the folder icon and then the folder tutor1. Select the files "inh_glucose.pdb" and "inh_thio3.pdb" and drop them into the graphic window of the Grid plot. The structures of the inhibitors should appear immediately on the screen. To avoid repeating the same operation for each Grid plot, go to the menu Molecules>>>Molecules Manager in the Grid plot window and click on the button Save As Template. The next Grid plot will then load the same structures automatically.

It is convenient to simplify the Grid plots by using a smaller cutoff value.

 

 

Explore the plots. Identify patterns. Find relations. Try to answer these questions:

  1. How many kinds of structures do we have?
  2. What makes these structures different?

 

Delete all 2D plots by selecting, in their respective pop-up menus:

 

[MENU] exit

 

Delete all Grid plots by selecting, in their respective menus:

 

[MENU] File>>>Quit


Partial Least Squares

Start by generating a PLS model.

 

 

After a minute, the percentage of Y variance explained for each dimensionality of the PLS model is presented.

 

To decide the right dimensionality of the PLS model, perform cross-validation. Use randomly formed groups (5 groups and 20 randomizations), which produce more conservative results than Leave One Out (LOO) cross-validation and are more appropriate for QSAR problems.
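The random-groups scheme can be sketched as follows: in each randomization the 47 compounds are shuffled and split into 5 groups, and each group is held out once while a model is built on the rest. The function names and the fixed seed are assumptions for this example; how GOLPE forms its groups internally is not described here.

```python
import random


# Randomly split object indices 0..n_objects-1 into n_groups groups
# of nearly equal size.
def random_groups(n_objects, n_groups, rng):
    indices = list(range(n_objects))
    rng.shuffle(indices)
    return [indices[g::n_groups] for g in range(n_groups)]


# Hypothetical helper: returns one (training, held_out) pair per group
# and per randomization.
def cross_validation_splits(n_objects, n_groups=5, n_randomizations=20, seed=0):
    rng = random.Random(seed)
    splits = []
    for _ in range(n_randomizations):
        for held_out in random_groups(n_objects, n_groups, rng):
            held = set(held_out)
            training = [i for i in range(n_objects) if i not in held]
            splits.append((training, sorted(held_out)))
    return splits


splits = cross_validation_splits(47)
print(len(splits))
```

With 5 groups, each reduced model is built on about 80% of the compounds and has to predict 9 or 10 compounds at once, which is a harder test than predicting a single left-out compound.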

 

 

After some minutes, the results of the cross-validation are presented in terms of SDEP, SDEV and Q^2.

SDEP Standard Deviation of the Error of Predictions (lower is better)

SDEV Standard Deviation of the SDEP values (lower is better)

Q^2 Cross-validated correlation coefficient (R^2) (higher is better)
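For reference, the sketch below computes SDEP and Q^2 using their commonly used definitions: SDEP as the square root of PRESS divided by the number of objects, and Q^2 as 1 minus PRESS over the total sum of squares of Y around its mean. That GOLPE follows exactly these formulas is an assumption, and the observed and predicted values are made up.

```python
import math


# PRESS: sum of squared differences between observed and predicted Y.
def press(y_observed, y_predicted):
    return sum((o - p) ** 2 for o, p in zip(y_observed, y_predicted))


def sdep(y_observed, y_predicted):
    return math.sqrt(press(y_observed, y_predicted) / len(y_observed))


def q_squared(y_observed, y_predicted):
    mean_y = sum(y_observed) / len(y_observed)
    ss_total = sum((o - mean_y) ** 2 for o in y_observed)
    return 1.0 - press(y_observed, y_predicted) / ss_total


# Toy values (made up): activities and their cross-validated predictions.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
toy_sdep = sdep(y_obs, y_pred)
toy_q2 = q_squared(y_obs, y_pred)
print(toy_sdep, toy_q2)
```

SDEV would then be the standard deviation of the SDEP values obtained over the repeated randomizations.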

 

However, it is easier to visualize these data in a plot:

 

 

These plots make clear that the predictive ability of the model is best with 4 components (highest Q^2 and lowest SDEP). The model at 4 PC has R^2 = 0.9085 and Q^2 = 0.4279.

 

Although the cross-validation results are a good guide to the dimensionality of the model, it is always advisable to inspect the PLS plots (T-U Scores plot). Useful dimensions in a PLS model can be recognized because their PLS plots show a rough correlation for most of the objects in the data set. Spurious dimensions are those that explain only individual objects. The PLS plot for the first PC is, by far, the most important plot in a PLS analysis.

 

 

The 2D plots of PLS weights and PLS loadings are also useful for understanding which variables, and how many of them, contribute to the model. Display the weights and loadings plots.

 

 

Notice that only a few variables contribute to the first two PCs; most of the variables have very small loadings and are represented in the big cloud around the center of the plot. When the X data contains more than one field, these plots can be used to highlight the contribution of each field to the overall model.

 

IMPORTANT! Please do not remove the PLS-partial weight plot from the screen. It will be used later on.

 


Variables Selection

The last plots made it obvious that most of the X variables do not contribute to the model. On the contrary, they make it difficult for the PLS algorithm to find a good fit for Y, because of the constraint imposed by the description of X. When the model contains so many X variables, it is advisable to perform variable selection.

First we will define groups of variables using Smart Region Definition (SRD), and then we will use these groups to carry out FFD (Fractional Factorial Design) variable selection.

 

 

NOTE: Use the mouse to move the slider, but use the arrow keys to make small modifications.

These settings are optimal for visualizing the groups; for practical purposes it is better to use more seeds and a smaller critical distance. The groups can be inspected using Grid plots.

 

 

This option displays in 3D space the groups of variables defined by the SRD method. In the plots, notice how the method produces groups that enclose variables bearing the same information (produced by the effect of a single "piece" of the ligands). Moreover, notice how the big groups are in regions where there is not much information, while the small groups are in the most important areas. Group 0 is special, as it contains all the variables that are not important for explaining the activity and are therefore removed from the analysis.

 

These groups will be used to perform FFD variables selection. When using groups, the FFD method evaluates the effect of a group of variables on the predictive ability of the model (evaluated by cross-validation). Groups not contributing significantly to increase the predictive ability are held out of the model.
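The way a two-level design estimates the effect of each group can be sketched as follows. Each row of the design says which groups are included (+1) or left out (-1) of a reduced model, and each row has an SDEP obtained by cross-validation. The effect of a group is the mean SDEP with the group minus the mean SDEP without it, so a negative effect means the group improves predictions. For brevity this toy example uses a full factorial design for 2 groups rather than a fractional one; the design, the SDEP values, and the function name are all made up and do not reproduce GOLPE's actual procedure.

```python
# Hypothetical helper: effect of each group of variables on SDEP,
# estimated from a two-level design (+1 = group included, -1 = excluded).
def group_effects(design, sdep_values):
    n_groups = len(design[0])
    effects = []
    for g in range(n_groups):
        with_group = [s for row, s in zip(design, sdep_values) if row[g] == 1]
        without_group = [s for row, s in zip(design, sdep_values) if row[g] == -1]
        effects.append(
            sum(with_group) / len(with_group)
            - sum(without_group) / len(without_group)
        )
    return effects


# Toy design (made up): all 4 combinations of 2 groups.
design = [
    [-1, -1],
    [1, -1],
    [-1, 1],
    [1, 1],
]
# Toy SDEP values (made up), one per design row.
sdep_values = [1.00, 0.80, 1.02, 0.82]

effects = group_effects(design, sdep_values)
print(effects)
```

In this toy case the first group lowers SDEP by about 0.20 and would be kept, while the second slightly raises it and would be a candidate for removal.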

 

 

These settings will speed up the computation; usually it is better to use a Combination/Var. ratio of 2.0 or higher.

 

IMPORTANT! This will take some time, depending on the speed of your workstation.

 

Delete the variables not selected.

 

 

Now these variables are removed from the analysis. Rebuild the PLS model and repeat the validation:

 

 

Notice the improved fitting and predictive ability of the new model. Represent again the PLS weights and compare the plot with the weights obtained before.

 

 

What is different? It is also advisable to evaluate the PLS model again.

 


Model Interpretation

The value of a 3D-QSAR model depends critically on its interpretability. GOLPE can produce several different Grid plots of the results that can help in this process. The most important one is the PLS-coefficient plot.

 

 

Contour the values using a suitable value (for instance -0.004 and +0.004).

 

Blue regions:
  A favorable (negative) interaction INCREASES activity.
  An unfavorable (positive) interaction DECREASES activity.

Yellow regions:
  A favorable (negative) interaction DECREASES activity.
  An unfavorable (positive) interaction INCREASES activity.
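These four cases follow from the sign of the product of the coefficient and the interaction energy, assuming each variable contributes coefficient times field value to the predicted activity, as in any linear model. The function name and the numeric values below are illustrative only.

```python
# Hypothetical helper: direction in which one grid variable moves the
# predicted activity, assuming contribution = coefficient * energy.
def effect_on_activity(coefficient, interaction_energy):
    contribution = coefficient * interaction_energy
    if contribution > 0:
        return "increases"
    if contribution < 0:
        return "decreases"
    return "no effect"


# Blue regions have negative coefficients, yellow regions positive ones.
blue_favorable = effect_on_activity(-0.004, -2.0)
blue_unfavorable = effect_on_activity(-0.004, 1.5)
yellow_favorable = effect_on_activity(0.004, -2.0)
yellow_unfavorable = effect_on_activity(0.004, 1.5)
print(blue_favorable, blue_unfavorable, yellow_favorable, yellow_unfavorable)
```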

 

However, most of the regions in this example are produced by negative interactions. As a rule of thumb, blue regions highlight areas where a hydrophilic group can increase activity, while yellow ones highlight areas of the receptor where hydrophilic interactions result in a decrease of the activity.

In this example the structure of the receptor is available. This structure can be used to validate the results: if the model is correct, the coefficients must highlight the residues that play the most important role in the interaction. Therefore load the structure of the receptor into the plot.

 

 

Contour the values using a suitable value (for instance -0.004 and +0.004). Examine the plot closely. Try to find an interpretation for the coefficients. Why are some of the coefficients not very close to the residues?

 

Suggestion: GOLPE contains many other plots: Predicted vs. experimental, Calculated vs. experimental, Residuals, Objects differences, etc. Do not hesitate to play with them. Look up their meaning in the manual.

 

When you have finished, please close all windows and exit from GOLPE.

 

 

We hope you have enjoyed this practice. If you have any comments, you can contact the author at mia@miasrl.com