Variable selection procedures evaluate the effect of each variable on the predictive ability of the model in order to determine which variables are relevant to the problem under study. In the first stage of the analysis, a smaller number of variables is extracted from a large amount of redundant information.
After any variable selection procedure, a list of all the unselected variables is stored on disk. From this list, the unselected variables can be removed from the data file at any time using the Pretreatment>>Delete unselected var.s command.
GOLPE includes two approaches to perform variable selection: D-optimal preselection and F. Factorial selection. See the Background section for a detailed description of both procedures.
Please note that when the number of X-variables is not too high (fewer than 1000), it is appropriate to apply the F. Factorial selection procedure immediately, while D-optimal preselection is required in most 3D-QSAR or multivariate calibration studies, where there are several thousand X-variables. In the current version it is not possible to apply F. Factorial selection to more than 4000 variables (active plus dummies).
D-optimal preselection can be performed only after a PLS model has been generated. The command opens a dialog window in which it is possible to set the following options:
Selects the dimensionality of the LV space considered by the method (see Background). It is important to choose this parameter carefully: too few components will miss some of the information contained in the model, and too many will introduce noise into the selection. The optimal dimensionality can be determined by inspecting the PLS plots for the various components. In general, the dimensionality that shows the highest q2 (lowest SDEP) in the validation would be a good choice, but our advice is to select a low number of components, say two or three, to take into account only the major effects. Moreover, when the model is validated using random groups, it is important to check the dispersion of the SDEP values obtained, reflected in the SDEV(sdep). In particular, a dimensionality showing a higher SDEP with a small SDEV(sdep) is always preferable to one showing a lower SDEP with a higher SDEV(sdep).
Selects the chemometric space in which the D-optimal preselection is performed. It is possible to choose between loading space or partial weights (W's) space. In our experience the latter choice (the default) is slightly better, but it might be appropriate to check both.
Percentage of active X-variables removed from the data file. The D-optimal preselection extracts the variables in such a way that most of the redundancy is reduced but enough collinearity is maintained among the remaining variables to satisfy the PLS algorithm. A good choice is to select not less than 50% of the active variables (the default), and to repeat the operation until the model in fitting begins to change (see Background).
When all the settings are correct, press the OK button and the D-optimal preselection will start. Press the Cancel button to abort or the Defaults button to reset all the settings to their default values.
The calculation will run in an independent window, which shows information about the progress of the selection. A few seconds after the calculation finishes, the window closes automatically, and the results can be accessed through the Pretreatment>>>Delete unselected var.s (D-optimal) command.
X-selection>>>F. Factorial selection
The dialog window is divided into three parts by horizontal lines. The upper part is identical to that shown in Modeling>>>Validate PLS model, with the only difference that the Specific Groups option is not present in the Validation mode. Please refer to the section which describes these commands for details. The lower parts are new and are described here.
FFD Selection Parameters
FFD method parameters. This box includes check boxes that define some parameters of the variable selection methodology.
Use grouping of variables. This check box is sensitive only when groups or regions have previously been generated (see Background section for details). If this option is ON, the selection will be performed on groups of variables instead of individual variables. In other words, what is evaluated is not the effect of single, isolated variables on the predictive power, but the combined effect of all the variables belonging to a given group.
Retain uncertain variables. When this option is ON the variables with an uncertain effect on the predictive ability of the model will not be removed from the data file. See Background section for further explanation about the concept of 'uncertain variables'. In our experience and in the context of 3D-QSAR, it is advisable to retain the uncertain variables in the models.
Fold-over design. Select this option to force the factorial design to "fold-over". This means that all the variable combinations (PLS models) will be repeated, inverting the pattern of signs in the combination matrix. As a result, the effect of the variables (or groups of variables) on the model predictive ability is evaluated in a much safer way, because the design contains fewer confoundings. However, the procedure takes twice as long as the standard one. The effect of fold-over on the quality of the variable selection is further discussed in the Background section.
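As a small illustration of what fold-over does to the combination matrix, the sketch below appends a sign-inverted copy of every row, which doubles the number of PLS models to be tested. The +1/-1 encoding and the function name fold_over are assumptions made for this example; they are not taken from GOLPE itself.

```python
def fold_over(design):
    """Return the design followed by its sign-inverted mirror image."""
    mirrored = [[-level for level in row] for row in design]
    return [list(row) for row in design] + mirrored


# Toy combination matrix: 4 PLS models (rows) x 3 variables (columns).
# +1 = variable included, -1 = variable excluded (assumed encoding).
design = [
    [+1, +1, +1],
    [+1, -1, -1],
    [-1, +1, -1],
    [-1, -1, +1],
]

folded = fold_over(design)
print(len(design), "models become", len(folded), "after fold-over")
```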
Combinations/variables ratio. This scale controls the number of rows of the combination matrix. The number of PLS models to be tested is the smallest power of 2 greater than the number of active and dummy variables multiplied by the value shown on this scale. So, if 500 variables are present (active variables plus dummies), a value of 2 on this scale will result in testing 1024 PLS models (2^10) and a value of 3 will result in testing 2048 PLS models (2^11).
Increasing the value on this scale produces better estimates of the effects of the variables on the model predictive ability, but also slows down the computation. The default value is 2.0.
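The size of the combination matrix can be estimated before launching a run. The following is a minimal sketch of the rule stated above; the function name n_pls_models is hypothetical, and treating "greater than" as strictly greater is an assumption about the boundary case.

```python
def n_pls_models(n_variables, ratio=2.0):
    """Smallest power of 2 strictly greater than n_variables * ratio.

    n_variables counts active variables plus dummies.
    """
    target = n_variables * ratio
    power = 1
    while power <= target:
        power *= 2
    return power


print(n_pls_models(500, 2))  # 500 * 2 = 1000, next power of 2 is 1024
print(n_pls_models(500, 3))  # 500 * 3 = 1500, next power of 2 is 2048
```

Because the count jumps between powers of 2, a small change in the ratio or in the number of dummies can double the computation time.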
Dummies. Some of the columns of the combination matrix are labelled as "dummy variables" in order to evaluate the noise level in the model. The radio buttons control the percentage of dummies to include in the combination matrix. The effect of these variables on the size of the combination matrix is described above. Our suggestion is to add a number of dummies equal to 20% of the number of active X-variables: this is a good choice in most cases.
CPU priority. Move the scale to the right to execute the calculation with a lower CPU priority (more "nice", in UNIX terminology). A lower CPU priority might be preferable when the computer is running many other jobs in the background.
The options are:
When all the settings are correct, press the OK button and the FFD selection will start. Press the Cancel button to abort or the Defaults button to reset all the settings to their default values. A few seconds after the selection starts, the program gives an estimate of the time required to complete the calculations. If the process is running in the background, its status can be inspected in the file namefile.dat.FFDlog; if the process is running in an independent window, the information will be displayed in that window. Moreover, a file named FFD.csh will be created, containing a shell script useful for running the F. Factorial selection on a different computer. See appendices for further information.
IMPORTANT: F. Factorial variable selections usually build and validate several thousand PLS models. It may take hours to complete the calculation.
Once the calculation is finished the results can be accessed using the Pretreatment>>>Delete unselected var.s (F. Factorial) command.
IMPORTANT: In order to understand all the concepts mentioned here we strongly encourage the User to look first at the Background section.
Selects the dimensionality of the parametric space in which the procedure looks for the seeds. It should be carefully chosen so that it contains enough informative components but not spurious information. It should be pointed out that often it is appropriate to select a low number of components, say two or three, to take into account only really major effects.
Selects the space in which the procedure looks for the seeds. It is possible to choose between PCA loadings, PLS loading space or partial weights (W's) space. Our advice is to use PLS partial weights.
Number of seeds
Total number of variables from which the procedure builds Voronoi polyhedra. The default is the smallest of these three numbers: 10% of the number of grid points, half of the number of active variables or 3000. The given default is usually a good choice.
In our experience, the best approach consists of defining many seeds (at least the default number), so that many polyhedra are built, and collapsing them later. If no collapsing is applied, using many groups improves the variable selection only slightly.
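The default number of seeds follows directly from the three quantities listed above. A minimal sketch, assuming integer (floor) rounding of the percentages, which the text does not specify; the function name default_n_seeds is hypothetical:

```python
def default_n_seeds(n_grid_points, n_active_variables):
    """Smallest of: 10% of grid points, half of active variables, 3000."""
    return min(n_grid_points // 10, n_active_variables // 2, 3000)


# Example: a grid of 8000 points with 5000 active variables.
print(default_n_seeds(8000, 5000))
```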
The grouping procedure builds Voronoi polyhedra around the seeds. The algorithm assigns to a given polyhedron all the variables which are closer to its seed than to any other polyhedron's seed. However, variables which are farther than the critical distance from the seed will not be assigned to this group, but to group 0, which contains the inactive variables. Note that, in this context, the distances are real Euclidean distances in the grid space around the molecules.
Setting a large critical distance here will include in the polyhedra variables that are far away from their seed and not really related to it. On the other hand, defining a very short distance will put a lot of variables in group 0 (inactive variables). The default distance (1.0) is fairly short and will include in the polyhedra only variables neighboring the seeds. This default works well when the number of seeds is high, but when the number of seeds is smaller it is better to increase the critical distance.
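The assignment rule can be pictured with a few lines of code. This is only an illustrative sketch of nearest-seed assignment with a distance cutoff, not GOLPE's implementation; the function name assign_to_seeds, the coordinates and the group numbering from 1 are assumptions.

```python
import math


def assign_to_seeds(points, seeds, critical_distance=1.0):
    """Return one group number per point: 1..len(seeds), or 0 if inactive."""
    groups = []
    for point in points:
        # Euclidean distance from this grid point to every seed.
        distances = [math.dist(point, seed) for seed in seeds]
        nearest = min(range(len(seeds)), key=distances.__getitem__)
        if distances[nearest] > critical_distance:
            groups.append(0)  # farther than the critical distance: group 0
        else:
            groups.append(nearest + 1)
    return groups


seeds = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
points = [(0.5, 0.0, 0.0), (3.5, 0.5, 0.0), (2.0, 0.0, 0.0)]
print(assign_to_seeds(points, seeds))                         # [1, 2, 0]
print(assign_to_seeds(points, seeds, critical_distance=2.5))  # [1, 2, 1]
```

With the short default distance the point halfway between the seeds falls into group 0; enlarging the critical distance pulls it into a polyhedron, which is the trade-off described above.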
Select yes to actually collapse the Voronoi polyhedra or no not to collapse them.
It should be stated that this part of the program is extremely RAM demanding. If the computer starts to use swap memory, it is better to reduce the number of seeds. Alternatively, increase the RAM of the computer.
The collapsing algorithm will try to collapse (to put together) only Voronoi polyhedra whose seeds are not farther apart than this collapsing distance. If the collapsing distance is very short, the algorithm will try to associate only neighboring polyhedra and will work very fast. On the other hand, a large distance makes even polyhedra on opposite corners of the grid eligible for collapsing. Since the collapsing algorithm is very conservative, there is nothing wrong with this, but the computation will slow down significantly.
When all the settings are correct, press the OK button and the grouping of variables will start. Press the Cancel button to abort or the Defaults button to reset all the settings to their default values. After a while, the groups will be generated and ready to be used in the F. Factorial variable selection. To inspect and visualize the groups use the command Plot>>>Grid-plot>>Groups of variables.
This menu contains commands related to the generation and handling of regions. In GOLPE the word "region" means a cubic area inside the grid cage. Regions are conceptually different from groups: in regions the variables are selected only on the basis of their position in the grid cage, whereas groups are built with a more complex procedure that seeks the best compromise between chemical and statistical information.
X-selection>>>Region Selection>>>Generate Q2-GRS model
Use this option as a first step to generate a q2-GRS-like model. See the paper by Cho and Tropsha for further details (Cho, S. J.; Tropsha, A. Cross-Validated R2-Guided Region Selection for Comparative Molecular Field Analysis: A Simple Method To Achieve Consistent Results. J. Med. Chem. 1995, 38, 1060-1066). In the current version GOLPE can generate q2-GRS-like models only for data files containing just one Y-variable.
First the User is prompted to enter the number of divisions per axis.
The total number of regions generated depends on this value (N). GOLPE will generate NxNxN regions, and every field variable will be assigned to one of these regions following a purely geometrical criterion. When there is more than one field block of variables, the variables placed in equivalent positions in each field block will be included in the same region.
Press the right arrow button to increase the number of divisions per axis or the left arrow button to decrease the number of divisions per axis. When the dialog window shows the desired number press the OK button, or press the Cancel button to abort the operation.
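The geometrical assignment of field variables to NxNxN regions can be sketched as follows. The box numbering order, the handling of points on the upper faces of the cage and the function name region_index are assumptions made for this illustration; GOLPE's own conventions may differ.

```python
def region_index(point, lower, upper, n):
    """Map a grid point to a sequential box index in 1..n**3.

    lower and upper are the opposite corners of the grid cage.
    Only the position matters, so variables from different field blocks
    located at the same grid point land in the same region.
    """
    cell = []
    for coord, lo, hi in zip(point, lower, upper):
        fraction = (coord - lo) / (hi - lo)
        # Points lying exactly on the upper face go to the last division.
        cell.append(min(int(fraction * n), n - 1))
    ix, iy, iz = cell
    return 1 + ix + n * iy + n * n * iz


cage_min, cage_max = (0.0, 0.0, 0.0), (10.0, 10.0, 10.0)
print(region_index((1.0, 1.0, 1.0), cage_min, cage_max, 2))     # 1
print(region_index((9.0, 1.0, 1.0), cage_min, cage_max, 2))     # 2
print(region_index((10.0, 10.0, 10.0), cage_min, cage_max, 2))  # 8
```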
A new dialog window then appears for entering the model validation parameters.
Remember that GOLPE will build and validate a PLS model for each of these regions. This dialog window defines the maximum dimensionality of the PLS model and how the validation is performed. The meaning of the elements in this dialog window is described in Model>>>Validate PLS model. Please refer to that section for further details.
When the OK button is pressed, the q2-GRS analysis starts. GOLPE will build NxNxN models of the given dimensionality and validate them with the given validation method and parameters. GOLPE reports the progress of the calculation in a working dialog which shows the number of boxes processed.
IMPORTANT: This calculation is very time consuming. Calculation will take many hours in most cases. Please note that GOLPE does not provide any method to run this job in background, so if the application is closed or the User logs out while in progress, the calculation will be stopped.
The results from these calculations will be shown in the main window and also stored in four files:
| filename.dat.q2 | q2 values for each box and each dimension of the PLS model. |
| filename.dat.qsdep | SDEP values for each box and each dimension of the PLS model. |
| filename.dat.qsdev | SDEV (sdep) values for each box and each dimension of the PLS model. |
| filename.dat.q2max | Optimum dimensionality and maximum value of Q2 for each box. |
The first three files have the following format:
1: -0.1480 -0.2027 -0.2338 -0.1213
2: -0.1480 0.0376 0.0455 0.0131
3: -0.1480 -0.1996 -0.3505 -53.7029
4: -0.1480 0.0298 -0.2663 0.4346
5: -0.1480 -0.2033 -0.1913 0.0565
6: -0.1480 0.0165 0.0290 0.0509
7: -0.1480 -0.2159 -0.3496 -46.6006
Each row represents a box and the first number indicates its sequential index. The values after the colon correspond to the q2, SDEP or SDEV values (depending on the file) for each dimension of the PLS models.
The fourth file (filename.dat.q2max), uses a slightly different format:
1: 3 -0.121292
2: 2 0.045453
3: 0 -0.147959
4: 3 0.434596
5: 3 0.056519
6: 3 0.050947
7: 0 -0.147959
...
Each row represents a box and the first number indicates its sequential index. The first value after the colon is the optimal dimensionality of the PLS model (the dimensionality for which the q2 value is the highest). The second value is the q2 of the PLS model at that dimensionality.
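For post-processing outside GOLPE, the relation between the two formats can be reproduced with a short script. This is a sketch under two assumptions: that each box occupies one line in the actual files, and that dimensions are counted from 0, as the example listings suggest (box 3 has optimum 0 with the q2 of its first column). The function names are hypothetical.

```python
def parse_q2_rows(text):
    """Parse rows like '4: -0.1480 0.0298 -0.2663 0.4346' into {box: [values]}."""
    boxes = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or non-data lines
        index, values = line.split(":", 1)
        boxes[int(index)] = [float(v) for v in values.split()]
    return boxes


def best_dimension(values):
    """Return (dimension, q2) for the highest q2; dimensions count from 0."""
    dim = max(range(len(values)), key=values.__getitem__)
    return dim, values[dim]


# First four rows of the q2 listing shown in this section.
sample = """\
1: -0.1480 -0.2027 -0.2338 -0.1213
2: -0.1480 0.0376 0.0455 0.0131
3: -0.1480 -0.1996 -0.3505 -53.7029
4: -0.1480 0.0298 -0.2663 0.4346
"""

for box, values in parse_q2_rows(sample).items():
    dim, q2 = best_dimension(values)
    print(f"{box}: {dim} {q2}")
```

The printed dimensions (3, 2, 0, 3) agree with the first four rows of the filename.dat.q2max listing, up to the rounding of the q2 values.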
The format of the data displayed in the main window for each box is quite similar to the format of the output after a model validation. For each box it is shown:
For further details about the information presented see Model>>>Validate PLS model.
Once this calculation has been completed, in order to generate the Q2-GRS model, the next step is putting together the boxes with highest q2 and building a new model with all those boxes. The command X-selection>>>Region Selection>>>Q2-GRS region selection can be used for this task.
X-selection>>>Region Selection>>>Q2-GRS region selection
IMPORTANT: This command can be accessed only after a q2-GRS model has been generated. See X-selection>>>Region Selection>>>Generate Q2-GRS model.
A dialog window like this will appear:
List containing the boxes selected for building the final model. For each box the list shows the box's index, the optimal dimensionality of the PLS model and the maximum q2 obtained. GOLPE automatically includes in this list all the boxes for which a q2 of 0.2 or higher was obtained. To remove a box from this list, simply click on the item. To add a box not present in this list, click on the corresponding item in the list of Unselected regions and the box will be immediately included in the Selected regions list.
List containing the boxes not selected for building the final model. For each box, the list shows: the box's index, the optimal dimensionality of the PLS model and the maximum q2 obtained.
When the Selected regions list contains all the desired boxes, press the OK button. GOLPE will remove all variables not included in these boxes in the same way it does when it deletes unselected variables. They are not actually deleted, but simply marked as inactive and ignored in further analysis. All these variables will be ignored by GOLPE until the command Pretreatment>>>Reload original variables is executed or the data file is reloaded from disk.
The selected boxes and the number of X-variables that remain active will be shown in the main window for reference.
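The automatic preselection of boxes with a q2 of 0.2 or higher can be mimicked on the contents of the q2max file, for instance to preview which boxes would start in the Selected regions list. A minimal sketch; the dictionary layout and the function name split_regions are assumptions, and the values are the first four rows of the example q2max listing.

```python
def split_regions(q2max, threshold=0.2):
    """Split {box: (dimension, q2)} into selected and unselected boxes."""
    selected = {box: v for box, v in q2max.items() if v[1] >= threshold}
    unselected = {box: v for box, v in q2max.items() if v[1] < threshold}
    return selected, unselected


q2max = {
    1: (3, -0.121292),
    2: (2, 0.045453),
    3: (0, -0.147959),
    4: (3, 0.434596),
}

selected, unselected = split_regions(q2max)
print("selected:", sorted(selected))      # [4]
print("unselected:", sorted(unselected))  # [1, 2, 3]
```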
X-selection>>>Region Selection>>>Generate regions
This command allows the User to define small cubic boxes similar to the boxes used in the q2-GRS procedure. In a later step, these regions can be used by the F. Factorial variable selection procedure in the same way as the groups of variables generated by X-selection>>>Generate groups. This procedure can be regarded as a simpler and cheaper alternative to generating groups, but in most cases grouping produces much better models. See the Background section for details.
A dialog window appears in which the User can select the number of divisions for each axis individually. GOLPE will generate XxYxZ boxes, where X, Y and Z are the numbers of divisions selected by the User for the X, Y and Z axes, respectively. The default is 5 for all three axes.
When the right number of divisions is selected, press the OK button. Press the Cancel button to abort the operation or Defaults to select 5 divisions for all the axes. The regions will be immediately generated and ready to be used in the F. Factorial variable selection. To inspect and visualize the regions use the command Plot>>>Grid-plot>>Groups of variables.