Modeling

Alt-M

 

Modeling Menu

[Generate PCA model] [PCA Rank Validation] [Generate PLS model] [Validate PLS model] [Generate CPCA model]

 

The Modeling menu contains the commands to perform Principal Component Analysis (PCA), validate the rank of PCA models, generate and validate Partial Least Squares (PLS) models, and generate Consensus Principal Component Analysis (CPCA) models.

 


 

Modeling>>>Generate PCA model

Alt-C

 

PCA is carried out on the whole X-matrix. Variables are automatically pretreated as defined in Pretreatment>>>Classic Pretreatment>>>Set-up pretreatment. If no set-up has been performed, a default pretreatment (which leaves the data unchanged) is applied to the data.

 

The number of PC's calculated for the model can be selected by the User.

 

PCA Dialog

 

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, to be safe, the number of components should always be smaller than one third of the number of objects.
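The one-third rule can be turned into a quick check before choosing the dimensionality. The following is a minimal sketch; the function name is hypothetical and is not part of GOLPE.

```python
def max_safe_components(n_objects: int) -> int:
    """Largest number of components strictly smaller than n_objects / 3."""
    # (n_objects - 1) // 3 is the largest integer k with 3 * k < n_objects.
    return max((n_objects - 1) // 3, 0)


# 24 objects, as in the sample PCA output of this section.
print(max_safe_components(24))
```

For 24 objects the rule allows at most 7 components, since 8 components would be exactly one third.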

 

 

The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog that shows the number of components processed.

 

After a while GOLPE will display in the main window the results of the PCA:

 

 

Principal Component Analysis (PCA)   24 objects     24 X-var
      components    XVarExp     XAccum
          1        59.7849     59.7849
          2        39.9664     99.7513
          3         0.1676     99.9188
          4         0.0466     99.9654
          5         0.0250     99.9905

 

 

For each component the following values are shown:

 

XVarExp Percentage of the X-matrix variance explained by this component.

 

XAccum Cumulative percentage of the X-matrix variance explained by the model, up to and including this component.
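The relationship between the two columns can be checked with a running sum. The sketch below uses the XVarExp values from the sample output above; because the printed figures are rounded to four decimals, the recomputed XAccum can differ from the listed one in the last digit.

```python
from itertools import accumulate

# XVarExp column copied from the sample PCA output above.
x_var_exp = [59.7849, 39.9664, 0.1676, 0.0466, 0.0250]

# XAccum is the running total of XVarExp.
x_accum = list(accumulate(x_var_exp))

for component, (explained, total) in enumerate(zip(x_var_exp, x_accum), start=1):
    print(f"{component:>3} {explained:10.4f} {total:10.4f}")
```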

 


 

Modeling>>>PCA Rank Validation

 

Quite often, when making a PCA, it is useful to know how many PC's are actually significant. There is no simple answer to this question, and even the definition of "significant" is open to discussion. GOLPE incorporates a crossvalidation technique for assessing the significance of successive dimensions in PCA models.

The method works by dividing the dataset randomly into G groups. Each group consists of a set of values extracted at regular intervals from the matrix, as in the following table (G = 5), where each number is the group to which that matrix element belongs:

 

1 2 3 4 5 1 2 3 4
5 1 2 3 4 5 1 2 3
4 5 1 2 3 4 5 1 2
3 4 5 1 2 3 4 5 1
2 3 4 5 1 2 3 4 5
1 2 3 4 5 1 2 3 4

 

Then the first group (G1) is taken out and a reduced model is computed, which is used to "predict" the values in the deleted group. The prediction error is measured as the sum of squares of the prediction errors (PRESS) for this reduced model. The whole procedure is repeated removing G2, G3... and accumulating all the partial prediction errors into a total PRESS. The value of this error is compared with the data sum of squares (Seps) as

R=PRESS/Seps

where Seps, for dimension a, is computed as the sum of squares of the X matrix after removing the variance explained by the previous (a-1) PC's.

The R value is calculated for every model dimensionality. When the value of R obtained is larger than 1.00, the incorporation of this PC does not improve the predictions and therefore this PC should not be included.

A detailed description of the method can be found in: S. Wold, Cross-Validatory Estimation of the Number of Components in Factor and Principal Component Models, Technometrics 20, 397-405 (1978).
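The group layout shown in the table above is consistent with numbering the matrix elements row by row and assigning element k to group (k mod G) + 1. This is an inference from the table, not a documented GOLPE rule, and the function name below is hypothetical.

```python
def group_pattern(n_rows: int, n_cols: int, n_groups: int) -> list[list[int]]:
    """Assign matrix element (i, j) to group ((i * n_cols + j) mod n_groups) + 1."""
    return [
        [(i * n_cols + j) % n_groups + 1 for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Reproduce the 6 x 9 layout with 5 groups shown in the table.
for row in group_pattern(6, 9, 5):
    print(" ".join(str(g) for g in row))
```

With this layout every row and every column contributes values to several groups, so no object or variable is removed completely in a single deletion step.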

 

 

This command can be accessed only after a PCA model has been generated. It opens a dialog like this:

 

PCA Rank Validation Dialog

 

Max. dimensionality

Maximum dimensionality of the PCA model that will be validated.

 

Validation Groups

Number of validation groups used in the procedure. It can be set between 4 and 7, but avoid values that are exact divisors of the number of variables.
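The admissible values can be listed with a short helper. This is a sketch only; the function name is hypothetical. A plausible reason for the restriction, assuming the values are dealt into groups row by row as the table above suggests, is that a group count dividing the number of variables would put each variable entirely into one group.

```python
def allowed_group_counts(n_variables: int, low: int = 4, high: int = 7) -> list[int]:
    """Group counts in [low, high] that are not exact divisors of n_variables."""
    return [g for g in range(low, high + 1) if n_variables % g != 0]


# 24 X-variables, as in the sample PCA output of this section.
print(allowed_group_counts(24))  # [5, 7]
```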

 

Press OK to start the validation or Cancel to abort it. The Defaults button will load the pre-set values. The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog that shows the percentage of the work completed.

 

After a while GOLPE will display in the main window the results of the PCA validation:

 

 

PCA Rank Validation - using 5 random groups
      components     PRESS         Seps           R
          1         6.1742e+05    8.3696e+05      0.7377
          2         4.5163e+05    6.0483e+05      0.7467
          3         3.5871e+05    4.4333e+05      0.8091
          4         2.9929e+05    3.4783e+05      0.8604
          5         2.6812e+05    2.9408e+05      0.9117

 

PRESS Sum of squares of the errors of the PCA predictions, computed as explained above.

 

Seps Sum of squares of the data computed as explained above.

 

R Ratio PRESS/Seps. A component is considered significant when R<1.0

 

However, the validity of the method is relative. If the User is making the PCA mainly to visualize the data, an obvious limit to the complexity of the model is 3 PC's, since more complex models would be difficult to represent. Moreover, datasets containing outliers and/or clusters of objects can produce misleading results. Therefore our advice is to apply common sense and take the results of this test only as a hint for selecting the right dimensionality of the model.
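The R column of the sample output above can be recomputed from the PRESS and Seps columns. In the sketch below, stopping at the first component with R >= 1.0 is an assumption about how the criterion would be applied in practice; the manual only states the per-component rule.

```python
# (PRESS, Seps) pairs copied from the sample validation output above.
rows = [
    (6.1742e05, 8.3696e05),
    (4.5163e05, 6.0483e05),
    (3.5871e05, 4.4333e05),
    (2.9929e05, 3.4783e05),
    (2.6812e05, 2.9408e05),
]

# R = PRESS / Seps for every model dimensionality.
ratios = [press / seps for press, seps in rows]

# Count the leading components whose ratio stays below 1.0
# (assumed stopping rule: stop at the first component that fails).
n_significant = 0
for r in ratios:
    if r >= 1.0:
        break
    n_significant += 1

for component, r in enumerate(ratios, start=1):
    print(f"{component:>3} {r:8.4f}")
print("significant components:", n_significant)
```

In this example all five components have R below 1.0, although R rises steadily, which illustrates why the result should be read as a hint rather than a verdict.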


 

Modeling>>>Generate PLS model

Alt-G

 

This command generates the PLS model in fitting, i.e. all available objects (molecules) are used to build the model. The item in the menu is insensitive (greyed out) when the data file does not contain Y-variables.

 

GOLPE will use the whole X-matrix and all the variables defined as Y's. Variables are automatically pretreated as defined in Pretreatment>>>Classic Pretreatment>>>Set-up pretreatment. If no set-up has been performed, a default pretreatment (which leaves the data unchanged) is applied to the data.

 

The number of PC's calculated for the model can be selected by the User.

 

PLS Dialog

 

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, to be safe, the number of components should always be smaller than one third of the number of objects.

 

 

The calculation will take from a few seconds to several minutes, depending on the number of X and Y-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog window that shows the number of components processed.

 

After a while GOLPE will display in the main window the results of the PLS:

 

Partial Least Squares        (PLS)   15 objects     449 X-var     1 Y-var
Y1    components    XVarExp     XAccum      SDEC       r2
          0         0.0000      0.0000     1.0675     0.0000
          1        18.7309     18.7309     0.5703     0.7146
          2        12.7664     31.4973     0.4179     0.8468
          3        19.7530     51.2503     0.3586     0.8871
          4        10.4417     61.6920     0.3052     0.9183
          5        14.4762     76.1682     0.2760     0.9331

 

 

For each component the following values are shown:

 

XVarExp Percentage of the X-matrix variance explained by this component.

 

XAccum Cumulative percentage of the X-matrix variance explained by the model, up to and including this component.

 

SDEC Standard Deviation of Error of Calculations.

 

r2 Squared Correlation coefficient.

 

Y : Experimental value

Y' : Value calculated by the model

Ȳ : Average value

N : Number of objects
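With the symbols listed above, the two statistics are commonly defined as SDEC = sqrt( sum (Y - Y')^2 / N ) and r2 = 1 - sum (Y - Y')^2 / sum (Y - Ȳ)^2. These definitions are assumed here rather than quoted from GOLPE, but they agree with the sample output: they imply r2 = 1 - (SDEC / SDEC at zero components)^2, which reproduces the r2 column to within rounding. The function names are hypothetical.

```python
import math


def sdec(y, y_calc):
    """Standard Deviation of Error of Calculations (assumed definition)."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_calc)) / n)


def r2(y, y_calc):
    """Squared correlation coefficient as 1 - SS_residual / SS_total (assumed)."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_calc))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Consistency check against the SDEC column of the sample output above.
sdec_column = [1.0675, 0.5703, 0.4179, 0.3586, 0.3052, 0.2760]
r2_from_sdec = [1 - (s / sdec_column[0]) ** 2 for s in sdec_column]

for component, value in enumerate(r2_from_sdec):
    print(f"{component:>3} {value:8.4f}")
```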


 

Modeling>>>Validate PLS model

Alt-V

 

The validation of PLS models is one of the most important features of GOLPE. This command can be accessed only after a PLS model in fitting has been generated.

 

Model Validation Dialog

 

Max. dimensionality

Selects the maximum dimensionality of the PLS model to validate. The optimal dimensionality of the model may be less than or equal to this maximum.

 

Validation mode

Selects the crossvalidation method used to validate the model. It is possible to choose between:

 

 

Only with this last option, immediately after pressing the OK button, the User will be prompted to define the groups in a dialog window like this:

 

Groups Dialog

 

The User should proceed as follows:

 

 

When all the objects have been assigned to a group, press the OK button to proceed with the validation, or Cancel to abort it.

 

Num. of SDEP

This scale is sensitive (enabled) only when the Random Groups option is selected. The number shown in the scale is the number of times the whole validation procedure will be repeated, as stated above. The default is 20.

 

Number of groups

This control is sensitive only when the Random Groups or Specific Groups option is selected. It specifies the number of groups into which the objects in the data file will be split. We suggest using 5 groups when the number of objects is 20 or larger, and fewer groups when the number of objects is smaller.
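A random split of the objects into groups can be sketched as follows. The function name is hypothetical, and balancing the group sizes is an assumption; the manual does not say how GOLPE draws its random groups.

```python
import random


def random_groups(n_objects: int, n_groups: int, seed=None) -> list[int]:
    """Return a group label (1..n_groups) for each object, in random order."""
    rng = random.Random(seed)
    # Deal labels round-robin so group sizes differ by at most one, then shuffle.
    labels = [i % n_groups + 1 for i in range(n_objects)]
    rng.shuffle(labels)
    return labels


# 20 objects split into the suggested 5 groups.
labels = random_groups(20, 5, seed=1)
print(labels)
```

Repeating the split with different seeds gives different assignments, which is why the Random Groups option repeats the whole validation several times.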

 

Recalculate weights

Selecting yes will force GOLPE to recalculate the variable weights in each computation. The results are more reliable and stable although the computation is slightly slower.

 

 

When all the settings are correct, press the OK button to start the computation. Press the Cancel button to abort the validation, or the Defaults button to reset all the settings in this dialog window to their default values. Remember that, when the validation uses selected groups, a new dialog window will appear to define the groups.

 

The calculation can take several minutes, depending on the number of X and Y-variables, the number of objects and, mainly, the validation procedure chosen. Random Groups is the most time-consuming procedure, and its duration also depends on the Num. of SDEP defined. GOLPE reports the progress of the validation in a working dialog that shows the percentage of the calculation completed.

After a while GOLPE will display in the main window the results of the PLS validation:

 

 

PLS Model Validation - 5 Random Groups   20 SDEP-calc
Y1    components    SDEP        SDEV(sdep)    q2
          0        1.1599      0.0417       -0.1807
          1        0.9637      0.0592        0.1850
          2        0.9217      0.0888        0.2544
          3        0.8738      0.1087        0.3300
          4        0.8607      0.0933        0.3498
          5        0.8639      0.0732        0.3451

 

For each component it is shown:

 

SDEP Standard Deviation of Error of Predictions.

 

SDEV(sdep) Standard Deviation of SDEP

 

q2 Squared Predictive correlation coefficient

 

Y : Experimental value

Y' : Predicted value

Ȳ : Average value

N : Number of objects
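If SDEP and q2 are defined like SDEC and r2 but with predicted instead of calculated values, that is SDEP = sqrt( sum (Y - Y')^2 / N ) and q2 = 1 - sum (Y - Y')^2 / sum (Y - Ȳ)^2, then q2 = 1 - (SDEP / SDEC at zero components)^2. These definitions are an assumption, as is the idea that the fitting and validation sample outputs come from the same data set, but the numbers agree to within rounding. Picking the dimensionality with the highest q2 is a common convention, shown here only as an illustration.

```python
# SDEP column copied from the sample validation output above.
sdep_column = [1.1599, 0.9637, 0.9217, 0.8738, 0.8607, 0.8639]

# SDEC at zero components, copied from the sample PLS fitting output.
sdec_zero = 1.0675

# q2 recomputed from SDEP under the assumed definitions.
q2_from_sdep = [1 - (s / sdec_zero) ** 2 for s in sdep_column]

# Assumed selection rule: the dimensionality with the highest q2.
best = max(range(len(q2_from_sdep)), key=q2_from_sdep.__getitem__)

for component, value in enumerate(q2_from_sdep):
    print(f"{component:>3} {value:8.4f}")
print("highest q2 at", best, "components")
```

Note that q2 is negative at zero components: predicting each left-out object with the mean of the remaining ones is slightly worse than using the overall mean.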


 

Modeling>>>Generate CPCA model

 

When the data file contains at least two X blocks, GOLPE can generate a Consensus Principal Component Analysis (CPCA) model. Please refer to the background section for information about the particular implementation of CPCA in GOLPE.

CPCA is carried out on the whole X-matrix. Variables are automatically pretreated as defined in Pretreatment>>>Classic Pretreatment>>>Set-up pretreatment. If no set-up has been performed, a default pretreatment (which leaves the data unchanged) is applied to the data.

 

The number of PC's calculated for the CPCA model can be selected by the User.

 

CPCA Dialog

 

Press the right arrow button to increase the number of components or the left arrow button to decrease it. When the desired model dimensionality appears in the dialog window, press the OK button, or press the Cancel button to abort the operation. Remember that, to be safe, the number of components should always be smaller than one third of the number of objects.

 

 

The calculation will take from a few seconds to several minutes, depending on the number of X-variables and the number of objects in the data file. GOLPE reports the progress in a working dialog that shows the number of components processed.

 

After a while GOLPE will display in the main window the results of the CPCA:

 

Consensus Principal Component Analysis (CPCA)   14 objects     9118 X-var

   block       var         act          %SS
     1       12144         260         16.7
     2       12144        1584         16.7
     3       12144        1784         16.7
     4       12144        1826         16.7
     5       12144        1967         16.7
     6       12144        1697         16.7

 

For each block, the output shows the number of variables (var), the number of active variables (act) and the percentage of the total sum of squares accounted for by this block (%SS). Then a summary of the analysis is presented:

 

    comp    XVarExp    XAccum       XAccum[1]  XAccum[2]  XAccum[3]  XAccum[4]  XAccum[5]  XAccum[6]
     1     27.5555    27.5555       10.2330    26.4529    33.1077    33.3900    34.0595    32.2891
     2     24.6665    52.2220       39.7205    53.6389    54.2596    54.9317    55.7070    57.0575
     3      5.7121    57.9341       49.3022    59.8720    58.6231    59.9664    60.2219    62.4670
     4      5.6438    63.5779       57.9962    65.0712    64.0167    65.0680    65.0620    65.7364
     5      4.8046    68.3825       62.1375    70.2911    68.5776    69.6260    70.8902    69.0693

 

For each principal component extracted, the summary shows information mainly about the "superblock level" of the CPCA model. This information is similar to that obtained for a regular PCA model.

XVarExp Percentage of the X-matrix variance explained by this component.

 

XAccum Cumulative percentage of the X-matrix variance explained by the model.

 

XAccum[i] Cumulative percentage of the block i variance explained by the local model (using block scores and block loadings).
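Two relationships in the block table of the sample output above can be checked directly: the X-var count in the header equals the sum of the active variables over all blocks, and the %SS figures equal an even share of 100 among the six blocks. Reading the even share as the result of scaling the blocks to the same sum of squares is an inference, not something the output states.

```python
# (block, var, act) triples copied from the sample CPCA output above.
blocks = [
    (1, 12144, 260),
    (2, 12144, 1584),
    (3, 12144, 1784),
    (4, 12144, 1826),
    (5, 12144, 1967),
    (6, 12144, 1697),
]

# Total number of active variables across all blocks.
total_active = sum(act for _, _, act in blocks)

# Even share of the total sum of squares, in percent.
equal_share = 100 / len(blocks)

print(total_active)            # 9118, the X-var count in the header
print(f"{equal_share:.1f}")    # 16.7, the %SS listed for every block
```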

 

Then, for each block, some further information about the block level of the CPCA model is presented. This information is obtained using the block scores and block loadings, and the percentages refer to the block variance and not to the overall X variance. Refer to the background section for a discussion of the meaning of these figures.

 

   Block [1], 12144 X-var    260 Active    16.7 %SS

   comp    XVarExp[1] XAccum[1]   
     1     10.2330    10.2330   
     2     29.4874    39.7205   
     3      9.5817    49.3022   
     4      8.6940    57.9962   
     5      4.1413    62.1375   

    ...