Cross validation in discriminant function analysis

Overview

Cross validation is the process of testing a model on more than one sample. This technique is often undertaken to assess the reliability and generalisability of the findings. Cross validation can be executed in the context of factor analyses, discriminant function analyses, multiple regression, and so forth. This process is particularly crucial in discriminant function analysis, because the solutions are often unreliable. This document describes how SPSS can be utilised to cross validate the output derived from discriminant function analysis.

Purpose of discriminant function analysis

This document assumes the reader possesses some knowledge of discriminant function analysis. Readers without this background will still be able to understand the crux of this document, provided they appreciate the objectives of discriminant function analysis. In particular, discriminant function analysis is used to ascertain whether or not a set of variables differs significantly across groups and to construct equations that can be utilised to classify people into those groups.

For example, a researcher explores whether or not extroversion, neuroticism, and self esteem differ amongst Christians, Muslims, and Sikhs. To this end, the researcher assesses the level of extroversion, neuroticism, and self esteem in 50 individuals of each religion. The data are subjected to discriminant function analysis, which reveals that personality differs across these religions. In addition, discriminant function analysis creates a set of equations that can be used to predict a person's religion from their personality profile.

Importance of cross validation in discriminant function analysis

Discriminant function analysis provides a wealth of output, including

• Discriminant loadings, which effectively reflect the extent to which each predictor differentiates the groups
• Fisher coefficients, which are utilised to create the equations that classify new individuals into groups.

Unfortunately, this output may not be stable or generalisable. The following scenario illustrates the notion of stability and generalisability in this context. Suppose you subdivided your sample into two subsamples: males and females. You then subjected each subsample to a separate discriminant function analysis. Assume that males and females provide similar findings. In this case, the output is both reliable (that is, stable) and externally valid (that is, generalisable across genders).

In contrast, suppose you discover that males and females provide disparate findings. For instance, only extroversion differentiates the religions in males, whereas only neuroticism differentiates the groups in females. Accordingly, the output may be unreliable or not externally valid.

Selecting the groups

Thus far, the function of cross validation in discriminant function analysis has been described. The remainder of this document outlines how to undertake cross validation through SPSS.

The first step is to divide the original sample into two subsamples. Two approaches have been distinguished. The first approach divides the sample in a random fashion. When each subsample yields consistent information, the findings are deemed to be reliable but not necessarily generalisable.

The second approach divides the sample according to some crucial variable. For instance, you may divide the sample into males and females, staff and managers, novices and experts, and so forth. When each subsample yields consistent information, the findings are deemed to be reliable and generalisable across that variable. This approach is adopted when the researcher suspects the findings may not apply to all levels of some variable.

Before undertaking the discriminant function analysis, you must distinguish the two subsamples in the SPSS data sheet. In particular, you should

• Create a new column in the data sheet. Label this column with a sensible name (e.g. "subgroup").
• In this column, assign a '1' to those participants who pertain to the first group and a '2' to the remaining participants.

The individuals assigned a '1' are called the analysis sample. This sample is utilised to derive the discriminant functions, coefficients, and loadings. The individuals assigned a '2' are called the holdout sample. The researcher must determine whether or not the discriminant functions, which emerged from the analysis sample, correctly classify the individuals in the holdout sample. A high percentage of correct classifications validates the findings.
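
When the sample is divided in a random fashion, the codes for the "subgroup" column can be generated outside SPSS and pasted in. The following is a minimal sketch in Python, not part of the SPSS procedure itself. The function name assign_subgroups, the 50% holdout fraction, and the seed are illustrative assumptions; the 150 cases match the example of 50 individuals in each of three religions.

```python
import random

def assign_subgroups(n_cases, holdout_fraction=0.5, seed=42):
    """Return one code per case: 1 = analysis sample, 2 = holdout sample.

    Hypothetical helper: the name, default fraction, and seed are assumptions.
    """
    n_holdout = round(n_cases * holdout_fraction)
    labels = [2] * n_holdout + [1] * (n_cases - n_holdout)
    rng = random.Random(seed)  # fixed seed so the same split can be reproduced
    rng.shuffle(labels)        # random order means random assignment to cases
    return labels

# 150 participants, half assigned to each subsample
subgroup = assign_subgroups(150, holdout_fraction=0.5)
print(subgroup.count(1), subgroup.count(2))
```

Each value in the resulting list corresponds to one row of the data sheet, in order.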

Execution of the analysis

The previous section described the process of assigning participants to subsamples. In addition, the function of each subsample was defined. To execute this cross validation process, you should

• Undertake the same steps as a conventional discriminant function analysis but do not press 'OK'. Press 'Paste'. This process creates a syntax file that represents the instructions for this procedure.
• Prior to the line '/ANALYSIS ...', type '/SELECT = subgroup(1)'. This subcommand instructs SPSS to apply discriminant function analysis to the individuals in which subgroup = 1.
• Just before the full stop at the end, type '/CLASSIFY = unselected'. Remove any other lines which begin with '/CLASSIFY'. This subcommand instructs SPSS to ascertain the extent to which the estimates derived from the first subsample apply to the second subsample.
• In the line that begins with '/STATISTICS', add the word 'TABLE' at the end. Alternatively, add a line '/STATISTICS TABLE' if no line begins with '/STATISTICS'.

Execute this syntax by highlighting the text and then pressing the 'Play' button or using the 'Run' menu.

Interpreting the output

The output will first provide information about the functions. This information is provided in a series of tables called 'Canonical discriminant functions', 'Standardized.. coefficients', 'Structure matrix', and 'Group centroids'. All of this information was derived from the first or analysis sample.

The output will then provide information about the second or holdout sample. This information is provided in a table called 'Classification results for cases not selected'. In particular, SPSS uses information from the analysis sample to classify individuals in the holdout sample.

When the output derived from the analysis sample is stable and, if applicable, generalisable, most of the individuals should be correctly classified. For instance, the percentage of cases correctly classified may exceed 80%. When the output derived from the analysis sample is unstable or not generalisable, many of the individuals will not be correctly classified. For instance, the percentage of cases correctly classified may be only 40%.
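
The percentage of cases correctly classified is the sum of the diagonal of the classification table divided by the number of holdout cases. The sketch below illustrates this arithmetic in Python. The counts are invented for illustration and the function name percent_correct is an assumption; SPSS reports this percentage itself beneath the table.

```python
def percent_correct(table):
    """table[i][j] = number of holdout cases in actual group i
    that were classified into predicted group j."""
    correct = sum(table[i][i] for i in range(len(table)))  # diagonal cells
    total = sum(sum(row) for row in table)                  # all holdout cases
    return 100.0 * correct / total

# Hypothetical counts: three groups, 25 holdout cases in each
table = [
    [21, 2, 2],
    [3, 20, 2],
    [2, 4, 19],
]
print(percent_correct(table))  # 80.0
```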

Press's Q test

To reiterate, SPSS derives the discriminant functions and so forth from the first or analysis sample. This output is then used to classify individuals in the second or holdout sample. The percentage of cases that are correctly classified reflects the degree to which the samples yield consistent information.

The question, then, is what proportion of cases should be correctly classified. This issue is more complex than many researchers acknowledge. To illustrate this complexity, suppose that 75% of individuals are Christian, 15% are Muslim, and 10% are Sikh. Even without any information, you could thus correctly classify 75% of all individuals by simply designating them all as Christian. In other words, the percentage of correctly classified cases should exceed 75%.

Nonetheless, a percentage of 76% does not in itself indicate that classification is significantly better than chance. To establish this form of significance, you should invoke Press's Q statistic. In particular, compute this statistic, which equals

Q = [NH - (pc x NH x k)]^2 / [NH x (k - 1)],

where NH denotes the number of individuals in the holdout sample, pc denotes the proportion of cases correctly classified (ranging from 0 to 1), and k denotes the number of groups.

Compare Q with the critical value, which equals the chi-square value at 1 degree of freedom. With alpha set to the conventional 0.05, this critical value is 3.84. When Q exceeds this critical value, classification can be regarded as significantly better than chance, thereby supporting cross validation.
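
The calculation of Press's Q can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SPSS output: the function name press_q and the example figures (75 holdout cases, 80% correctly classified, 3 groups) are hypothetical, and the constant 3.841 is the chi-square critical value at 1 degree of freedom for alpha = 0.05.

```python
# Chi-square critical value, 1 degree of freedom, alpha = 0.05
CHI2_CRIT_1DF_05 = 3.841

def press_q(n_holdout, prop_correct, n_groups):
    """Press's Q = [NH - (pc x NH x k)]^2 / [NH x (k - 1)]."""
    numerator = (n_holdout - prop_correct * n_holdout * n_groups) ** 2
    denominator = n_holdout * (n_groups - 1)
    return numerator / denominator

# Hypothetical holdout sample: 75 cases, 80% correct, 3 groups
q = press_q(n_holdout=75, prop_correct=0.80, n_groups=3)
better_than_chance = q > CHI2_CRIT_1DF_05
print(round(q, 2), better_than_chance)
```

In this sketch, Q equals 0 when the proportion correctly classified is exactly 1/k, and grows as the proportion departs from that value.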

Size of each subsample

To reiterate, cross validation entails the formation of two subsamples. When the subsamples are formed in a random fashion, the question of sample size needs to be resolved. Some researchers assign 50% of the participants to the analysis sample and 50% of the participants to the holdout sample. Other researchers assign 75% of the participants to the analysis sample and 25% of the participants to the holdout sample. Regardless of your choice, ensure that

• The analysis sample comprises enough individuals to yield reasonable estimates (e.g., 15 to 20 times the number of predictors).
• The holdout sample comprises enough individuals to ensure the assumptions that underlie Press's Q statistic are upheld (e.g., 10 times the number of groups).
• The holdout sample comprises enough individuals to ensure that Press's Q is sufficiently powerful.

To ensure these conditions are satisfied, many researchers first multiply the number of groups by 15. Then, they ensure the holdout sample comprises this number of cases. The remaining cases are assigned to the analysis sample.
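
This rule of thumb can be checked with simple arithmetic, sketched below in Python. The function name split_sizes and its parameters are hypothetical. The check on the analysis sample uses the lower bound of 15 cases per predictor mentioned above, and the example figures (150 cases, 3 groups, 3 predictors) follow the religion example.

```python
def split_sizes(n_total, n_groups, n_predictors,
                holdout_per_group=15, analysis_per_predictor=15):
    """Return (holdout size, analysis size, whether the analysis sample
    reaches the assumed minimum of cases per predictor)."""
    n_holdout = n_groups * holdout_per_group   # e.g. 3 groups x 15 = 45
    n_analysis = n_total - n_holdout           # remaining cases
    enough_for_estimates = n_analysis >= analysis_per_predictor * n_predictors
    return n_holdout, n_analysis, enough_for_estimates

print(split_sizes(n_total=150, n_groups=3, n_predictors=3))  # (45, 105, True)
```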

References

Hair Jr, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1995). Multivariate data analysis with readings. Englewood Cliffs, NJ: Prentice Hall.
