# Two-stage least squares

### Author: Dr Simon Moss

Recall that multiple regression assumes that:

Dependent variable = B0 + B1 x Predictor1 + B2 x Predictor2 + ... + Error

where the B values represent constants. The 'Error' term, sometimes called residuals, represents the unknown determinants of the dependent variable. For instance, suppose that self esteem was the dependent variable and that age, gender, and height were the predictors or independent variables. In this case, the error term may reflect IQ, extroversion, and all the other determinants of self esteem.

Multiple regression assumes the error term is uncorrelated with the predictors. In practice, this assumption is often violated. In the previous example, for instance, the error term probably encompasses extroversion, which may correlate with some of the predictors, such as gender and height.

This violation is immaterial when the error term is not highly related to the dependent variable. In this example, the multiple regression is still valid provided that extroversion is not a vital determinant of self esteem. Likewise, this violation is inconsequential when the error term is not appreciably related to any predictors. Again, in this example, the multiple regression is still valid provided that extroversion does not vary appreciably with age, gender, or height.

Nonetheless, in some cases, the error term is considerably related to the dependent variable and one, or even more, predictor variables. Under these circumstances, the estimates of these B values are biased, and hence the conclusions derived from the multiple regression are suspect.
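The bias described above can be illustrated with a small simulation. The following is a minimal sketch on made-up data, not an analysis of any real sample: the variable names, the coefficients (0.5, 0.6, 0.8), and the sample size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: extroversion is an unmeasured determinant of
# self esteem that also correlates with a predictor (height).
extroversion = rng.normal(size=n)
height = 0.6 * extroversion + rng.normal(size=n)

# Assumed true model: the B value for height is 0.5.
self_esteem = 1.0 + 0.5 * height + 0.8 * extroversion + rng.normal(size=n)


def ols(y, *predictors):
    """Return least-squares coefficients [B0, B1, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Extroversion omitted: it is absorbed into the error term.
b_omitted = ols(self_esteem, height)
# Extroversion measured and entered as a predictor.
b_full = ols(self_esteem, height, extroversion)

print("B for height, extroversion omitted: ", round(b_omitted[1], 3))
print("B for height, extroversion included:", round(b_full[1], 3))
```

Under these assumptions, the estimate with extroversion omitted drifts well above the true value of 0.5 (toward roughly 0.85 in large samples), whereas the estimate with extroversion included stays close to 0.5.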

#### Ensuring the assumption is satisfied

To ensure that violations of this assumption are trivial, some researchers attempt to minimize the error term. That is, they measure and utilize as many predictors as possible. When many predictors are incorporated into the equation, the error term tends to diminish, and the consequences of violating this assumption are tempered.

This strategy, however, creates two problems. First, more variables need to be measured, which is costly and inconvenient. Second, when many variables are entered into the equation, the statistical power associated with each predictor is attenuated.

To circumvent these shortcomings, the researcher needs to be more selective. In particular, they should not simply enter all the variables that could possibly influence the dependent variable. Instead, researchers should only enter variables that satisfy either of these criteria:

• Predictors that are relevant to the research objectives
• Variables that correlate appreciably with both the dependent variable and one, or more, of these predictors.
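The second criterion can be checked informally before a variable is entered. The sketch below is one possible screening helper, not an established procedure: the function name `correlates_with_both`, the cutoff of 0.3 for "appreciably", and the simulated data are all assumptions.

```python
import numpy as np


def correlates_with_both(candidate, dependent, predictors, threshold=0.3):
    """True if the candidate correlates (in absolute value) at or above the
    threshold with the dependent variable and with at least one predictor.
    The default threshold of 0.3 is an arbitrary, illustrative choice."""
    r_dep = abs(np.corrcoef(candidate, dependent)[0, 1])
    r_pred = max(abs(np.corrcoef(candidate, p)[0, 1]) for p in predictors)
    return bool(r_dep >= threshold and r_pred >= threshold)


# Hypothetical data for illustration only.
rng = np.random.default_rng(2)
n = 5_000
extroversion = rng.normal(size=n)
height = 0.6 * extroversion + rng.normal(size=n)
self_esteem = 0.5 * height + 0.8 * extroversion + rng.normal(size=n)
shoe_size = rng.normal(size=n)  # unrelated to everything else

keep_extroversion = correlates_with_both(extroversion, self_esteem, [height])
keep_shoe_size = correlates_with_both(shoe_size, self_esteem, [height])
print(keep_extroversion, keep_shoe_size)
```

In this made-up example, extroversion meets the criterion and the unrelated variable does not. In practice, the cutoff would be a matter of judgment rather than a fixed rule.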

### Two-stage least squares regression

The previous discussion demonstrated how to ensure that error terms are virtually uncorrelated with the predictors. Nonetheless, sometimes this undesirable correlation cannot be avoided, especially when predictors that are potentially vital cannot be measured. The issue, then, is how to minimize the impact of this undesirable correlation.

A technique, called two-stage least squares, has been devised to minimize the deleterious impact of this violation. If you have access to SPSS, you should complete the following steps.

• First, identify variables that you did not measure but probably influence the dependent variable, such as extroversion and IQ in the previous example.
• Second, identify the predictors (that is, variables you measured) that probably do not correlate appreciably with any of these unmeasured variables. These predictors are called 'Instrumental variables' (see Angrist & Krueger, 2001).
• Third, identify the predictors that probably correlate appreciably with one or more of these unmeasured variables. These predictors are called 'Explanatory variables'.
• Fourth, ensure the number of explanatory variables does not exceed the number of instrumental variables.
• Finally, in SPSS, select 'Analyze', 'Regression', and 'Two stage least squares'. Specify the dependent, instrumental, and explanatory variables in the appropriate boxes, and then press OK.

The output is reasonably simple to follow provided you understand multiple regression analysis. The last table presents B values for the explanatory variables only.

#### Rationale underlying two-stage least squares

To ensure the error term does not correlate with the predictors, two-stage least squares invokes a simple rationale. In particular, each predictor can be conceptualized as comprising two components. The first component is correlated with the error term, and the second component is not correlated with the error term. SPSS attempts to extricate the uncorrelated component. In other words, only the component that is not correlated with the error term is entered into the analysis.

The question, however, is how SPSS extricates this component. Essentially, SPSS undertakes a preliminary set of multiple regression analyses. These analyses predict the explanatory variables from the instrumental variables. In other words, for each explanatory variable, SPSS estimates the B values in this equation:

Explanatory variable = B0 + B1 x Instrumental1 + B2 x Instrumental2 + ... + Error

These equations can then be used to predict the explanatory variables from the instrumental variables. These predicted explanatory variables are exclusively a function of the instrumental variables which, in turn, are uncorrelated with the error term. Accordingly, the predicted explanatory variables are also uncorrelated with the error term, and hence the problem is solved (for a broader discussion, see Angrist & Krueger, 2001).
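The two stages can be reproduced by hand to see the rationale at work. The following is a minimal sketch on simulated data with one instrumental variable and one explanatory variable; it is not the SPSS implementation, and the names (`u`, `z`, `x`, `y`), coefficients, and sample size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

u = rng.normal(size=n)  # unmeasured determinant, part of the error term
z = rng.normal(size=n)  # instrumental variable: unrelated to u
x = 0.7 * z + 0.6 * u + rng.normal(size=n)        # explanatory variable
y = 1.0 + 0.5 * x + 0.8 * u + rng.normal(size=n)  # assumed true B for x: 0.5


def ols(y, *predictors):
    """Return least-squares coefficients [B0, B1, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Stage 1: predict the explanatory variable from the instrumental variable.
a = ols(x, z)
x_hat = a[0] + a[1] * z  # the component of x that is unrelated to u

# Stage 2: regress the dependent variable on the predicted values.
b_2sls = ols(y, x_hat)

# Ordinary multiple regression, for comparison.
b_ols = ols(y, x)

print("Ordinary regression B for x:", round(b_ols[1], 3))
print("Two-stage B for x:          ", round(b_2sls[1], 3))
```

Under these assumptions, the ordinary estimate is pulled away from 0.5 because `x` shares variance with the unmeasured `u`, whereas the two-stage estimate should land close to 0.5. Note that the standard errors from a hand-run second stage are not correct, because the residuals are computed from the predicted rather than the observed explanatory variable; dedicated two-stage procedures adjust for this.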

### References

Angrist, J., & Krueger, A. (2001). Instrumental variables and the search for identification: From supply and demand to natural experiments. Journal of Economic Perspectives, 15, 69-85.