Expectation maximization is an effective technique that is often used in data analysis to manage missing data (for further discussion, see Schafer, 1997; Schafer & Olsen, 1998). Indeed, expectation maximization overcomes some of the limitations of other techniques, such as mean substitution or regression substitution. These alternative techniques generate biased estimates and, specifically, underestimate the standard errors; expectation maximization overcomes this problem.
Many statistical packages, such as SPSS, can now implement expectation maximization.
To illustrate expectation maximization, consider the following extract of data. Missing values are observed for depression, age, and height.
ID     depression    age    height    wage
1      5             32     .         32,010
2      .             17     173       31,600
3      7             .      169       48,020
4      5             24     186       17,400
...
100    4             45     201       7,800

(. = missing value)
To undertake expectation maximization, a software package such as SPSS executes the following steps. First, the means, variances, and covariances are estimated from the individuals whose data are complete. In particular, the package would generate the following information:
             depression    age          height      wage
depression   3.55
age          7.42          9.43
height       184.42        1643.32      194.43
wage         43042.345     143254.43    14425.54    14403.12
Mean         4.71          37.50        183.21      45504.43
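As a minimal sketch of this first step, the following fragment estimates the means, variances, and covariances from the complete cases only. The data are a small hypothetical extract mirroring the table above, with NumPy's nan marking a missing value:

```python
import numpy as np

# Hypothetical extract: rows are people; columns are
# depression, age, height, wage. np.nan marks a missing value.
data = np.array([
    [5.0,    32.0,   np.nan, 32010.0],
    [np.nan, 17.0,   173.0,  31600.0],
    [7.0,    np.nan, 169.0,  48020.0],
    [5.0,    24.0,   186.0,  17400.0],
    [4.0,    45.0,   201.0,   7800.0],
])

# Step 1: keep only the complete cases (rows with no missing values).
complete = data[~np.isnan(data).any(axis=1)]

# Means, variances, and covariances estimated from the complete cases.
means = complete.mean(axis=0)
cov = np.cov(complete, rowvar=False)   # 4 x 4 covariance matrix
```

Only the last two rows of this extract are complete, so every estimate at this stage rests on those rows alone.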
Second, maximum likelihood procedures (a special class of formulas) are used to estimate regression equations that relate each variable to each other variable. For example, these procedures might generate a formula that predicts depression from age, height, and wage.
The maximum likelihood procedures are designed to ensure these formulas predict the means, variances, and covariances more accurately than any other formulas (see Dempster, Laird, & Rubin, 1977). That is, suppose the researcher could calculate the probability of observing these means, variances, and covariances if these equations were correct, and suppose this probability is approximately .00004. Any other formulas would generate lower probabilities.
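The link between these moments and the regression equations can be made concrete. Under a multivariate normal model, the regression of any variable on the others is determined entirely by the mean vector and covariance matrix. The helper below (an illustrative sketch; the function name is hypothetical) derives the intercept and slopes from those moments alone:

```python
import numpy as np

def regression_from_moments(mean, cov, target):
    """Intercept and slopes for predicting column `target` from the
    remaining columns, given only a mean vector and covariance matrix.
    This is the conditional-mean formula for a multivariate normal."""
    mean, cov = np.asarray(mean, float), np.asarray(cov, float)
    rest = [i for i in range(len(mean)) if i != target]
    # Slopes solve  cov[rest, rest] @ slopes = cov[rest, target].
    slopes = np.linalg.solve(cov[np.ix_(rest, rest)], cov[rest, target])
    # Intercept ensures the regression passes through the means.
    intercept = mean[target] - slopes @ mean[rest]
    return intercept, slopes
```

Given the table of means, variances, and covariances above, this function would yield, for instance, the equation predicting depression from age, height, and wage.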
Third, these formulas can be used to estimate the missing values. For example, the missing height of the first person can be predicted from that person's depression, age, and wage.
The same process can be used to estimate the missing values associated with the other variables, generating data such as the following. The estimated values are marked with an asterisk.

ID     depression    age       height      wage
1      5             32        181.43*     32,010
2      1.362*        17        173         31,600
3      7             19.53*    169         48,020
4      5             24        186         17,400
...
100    4             45        201         7,800
Using these data, the means, variances, and covariances are then estimated again. As the following table shows, these estimates might change slightly because more data are included.
             depression    age          height      wage
depression   3.35
age          7.72          10.01
height       182.42        1743.82      194.41
wage         43019.315     125254.93    15125.51    14353.11
Mean         4.91          37.87        179.29      45504.45
The regression equations are then recalculated, using maximum likelihood procedures. These equations might now be marginally different, and hence generate slightly different estimates of the missing values (again marked with an asterisk):

ID     depression    age       height      wage
1      5             32        182.93*     32,010
2      1.291*        17        173         31,600
3      7             20.01*    169         48,020
4      5             24        186         17,400
...
100    4             45        201         7,800
This sequence of processes (the calculation of means, variances, and covariances, the formulation of regression equations, and the estimation of missing values) is undertaken iteratively. By default, SPSS repeats this process up to 25 times, stopping earlier if the estimates change only negligibly. This default can be increased, however.
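The iterative scheme can be sketched as follows. This is a simplified illustration, not SPSS's exact implementation: it alternates between estimating the moments and re-predicting each missing value from the observed values in its row, stopping after 25 iterations or once the estimates change only negligibly. The full EM algorithm additionally corrects the covariances for the uncertainty of the imputed values.

```python
import numpy as np

def em_impute(data, max_iter=25, tol=1e-4):
    """Iterative regression imputation in the spirit of EM.
    data: 2-D float array with np.nan marking missing values."""
    X = data.copy()
    miss = np.isnan(X)
    # Start from the column means of the observed values.
    X[miss] = np.take(np.nanmean(data, axis=0), np.where(miss)[1])
    for _ in range(max_iter):
        # Step 1: estimate means and covariances from the filled data.
        mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
        X_new = X.copy()
        # Steps 2-3: regression-predict each missing value from the
        # observed values in the same row.
        for i in np.unique(np.where(miss)[0]):
            m, o = miss[i], ~miss[i]           # missing / observed columns
            beta = np.linalg.solve(cov[np.ix_(o, o)], cov[np.ix_(o, m)])
            X_new[i, m] = mean[m] + beta.T @ (X[i, o] - mean[o])
        # Stop once the estimates change only negligibly.
        if np.max(np.abs(X_new - X)) < tol:
            return X_new
        X = X_new
    return X
```

On a small two-column example with one missing value, the loop converges within a handful of iterations to a self-consistent regression prediction.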
One of the problems with many techniques designed to estimate missing values is that the standard errors diminish. To illustrate, consider the estimate of depression for the second person. This estimate was derived from his or her age, height, and wage. If many estimates of depression were derived in this way, the extent to which depression is related to age, height, and wage would be overestimated. This process, therefore, disregards the possibility that depression does not depend only on age, height, and wage; many other random factors could increase or decrease these estimates of depression.
Therefore, to ensure the estimates are more realistic, the software package introduces some error to the variances and covariances. That is, rather than generate the table...
             depression    age          height      wage
depression   3.35
age          7.72          10.01
height       182.42        1743.82      194.41
wage         43019.315     125254.93    15125.51    14353.11
Mean         4.91          37.87        179.29      45504.45
...from the data, some of these values are modified slightly to...
             depression    age          height      wage
depression   3.34
age          7.71          10.00
height       182.41        1743.81      194.42
wage         43019.312     125254.94    15125.53    14353.10
Mean         4.91          37.87        179.29      45504.45
As this table shows, the modification is minor, affecting only the final decimal place in this example. By default, in SPSS, these errors follow a normal distribution. Other alternatives can be specified, however, such as a mixed normal or Student's t distribution, both of which require the specification of an additional parameter.
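A minimal sketch of this stochastic step, with hypothetical numbers: after a missing value has been predicted from the regression equation, a normally distributed error term is added, so the imputation does not understate the natural scatter around the regression line.

```python
import numpy as np

rng = np.random.default_rng(2016)

predicted_height = 181.43   # hypothetical regression prediction for person 1
residual_sd = 3.5           # assumed residual standard deviation

# Normal error is the default; mixed normal or Student's t could be
# drawn here instead, each requiring an extra parameter.
imputed_height = predicted_height + rng.normal(0.0, residual_sd)
```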
Expectation maximization is applicable whenever the data are missing completely at random or missing at random, but unsuitable when the data are not missing at random. To illustrate, consider the following extract of data. Conceivably, individuals who do not answer questions about depression tend to be very depressed. In other words, the likelihood of missing data on this variable is related to their level of depression.
In addition, individuals who do not answer questions about depression might be older, because the stigma of this affective disorder might be more potent in an older generation. Thus, the likelihood of missing data on depression is also related to age.
ID     depression    age    height    wage
1      5             32     .         32,010
2      .             17     173       31,600
3      7             .      169       48,020
4      5             24     186       17,400
...
100    4             45     201       7,800

(. = missing value)
Suppose the missing data on one variable, such as depression, are unrelated to individuals' actual level on this variable, and to their levels on the other measured variables, such as age, height, or wage. In this instance, researchers designate the data as missing completely at random, and expectation maximization is applicable.
Suppose, instead, the missing data on one variable, such as depression, are related to individuals' levels on the other measured variables, such as age, height, or wage. However, once these variables are controlled, suppose the missing data on depression are unrelated to individuals' actual level of depression. For example, perhaps individuals who do not answer questions about depression tend to be older. Once age is controlled, however (analogous to examining one age group only), missing data on depression might be unrelated to depression. In this instance, researchers designate the data as missing at random (not missing completely at random), and expectation maximization is still applicable.
Sometimes, however, missing data on one variable, such as depression, are still related to scores on that variable after the other factors are controlled. That is, the most depressed individuals might be least likely to answer, even within a single combination of age, height, and wage. In this instance, researchers designate the data as not missing at random, and expectation maximization is no longer applicable.
Several procedures can be undertaken to establish whether the data are missing completely at random, missing at random, or not missing at random. First, for each variable, researchers can assess whether the data differ between individuals who responded to that variable and individuals who did not.
For example, a series of t tests or a logistic regression analysis can be undertaken to assess whether individuals who completed the depression scale and individuals who did not differ on age, height, or wage. Nonsignificant findings suggest that missing data on this variable might be random; otherwise, at least one variable should differ between individuals who responded to this variable and individuals who did not.
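Such a comparison can be sketched as follows. Because the focus is only on the test statistic, this example computes Welch's t by hand with hypothetical data; in practice, a standard t-test routine or logistic regression would be used, together with the corresponding p value:

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for two independent samples with
    possibly unequal variances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Hypothetical ages, split by whether the depression item was answered.
age = np.array([32.0, 17.0, 24.0, 45.0, 38.0, 29.0, 51.0, 60.0])
depression_missing = np.array([False, True, False, False,
                               True, False, True, True])

t = welch_t(age[depression_missing], age[~depression_missing])
# A |t| well beyond its critical value would suggest that missingness on
# depression is related to age, i.e. not missing completely at random.
```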
Similarly, using SPSS or other packages, researchers can compute Little's MCAR test. A nonsignificant finding is consistent with the assumption that data are missing completely at random, and hence that expectation maximization is applicable. To conduct this test, undertake expectation maximization as usual; the test will appear by default.
If the data are not missing completely at random, they might nevertheless be missing at random. To establish this possibility, undertake expectation maximization as usual, and proceed to the table labelled Separate Variance t Tests. If all the p values exceed .05, or the chosen alpha, the data are consistent with missing at random, and expectation maximization is thus warranted.
Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage Publications.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91, 222-230.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473-489.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research, 33, 545-571.
Scheuren, F. (2005). Multiple imputation: How it began and continues. The American Statistician, 59, 315-319.
Last Update: 6/26/2016