The Pro Way to Fix Missing Data Bias: Parameter Estimation via PROC MI EM Action

Difficulty Level

Beginner

Published on : 12/05/2020

Expert Advice

Michael
Viya Infrastructure Manager.

When using the EM statement in PROC MI, always check the iteration history (ITPRINT option) to confirm that the algorithm converged. If it fails to converge within the default number of iterations, this often points to a high proportion of missing data or a near-singular covariance matrix. In that case you may need to raise the `MAXITER=` value or choose different starting values with the `INITIAL=` option.

The script begins by creating a dataset named `Fitness1` containing fitness measurements (Oxygen, RunTime, RunPulse). Some of these measurements are deliberately missing to simulate an incomplete data pattern. `PROC MI` is then called with the `nimpute=0` option, so no imputed datasets are produced; instead, the EM algorithm is used to estimate the means and the covariance matrix of the variables (from which standard deviations and correlations follow). These estimates are stored in the `outem` table, which is finally displayed with `PROC PRINT`.

Data Analysis

Type : INTERNAL_CREATION

Data is created directly in the script via a DATA step with a `datalines` statement. The `Fitness1` dataset is therefore entirely self-contained.

1 Code Block

DATA STEP Data

Explanation :
This DATA step creates the `Fitness1` table by reading the data provided via `datalines`. It defines three numeric variables: `Oxygen`, `RunTime`, and `RunPulse`. The double trailing at sign (`@@`) at the end of the `input` statement holds the current data line across iterations of the DATA step, which allows several observations to be read from the same line. A period (`.`) in the data marks a missing value.

/* Three numeric variables; @@ reads several observations per data line. */
DATA Fitness1;
   INPUT Oxygen RunTime RunPulse @@;
   DATALINES;
44.609 11.37 178 45.313 10.07 185
54.297 8.65 156 59.571 . .
49.874 9.22 . 44.811 11.63 176
. 11.95 176 . 10.85 .
39.442 13.08 174 60.055 8.63 170
50.541 . . 37.388 14.03 186
44.754 11.12 176 47.273 . .
51.855 10.33 166 49.156 8.95 180
40.836 10.95 168 46.672 10.00 .
46.774 10.25 . 50.388 10.08 168
39.407 12.63 174 46.080 11.17 156
45.441 9.63 164 . 8.92 .
45.118 11.08 . 39.203 12.88 168
45.790 10.47 186 50.545 9.93 148
48.673 9.40 186 47.920 11.50 170
47.467 10.50 170
;
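
Before running `PROC MI`, it can help to confirm how much of each variable is actually missing. The following is a minimal sketch, not part of the original script, and it assumes the `Fitness1` table created above exists in the WORK library. It uses `PROC MEANS` with the `N` and `NMISS` statistics, which is one common way to count missing values.

```sas
/* Count non-missing (N) and missing (NMISS) values for each variable. */
proc means data=Fitness1 n nmiss;
   var Oxygen RunTime RunPulse;
   title 'Missing-value counts in Fitness1';
run;

/* Clear the title so it does not carry over to later output. */
title;
```

If one variable accounts for most of the missing values, that is worth knowing before interpreting the EM estimates.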

2 Code Block

PROC MI Data

Explanation :
`PROC MI` analyzes the missing data in `Fitness1`. The `nimpute=0` option suppresses the creation of imputed datasets. The `em` statement requests maximum likelihood estimates computed with the Expectation-Maximization (EM) algorithm; `itprint` prints the iteration history and `outem=outem` saves the estimates in a new table named `outem`. The `simple` option requests basic descriptive statistics. The `seed` option fixes the random number seed for reproducibility, although it only affects results when imputations are actually drawn.

PROC MI DATA=Fitness1 seed=1518971 SIMPLE nimpute=0;
   em itprint outem=outem;
   var Oxygen RunTime RunPulse;
RUN;
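
The convergence advice at the top of this page can be sketched as a variant of the same call. This is an illustration under assumptions, not the original author's code: the option values are arbitrary examples, the table names `emiter` and `outem2` are hypothetical, and the exact option behavior should be checked against the `PROC MI` documentation for your SAS release. It assumes `Fitness1` has already been created by the DATA step above.

```sas
/* Variant of the EM call with explicit convergence controls.       */
/* The values below are illustrative only, not recommendations.     */
proc mi data=Fitness1 nimpute=0;
   em itprint
      maxiter=500        /* raise the cap on EM iterations */
      converge=1e-5      /* convergence criterion for the estimates */
      initial=cc         /* start from complete-case estimates */
      outiter=emiter     /* hypothetical table for the iteration history */
      outem=outem2;      /* hypothetical name, avoids overwriting outem */
   var Oxygen RunTime RunPulse;
run;
```

Saving the iteration history to a table makes it possible to inspect or plot how the estimates change from one iteration to the next, rather than reading it only from the printed output.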

3 Code Block

PROC PRINT

Explanation :
This block displays the content of the `outem` table, which contains the estimates (means, covariances) calculated by the `PROC MI` procedure.

PROC PRINT DATA=outem;
   title 'EM Estimates';
RUN;
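
To read the estimates more easily, the mean row and the covariance rows can be printed separately. This sketch assumes that `outem` is written as a `TYPE=COV` data set whose `_TYPE_` column takes the values `MEAN` and `COV`; if your output differs, adjust the `where` conditions accordingly.

```sas
/* Print only the row holding the EM estimates of the means. */
proc print data=outem noobs;
   where _TYPE_ = 'MEAN';
   title 'EM Estimates: Means';
run;

/* Print only the rows holding the EM covariance matrix. */
proc print data=outem noobs;
   where _TYPE_ = 'COV';
   title 'EM Estimates: Covariance Matrix';
run;

/* Clear the title so it does not carry over to later output. */
title;
```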

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
