Permutation (Randomization) Test for Mean Comparison

Difficulty Level
Beginner
This script runs a randomization (permutation) test to assess whether the mean age differs between fired and non-fired employees. It first computes the observed difference in means with PROC TTEST. It then uses PROC IML to generate 1000 random permutations of the Age values, and a single PROC TTEST call analyzes all 1000 permuted columns to build the null distribution of the mean difference. Finally, an empirical p-value is estimated as the proportion of permuted differences that are at least as extreme, in absolute value, as the observed one.
Data Analysis

Type : INTERNAL_CREATION


Data is created directly within the script via a DATA step using the DATALINES statement (Status and Age variables).

1 Code Block
DATA STEP Data
Explanation :
Creation of the initial 'discriminate' dataset containing individual status and age via internal data (datalines).
DATA discriminate;
  INPUT STATUS Age;

  /* Status = 0 = Fired
     Status = 1 = Not Fired */

  DATALINES;
0 34
0 37
...
1 54
;
RUN;
2 Code Block
PROC TTEST
Explanation :
Execution of Student's T-test on real data to obtain the observed mean difference (reference).
PROC TTEST DATA=discriminate;
  class STATUS;
  * STATUS is the grouping variable, 0 = Fired and 1 = Not Fired;
  var Age;
RUN;
3 Code Block
PROC IML Data
Explanation :
Use of the IML matrix language to generate 1000 random permutations of the 'Age' variable. Creation of a wide table 'newds' where the first column is the status and the following ones are the permutations.
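/* Turn off graphics and displayed output while permutations are generated and analyzed */
ods graphics off;
ods exclude all;
PROC IML;
  use discriminate;
  read all var {STATUS Age} into x;   /* column 1 = STATUS, column 2 = Age */
  close discriminate;
  p = t(ranperm(x[, 2], 1000));       /* each column of p is one random shuffle of Age */
  paf = x[, 1] || p;                  /* STATUS followed by the 1000 permuted Age columns */
  create newds from paf;              /* default variable names: COL1 - COL1001 */
  append from paf;
  close newds;
QUIT;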
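To see what the `ranperm` call and the transpose produce, here is a small sketch on a toy vector. The four values and the seed are made up for illustration and are not the real data. Run it on its own, outside the `ods exclude all` region, since the printed output would otherwise be suppressed.

```sas
proc iml;
  call randseed(12345);        /* arbitrary seed, only so the demo is repeatable */
  age = {34, 37, 41, 54};      /* toy values for illustration, not the real data */
  perms = ranperm(age, 3);     /* 3 x 4 matrix: each ROW is one shuffle of age */
  p = t(perms);                /* 4 x 3 matrix: each COLUMN is one shuffle of age */
  print perms, p;
quit;
```

Every column of `p` contains the same four values in a different order, which is why each column can be tested against the unchanged status column.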
4 Code Block
PROC TTEST Data
Explanation :
Batch execution of t-tests on the 1000 permuted columns (col2 to col1001) against the status (col1). Displayed ODS output is suppressed for performance; only the 'conflimits' table is captured, in the 'diff' dataset.
/* Capture the ConfLimits table from every t-test in the dataset diff */
ods OUTPUT conflimits=diff;

PROC TTEST DATA=newds plots=none;
  class col1;
  var col2 - col1001;
RUN;

/* Restore graphics and displayed output */
ods graphics on;
ods exclude none;
5 Code Block
PROC UNIVARIATE
Explanation :
Analysis of the distribution of randomly generated mean differences (Pooled method) and display of a histogram.
PROC UNIVARIATE DATA=diff;
  where method="Pooled";
  var mean;
  histogram mean;
RUN;
6 Code Block
DATA STEP Data
Explanation :
Filtering results to count how many permutations produced an absolute difference greater than or equal to the observed value (here hardcoded to 1.9238, which should correspond to the result of the first PROC TTEST).
DATA numdiffs;
  SET diff;
  where method="Pooled";

  IF abs(mean) >= 1.9238;
RUN;
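A hardcoded threshold has to be retyped whenever the data change. The following is a minimal sketch of one way to avoid that, assuming the `discriminate` and `diff` datasets from the earlier steps exist. The dataset name `obs` and the macro variable `obsdiff` are illustrative names, not part of the original script.

```sas
/* Capture the observed pooled difference instead of typing it in by hand */
ods output ConfLimits=obs;
proc ttest data=discriminate plots=none;
  class STATUS;
  var Age;
run;

/* Store the absolute observed difference in a macro variable (name is an assumption) */
data _null_;
  set obs;
  where Method = "Pooled";
  call symputx("obsdiff", abs(Mean));
run;

/* Same filter as above, with the threshold taken from the macro variable */
data numdiffs;
  set diff;
  where Method = "Pooled";
  if abs(Mean) >= &obsdiff;
run;
```

This variant rebuilds `numdiffs` with the same logic; only the source of the threshold changes.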
7 Code Block
PROC PRINT
Explanation :
Display of extreme permutations to allow manual calculation of the p-value (number of extreme observations / 1000).
PROC PRINT DATA=numdiffs;
  where method="Pooled";
RUN;
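The manual count can also be done in code. The sketch below assumes `numdiffs` already contains only the extreme permutations and that 1000 permutations were generated. The column names `n_extreme` and `p_value` are illustrative.

```sas
/* Empirical p-value = number of extreme permutations / number of permutations */
proc sql;
  select count(*)        as n_extreme,
         count(*) / 1000 as p_value format=6.4
  from numdiffs;
quit;
```

If the number of permutations in the PROC IML step is changed, the divisor here has to be changed to match.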
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : The comments in the source code include the mention 'borrowed code from internet'.

