Permutation Test for Mean Comparison

Difficulty Level

Beginner

Published on : 11/01/2021

The script begins by defining a `trauma` dataset with in-line data. It then runs a t-test on the observed data. Next, it uses PROC IML to generate thousands of permutations of the data and applies a t-test to each permutation. The mean differences from these permuted t-tests are stored in a `diff` dataset via ODS. Finally, it visualizes the distribution of permuted differences with PROC UNIVARIATE and collects the permuted differences that are as extreme or more extreme than the observed difference. That count, divided by the number of permutations, gives the empirical p-value.

Data Analysis

Type : Internal creation

The 'trauma' dataset is created directly within the script using 'datalines'. The other datasets are generated by later steps: 'newds' by `PROC IML`, 'diff' by `PROC TTEST` through ODS OUTPUT, and 'numdiffs' by a DATA step.

1 Code Block

DATA STEP Data

Explanation :
This DATA step creates the 'trauma' dataset from in-line data (`datalines`). It contains two variables: `state`, a group indicator coded 0 = Non-Trauma and 1 = Trauma, and `kcal`, a numeric measurement. `state` is read as a numeric variable because the later `PROC IML` step reads `state` and `kcal` into a single matrix, and an IML matrix cannot mix character and numeric columns.

/* 0 = Non-Trauma, 1 = Trauma */
DATA trauma;
INPUT state kcal;
DATALINES;
0 19
0 20
0 20
0 21
0 21
0 21
1 22
0 23
0 23
1 23
1 25
1 26
1 30
1 38
1 39
;
RUN;

2 Code Block

PROC TTEST

Explanation :
This `PROC TTEST` procedure performs a standard t-test to compare the mean of the `kcal` variable between the two groups defined by the `state` variable. This is the initial observed test whose mean difference will be compared to the permutation results.

PROC TTEST DATA=trauma;
class state;
/* the class variable may need to be numeric for the PROC IML step below */
var kcal;
RUN;
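
The later DATA step hard-codes the observed difference as a threshold. As a minimal sketch of one way to avoid retyping it, the observed t-test can write its `conflimits` table to a dataset and the absolute pooled mean difference can be stored in a macro variable. The dataset name `observed` and the macro variable name `obsdiff` are hypothetical names chosen for this illustration; they are not part of the original script.

```sas
/* Capture the ConfLimits table of the observed t-test in a dataset */
ods output conflimits=observed;
proc ttest data=trauma plots=none;
  class state;
  var kcal;
run;

/* Store the absolute pooled mean difference in a macro variable.   */
/* "observed" and "obsdiff" are illustrative names, not from the    */
/* original script.                                                  */
data _null_;
  set observed;
  where method = "Pooled";
  call symputx("obsdiff", abs(mean));
run;

%put NOTE: Observed absolute mean difference = &obsdiff;
```

With this in place, the filtering step could compare `abs(mean)` against `&obsdiff` instead of a typed-in number.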

3 Code Block

ODS

Explanation :
This ODS (Output Delivery System) statement suppresses all displayed output from subsequent SAS procedures, so the intermediate results of the many permuted t-tests do not flood the results window. Tables requested with `ods output` are still written to datasets while output is excluded.

ods exclude all;

4 Code Block

PROC IML Data

Explanation :
This `PROC IML` (Interactive Matrix Language) block is the core of the permutation test. It reads the 'trauma' dataset into a matrix, then uses the `ranperm` function to draw 5000 random permutations of the `kcal` column. The permutations are transposed so that each one occupies a column, joined to the original `state` column, and saved as a new dataset named `newds`. Because the matrix has no column names, the variables in `newds` are named `col1` (`state`) and `col2` through `col5001` (the permuted `kcal` values). This `newds` dataset is the input for the permutation t-tests.

PROC IML;
use trauma;
/* change variable names here: class variable first, then the analysis variable */
read all var{state kcal} into x;
close trauma;
/* the second argument of ranperm (5000) is the number of permutations */
p=t(ranperm(x[, 2], 5000));
paf=x[, 1]||p;
create newds from paf;
append from paf;
close newds;
QUIT;
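
To make the layout of the permuted matrix concrete, here is a small sketch on made-up data: four observations and three permutations instead of fifteen and 5000. The vectors `grp` and `y` and the seed value are assumptions for illustration only. Each column of `p` is one random reordering of `y`, and the group labels stay fixed in column 1.

```sas
proc iml;
  call randseed(1);          /* arbitrary seed so the run is repeatable   */
  grp = {0, 0, 1, 1};        /* toy group labels (hypothetical data)      */
  y   = {10, 11, 20, 21};    /* toy responses (hypothetical data)         */
  p   = t(ranperm(y, 3));    /* 4 rows x 3 columns: one permutation each  */
  paf = grp || p;            /* 4 x 4: labels in column 1, then the perms */
  print paf;
quit;
```

Every column after the first contains the same four values in a different order, which is what lets each column be tested against the unchanged group labels.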

5 Code Block

ODS

Explanation :
This ODS command captures the `conflimits` output table from the next `PROC TTEST` and saves it into a SAS dataset named `diff`. This table contains the confidence intervals and mean differences for each permutation.

ods OUTPUT conflimits=diff;

6 Code Block

PROC TTEST

Explanation :
This `PROC TTEST` runs on the `newds` dataset, which holds the permuted data. It compares the means of each permuted column, `col2` through `col5001` (one per permutation of `kcal`), between the groups defined by `col1` (the original `state`). The `plots=none` option suppresses graphs for these tests. The preceding `ods output` statement captures the resulting confidence limits and mean differences in the `diff` dataset.

PROC TTEST DATA=newds plots=none;
class col1;
var col2 - col5001;
RUN;

7 Code Block

ODS

Explanation :
This ODS statement restores normal output, so subsequent SAS procedures display their results again.

ods exclude none;
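
The later steps filter `diff` on `method="Pooled"`. A quick way to see why is to print the first few rows of `diff` once output is enabled again. This is only an inspection sketch; the choice of eight rows is arbitrary.

```sas
/* Show the first rows of the captured ConfLimits table */
proc print data=diff(obs=8);
run;
```

For each analysis variable, the table is expected to hold one row per group plus one row per method for the difference between groups, so filtering on the pooled method keeps a single mean difference per permutation.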

8 Code Block

PROC UNIVARIATE

Explanation :
This `PROC UNIVARIATE` is used to analyze and visualize the distribution of mean differences obtained from the permutation t-tests (stored in the `diff` dataset). A histogram of the `mean` variable (the mean differences) is generated, showing the empirical null distribution.

PROC UNIVARIATE DATA=diff;
where method="Pooled";
var mean;
histogram mean;
RUN;

9 Code Block

DATA STEP Data

Explanation :
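This DATA step creates the 'numdiffs' dataset. It keeps the rows of `diff` where the absolute value of the pooled mean difference (`mean`) is greater than or equal to the absolute observed difference. For the `trauma` data above, the group means are 21 (state 0) and 29 (state 1), so the observed difference is -8 and the threshold is 8. The number of observations in 'numdiffs' is the numerator of the empirical p-value.

DATA numdiffs;
SET diff;
where method="Pooled";
/* 8 is the absolute observed difference from the t-test above (21 - 29 = -8). */
/* Replace it with your own observed difference. abs() gives a two-tailed test; */
/* compare mean directly for a one-tailed test. round() guards against          */
/* floating-point noise when a permuted difference ties the observed one.       */
IF round(abs(mean), 1e-8) >= 8;
RUN;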

10 Code Block

PROC PRINT

Explanation :
This `PROC PRINT` displays the contents of the `numdiffs` dataset. It is used for a quick visual inspection of the permuted mean differences that were as extreme or more extreme than the observed difference.

PROC PRINT DATA=numdiffs;
where method="Pooled";
RUN;
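
The script stops at listing the extreme permutations. As a sketch of the final step, the empirical p-value can be computed as the number of rows in `numdiffs` divided by the number of permuted differences in `diff`. The macro variables `n_extreme` and `n_perm` and the dataset `pvalue` are hypothetical names introduced here for illustration.

```sas
/* Count the extreme permutations and the total number of permutations */
proc sql noprint;
  select count(*) into :n_extreme trimmed from numdiffs;
  select count(*) into :n_perm trimmed from diff where method = "Pooled";
quit;

/* Empirical p-value = extreme count / permutation count */
data pvalue;
  n_extreme = &n_extreme;
  n_perm    = &n_perm;
  p_value   = n_extreme / n_perm;
run;

proc print data=pvalue noobs;
run;
```

Counting the rows of `diff` instead of typing the number of permutations keeps the denominator correct if the permutation count in the `PROC IML` step is changed.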

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
