Goodness-of-Fit (GOF) Test for DataSet III

The script begins by creating a `haseman_soares` dataset from inline data (`datalines`). It then expands this dataset so that each row represents a single observed unit instead of a frequency count. Two macros, `%GOF_BB` and `%GOF_RCB`, are called to perform the Goodness-of-Fit tests. The residuals produced by these macros are combined, sorted, and converted to normal scores in preparation for Q-Q plots. Finally, `PROC SGPANEL` draws one Q-Q plot per distribution so the residuals can be compared against a normal distribution.

Data Analysis

Type : INTERNAL_CREATION

The initial `haseman_soares` dataset is created directly within the script via a `datalines` statement, then transformed to expand observations based on the `freq` column.

1 Code Block

DATA STEP Data

Explanation :
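This DATA step creates the `haseman_soares` dataset from raw data supplied through `datalines`. It reads `m` and the counts `t1` to `t10`, then uses a `DO OVER` loop on the implicit array `tt` (`t1` to `t10`) to reshape each input line into ten rows, one per column. In each row, `t` is the column position minus one (taken from the automatic index `_i_`, so `t` runs from 0 to 9) and `freq` is the count found in that column. Rows with a missing `freq` are still written at this stage; they are removed in the next step.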

DATA haseman_soares;
  INPUT m t1-t10;
  array tt t1-t10;
  DO over tt;
    t = _i_ - 1;
    freq = tt;
    OUTPUT;
  END;
  keep m t freq;
  DATALINES;
1 7 . . . . . . . . .
2 7 . . . . . . . . .
3 6 . . . . . . . . .
4 5 2 1 . . . . . . .
5 8 2 1 . 1 1 . . . .
6 8 . . . . . . . . .
7 4 4 2 1 . . . . . .
8 7 7 1 . . . . . . .
9 8 9 7 1 1 . . . . .
10 22 17 2 . 1 . 1 1 .
11 30 18 9 1 2 . 1 . 1 .
12 54 27 12 2 1 . 2 . . .
13 46 30 8 4 1 1 . 1 . .
14 43 21 13 3 1 . . 1 . 1
15 22 22 5 2 1 . . . . .
16 6 6 3 . 1 1 . . . .
18 3 . 2 1 . . . . . .
;
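`DO OVER` with an implicit array is legacy SAS syntax. As a minimal sketch, the same wide-to-long reshaping can be written with an explicit array and `dim()`. The dataset name `toy_long`, the index variable `j`, and the two data lines below are made up for illustration and are not part of the original script.

```sas
/* Toy input (NOT the Haseman and Soares data): m followed by three counts */
data toy_long;
  input m t1-t3;
  array tt{3} t1-t3;          /* explicit array over the count columns */
  do j = 1 to dim(tt);
    t = j - 1;                /* column position minus one, as in the original */
    freq = tt{j};             /* count stored in that column (may be missing) */
    output;                   /* one row per column, missing counts included */
  end;
  keep m t freq;
  datalines;
2 5 1 .
3 4 2 1
;
run;

proc print data=toy_long noobs;
run;
```

Each input line yields three rows here, including a row with a missing `freq` for the first line, which mirrors how the original step defers the removal of missing counts to the following DATA step.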

2 Code Block

DATA STEP Data

Explanation :
This DATA STEP block post-processes the `haseman_soares` dataset. It deletes observations where `freq` is missing. For each remaining observation, it generates a number of rows equal to the value of `freq`, thus denormalizing the data so that each row represents a single occurrence of the event. Variables `i` and `freq` are then dropped.

DATA haseman_soares;
  SET haseman_soares;
  IF freq = . THEN delete;
  DO i=1 to freq;
    OUTPUT;
  END;
  drop i freq;
RUN;
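One way to confirm that the expansion worked is to cross-tabulate `m` by `t` on the expanded dataset: the cell counts should match the frequencies entered in the `datalines` table. This check is a suggested addition, not part of the original script.

```sas
/* Sanity check: cell counts should reproduce the original frequency table */
proc freq data=haseman_soares;
  tables m*t / norow nocol nopercent;   /* show raw counts only */
run;
```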

3 Code Block

ODS

Explanation :
These ODS (Output Delivery System) statements open the HTML destination and turn on ODS Graphics for the procedures that follow. Because no file is named, SAS writes the HTML output to its default location.

ods html;
ods graphics on;
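If the report should land in a specific place, the HTML destination can be opened with `path=` and `file=` options. The directory and file name below are hypothetical placeholders; the directory must already exist.

```sas
/* Hypothetical output directory and file name: adjust to your environment */
ods html path="/tmp/gof_output" file="gof_dataset3.html";
ods graphics on;
```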

4 Code Block

MACRO

Explanation :
This block calls two SAS macros, `%GOF_BB` and `%GOF_RCB`, which are assumed to perform Goodness-of-Fit tests. They take the `haseman_soares` dataset as input and use variables `t` and `m` for their calculations. The `title2` parameter is used to add a subtitle to the output generated by the macros.

%GOF_BB (inds=haseman_soares,t=t,m=m,title2=DataSet III -- Haseman and Soares (1976));
%GOF_RCB(inds=haseman_soares,t=t,m=m,title2=Dataset III -- Haseman and Soares (1976));

5 Code Block

DATA STEP Data

Explanation :
This DATA STEP combines the `Resid_BB` and `Resid_RCB` datasets, which contain the residuals from the Goodness-of-Fit tests, into a new single dataset named `Resid_BB_RCB`. This prepares the data for unified analysis and visualization.

*--- Construct QQ Plots;
DATA Resid_BB_RCB;
  SET Resid_BB Resid_RCB;
RUN;
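A quick summary by group can confirm that both sets of residuals made it into the combined dataset. This sketch assumes, as the later steps of the script do, that the macro output contains the variables `Distribution` and `Resid`.

```sas
/* Row counts and basic statistics of the residuals for each distribution */
proc means data=Resid_BB_RCB n mean std min max;
  class Distribution;
  var Resid;
run;
```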

6 Code Block

PROC SORT

Explanation :
`PROC SORT` sorts the `Resid_BB_RCB` dataset in place by the `Distribution` variable. The sort is required by the `BY Distribution` statement in the `PROC RANK` step that follows, since BY-group processing expects the data to be ordered by the BY variable. `PROC SGPANEL` does not need sorted input for `panelby`.

PROC SORT DATA=Resid_BB_RCB;
  BY Distribution;
RUN;

7 Code Block

PROC RANK Data

Explanation :
`PROC RANK` computes normal scores (approximate normal quantiles) for the residuals. It reads `Resid_BB_RCB` and writes a new dataset, `new_qqplots`. The `normal=blom` option selects Blom's formula for the normal scores, and `ties=mean` resolves tied residuals by averaging. The `Resid` variable is scored separately within each `Distribution` group, and the result is stored in the new variable `NQuant`.

PROC RANK DATA=Resid_BB_RCB out=new_qqplots normal=blom ties=mean;
  BY Distribution;
  var Resid;
  ranks NQuant;
RUN;
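To make the `normal=blom` option concrete: Blom's score for the observation with rank r among n values is the standard normal quantile of (r - 3/8) / (n + 1/4). The sketch below computes these scores by hand for n = 5 using the `probit` function. The dataset name `blom_demo` and its variables are illustrative only and do not appear in the original script.

```sas
/* Blom normal scores computed by hand for ranks 1..5 (illustration only) */
data blom_demo;
  n = 5;
  do r = 1 to n;
    p = (r - 3/8) / (n + 1/4);   /* Blom plotting position */
    nquant = probit(p);          /* inverse of the standard normal CDF */
    output;
  end;
run;

proc print data=blom_demo noobs;
run;
```

The scores are symmetric around zero, with the middle rank (r = 3) mapping to a plotting position of 0.5 and therefore a score of 0.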

8 Code Block

PROC SGPANEL

Explanation :
This block uses `PROC SGPANEL` to generate Q-Q (quantile-quantile) plots of the residuals. The plot is paneled by `Distribution`, so each `Distribution` value gets its own panel, and `noautolegend` suppresses the automatic legend. Titles are defined and axis labels are customized. The `reg` statement draws the points, with the residuals on the x-axis and the normal quantiles on the y-axis, together with a fitted regression line. Points that stay close to the line suggest that the residuals are approximately normally distributed.

PROC SGPANEL DATA=new_qqplots noautolegend;
  panelby Distribution;
  title1 "DataSet III -- Haseman and Soares (1976)";
  title2 "QQ-Plots of Residuals based on Observed and Expected Frequencies";
  label Resid="Residuals" NQuant="Normal Quantiles";
  reg x=Resid y=NQuant;
RUN;
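As a cross-check on the hand-built plots, `PROC UNIVARIATE` can draw normal Q-Q plots directly with its `QQPLOT` statement. This is an alternative technique, not the method used in the script. It assumes `Resid_BB_RCB` is already sorted by `Distribution`, which the earlier `PROC SORT` step ensures.

```sas
/* Alternative: built-in normal Q-Q plots, one per Distribution group */
proc univariate data=Resid_BB_RCB noprint;
  by Distribution;
  qqplot Resid / normal(mu=est sigma=est);   /* reference line from estimated mean and std dev */
run;
```

Note that `QQPLOT` places the normal quantiles on the horizontal axis and the data on the vertical axis, which is the reverse of the orientation used in the `PROC SGPANEL` step.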

9 Code Block

ODS

Explanation :
These instructions disable ODS graphics generation and close the HTML destination, thus terminating the output of results to the HTML file.

ods graphics off;
ods html close;

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
