Statistical Analysis and Correlation Calculation on SASHELP.CARS

Difficulty Level

Beginner

Published on : 18/07/2023

The script begins by examining the structure and metadata of the `sashelp.cars` dataset using `PROC CONTENTS`. It then produces summary statistics: first an overall summary written to `cars_summary` and displayed by `PROC PRINT`, then `MSRP` averages grouped by `origin` and `make`, and finally the overall means of `WheelBase` and `Weight`. Next, the script attempts to calculate the Pearson correlation coefficient between `WheelBase` and `Weight` by hand, deriving each variable's deviations from its mean and the products of those deviations. This part contains a logic error rather than a syntax error: the statement `xy_dev = x_dev = y_dev;` is valid SAS, but it assigns the result of a logical comparison instead of a product, so the manual Pearson calculation is incorrect. The script then computes the correlation directly with `PROC CORR`, which provides the reference value. Finally, `PROC REG` fits a simple linear regression of `weight` on `wheelbase`.

Data Analysis

Type : MIXED

The script uses the built-in `sashelp.cars` dataset as the primary source. Several intermediate datasets (`cars_summary`, `msrp`, `WW_means`, `cars`, `dev`) are created during execution to store procedure results and transformed data; of these, `WW_means`, `cars`, and `dev` feed the subsequent steps of the manual correlation calculation.

1 Code Block

PROC CONTENTS

Explanation :
This procedure displays the data dictionary (metadata) for the `sashelp.cars` dataset. It provides information on variables, their types, formats, and lengths, which is essential for understanding data structure.

PROC CONTENTS DATA=sashelp.cars;
RUN;

2 Code Block

PROC SUMMARY Data

Explanation :
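The first `PROC SUMMARY` has no `VAR` statement, so it does not compute descriptive statistics for the individual numeric variables. Instead, it writes a single observation to a new dataset named `cars_summary`, containing the automatic variables `_TYPE_` and `_FREQ_`, where `_FREQ_` is the number of observations in `sashelp.cars`. (`PROC MEANS`, by contrast, analyzes all numeric variables when `VAR` is omitted.) The subsequent `PROC PRINT` displays the contents of the `cars_summary` dataset.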

PROC SUMMARY DATA=sashelp.cars;
OUTPUT out=cars_summary;
RUN;
PROC PRINT DATA=cars_summary;
RUN;
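
A `VAR` statement is what tells `PROC SUMMARY` which variables to analyze. The following is a minimal sketch, not part of the original script, that requests statistics for every numeric variable; the output dataset name `cars_stats` is hypothetical. With no statistic keywords on `OUTPUT`, the output dataset holds the default statistics (N, MIN, MAX, MEAN, STD), one per row, identified by `_STAT_`.

```sas
/* Sketch: request statistics for every numeric variable.        */
/* cars_stats is a hypothetical dataset name, not in the script. */
proc summary data=sashelp.cars;
  var _numeric_;
  output out=cars_stats;
run;

proc print data=cars_stats;
run;
```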

3 Code Block

PROC SUMMARY Data

Explanation :
This `PROC SUMMARY` calculates the mean manufacturer's suggested retail price (`MSRP`) of cars, grouped by the classification variables `origin` and `make`. The `NWAY` option restricts the output to the most detailed level, that is, one row per `origin` and `make` combination, and omits the overall and single-variable subtotals. The result is stored in the `msrp` dataset, with the mean in the `average_msrp` variable alongside the automatic `_TYPE_` and `_FREQ_` variables.

PROC SUMMARY DATA = sashelp.cars nway;
class origin make;
var msrp;
OUTPUT out = msrp mean(msrp) = average_msrp;
RUN;
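
To see what `NWAY` filters out, the same step can be run without it. This is a minimal sketch rather than part of the original script, and `msrp_all` is a hypothetical dataset name. With two `CLASS` variables, the output then contains four `_TYPE_` levels: 0 (overall), 1 (`make` only), 2 (`origin` only), and 3 (`origin` by `make`). `NWAY` keeps only level 3.

```sas
/* Sketch: same summary without NWAY, so all _TYPE_ levels are kept. */
/* msrp_all is a hypothetical dataset name.                          */
proc summary data=sashelp.cars;
  class origin make;
  var msrp;
  output out=msrp_all mean(msrp)=average_msrp;
run;

/* Count the rows produced at each _TYPE_ level */
proc freq data=msrp_all;
  tables _type_;
run;
```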

4 Code Block

PROC SUMMARY Data

Explanation :
This `PROC SUMMARY` calculates the means of the `wheelbase` and `weight` variables from the `sashelp.cars` dataset. The calculated means are stored in the `WW_means` dataset under the names `mean_wheelbase` and `mean_weight`.

PROC SUMMARY DATA = sashelp.cars;
var wheelbase weight;
OUTPUT out = WW_means mean(WheelBase Weight) = mean_wheelbase mean_weight;
RUN;

5 Code Block

DATA STEP Data

Explanation :
This `DATA` step creates a new `cars` dataset. It attaches the means (`mean_wheelbase`, `mean_weight`) from the one-row `WW_means` dataset to every row of `sashelp.cars`: the second `SET` statement executes only on the first iteration (`_n_ eq 1`), and because variables read by `SET` are retained across iterations, the means remain available for all subsequent rows. The step then calculates the deviations of `wheelbase` (`x_dev`) and `weight` (`y_dev`) from their means. The line `xy_dev = x_dev = y_dev;` is where the script goes wrong: the first `=` is an assignment and the second is a comparison, so `xy_dev` is set to 1 if `x_dev` equals `y_dev` and to 0 otherwise. For a Pearson correlation calculation, this line should be `xy_dev = x_dev * y_dev;`.

DATA cars;
SET sashelp.cars;
IF (_n_ eq 1) THEN SET ww_means;
x_dev = wheelbase - mean_wheelbase;
y_dev = weight - mean_weight;
xy_dev = x_dev = y_dev;
OUTPUT;
RUN;
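
A corrected version of this step might look like the following sketch, which changes only the `xy_dev` line so that it multiplies the two deviations. The output name `cars_fixed` is hypothetical, chosen so that the original `cars` dataset is not overwritten; the later steps would need to read it instead of `cars`.

```sas
/* Sketch of a corrected step; cars_fixed is a hypothetical name. */
data cars_fixed;
  set sashelp.cars;
  /* Read the one-row means dataset once; SET variables are retained */
  if _n_ = 1 then set ww_means;
  x_dev  = wheelbase - mean_wheelbase;
  y_dev  = weight - mean_weight;
  xy_dev = x_dev * y_dev;   /* product of deviations, not a comparison */
run;
```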

6 Code Block

PROC SUMMARY Data

Explanation :
This `PROC SUMMARY` takes the `cars` dataset as input. It calculates the uncorrected sum of squares (`USS`) of `x_dev` and `y_dev` (stored in `x_ss` and `y_ss`) and the sum (`SUM`) of `xy_dev` (stored in `xy_ss`), and writes them to the `dev` dataset. These are the intermediate quantities for the manual calculation of the Pearson correlation coefficient.

PROC SUMMARY DATA = cars;
var x_dev y_dev xy_dev;
OUTPUT out = dev uss(x_dev y_dev) = x_ss y_ss sum(xy_dev) = xy_ss;
RUN;
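
Because `x_dev` and `y_dev` are already centered on their means, their uncorrected sums of squares equal the corrected sums of squares of the raw variables. As a cross-check, a sketch like the following asks `PROC SUMMARY` for `CSS` directly on `wheelbase` and `weight`; its values should match `x_ss` and `y_ss`. The names `ss_check`, `x_css`, and `y_css` are hypothetical.

```sas
/* Sketch: corrected sums of squares straight from the raw variables. */
/* ss_check, x_css and y_css are hypothetical names.                   */
proc summary data=sashelp.cars;
  var wheelbase weight;
  output out=ss_check css(wheelbase weight)=x_css y_css;
run;

proc print data=ss_check;
run;
```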

7 Code Block

DATA STEP Data

Explanation :
This `DATA` step takes the `dev` dataset and adds a new `PearsonCorrelation` variable to it. It uses the standard formula to calculate the Pearson correlation coefficient from the sums of squares (`x_ss`, `y_ss`) and the sum of products of deviations (`xy_ss`). However, due to the error in the `xy_dev` calculation in a previous `DATA` step, the `PearsonCorrelation` calculated here will not represent the true correlation.

DATA dev;
SET dev;
PearsonCorrelation = xy_ss/(sqrt(x_ss) *sqrt(y_ss));
RUN;
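
For reference, the formula this step is meant to implement can be written as follows, with $x$ standing for `WheelBase` and $y$ for `Weight`. The mapping to `xy_ss`, `x_ss`, and `y_ss` in the last term holds only if the deviation products are computed by multiplication.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{\texttt{xy\_ss}}{\sqrt{\texttt{x\_ss}}\,\sqrt{\texttt{y\_ss}}}
```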

8 Code Block

PROC CORR

Explanation :
This `PROC CORR` directly calculates the Pearson correlation coefficient between the `WheelBase` and `Weight` variables from the `sashelp.cars` dataset. This is the standard and recommended method for obtaining correlations.

PROC CORR DATA = sashelp.cars;
var WheelBase Weight;
RUN;
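
To compare the `PROC CORR` result with a manually computed value in a dataset rather than by eye, the correlation can be written out with the `OUTP=` option. The sketch below is not part of the original script, and `corr_out` is a hypothetical dataset name.

```sas
/* Sketch: save the Pearson correlations to a dataset. */
/* corr_out is a hypothetical dataset name.            */
proc corr data=sashelp.cars outp=corr_out noprint;
  var WheelBase Weight;
run;

/* The output also holds MEAN, STD and N rows; keep the correlation rows */
proc print data=corr_out;
  where _type_ = 'CORR';
run;
```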

9 Code Block

PROC REG

Explanation :
This `PROC REG` performs a linear regression analysis. It models the `weight` (dependent) variable as a function of the `wheelbase` (independent) variable using the `sashelp.cars` dataset, providing statistics on model fit, regression coefficients, and ANOVA.

PROC REG DATA=sashelp.cars;
model weight=wheelbase;
RUN;
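
`PROC REG` is an interactive procedure, so it stays active after `RUN;` until a `QUIT;` statement or another step is submitted. A sketch that ends the procedure explicitly and saves the parameter estimates might look like this; `reg_est` is a hypothetical dataset name, and neither `OUTEST=` nor `QUIT;` appears in the original script.

```sas
/* Sketch: same model, with estimates saved and the procedure ended. */
/* reg_est is a hypothetical dataset name.                           */
proc reg data=sashelp.cars outest=reg_est;
  model weight = wheelbase;
run;
quit;

proc print data=reg_est;
run;
```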

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
