Statistical Analysis and Correlation Calculation on SASHELP.CARS

Difficulty Level
Beginner
The script begins by examining the structure and metadata of the `sashelp.cars` dataset with `PROC CONTENTS`. It then computes summary statistics: a global summary displayed with `PROC PRINT`, followed by average `MSRP` grouped by `origin` and `make`, and the overall means of `WheelBase` and `Weight`. Next, the script attempts to calculate the Pearson correlation coefficient between `WheelBase` and `Weight` by hand, deriving each observation's deviations from the mean and their products. Note that the `DATA` step contains a logic error: the statement `xy_dev = x_dev = y_dev;` is valid SAS, so it runs without any error message, but it performs a logical comparison instead of a multiplication, which makes the manual Pearson calculation incorrect. The script then computes the correlation directly with `PROC CORR`, which provides a reference value to check against. Finally, `PROC REG` fits a simple linear regression of `weight` on `wheelbase`.
Data Analysis

Type : MIXED


The script uses the built-in `sashelp.cars` dataset as the primary source. Several intermediate datasets (`cars_summary`, `msrp`, `WW_means`, `cars`, `dev`) are created dynamically during execution to store procedure results and transformed data, which are then used in subsequent steps.

1 Code Block
PROC CONTENTS
Explanation :
This procedure displays the data dictionary (metadata) for the `sashelp.cars` dataset. It provides information on variables, their types, formats, and lengths, which is essential for understanding data structure.
PROC CONTENTS DATA=sashelp.cars;
RUN;
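If the metadata is needed as data rather than as a printed report, `PROC CONTENTS` can also write it to a dataset. The sketch below is an optional variation, not part of the original script; `cars_meta` is a hypothetical dataset name.

```sas
/* cars_meta is a hypothetical name; NOPRINT suppresses the usual report */
PROC CONTENTS DATA=sashelp.cars OUT=cars_meta NOPRINT;
RUN;

/* In the OUT= dataset, TYPE is 1 for numeric and 2 for character variables */
PROC PRINT DATA=cars_meta;
  VAR name type length format label;
RUN;
```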
2 Code Block
PROC SUMMARY Data
Explanation :
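This `PROC SUMMARY` calculates basic descriptive statistics for all numeric variables in the `sashelp.cars` dataset and stores them in a new dataset named `cars_summary`. The `VAR _numeric_;` statement is required here: unlike `PROC MEANS`, `PROC SUMMARY` without a `VAR` statement only counts observations. Because the `OUTPUT` statement names no statistics, `cars_summary` holds the default set (N, MIN, MAX, MEAN, STD), identified by the `_STAT_` variable. The subsequent `PROC PRINT` displays the contents of the `cars_summary` dataset.
PROC SUMMARY DATA=sashelp.cars;
  VAR _numeric_;   /* analyze every numeric variable */
  OUTPUT out=cars_summary;
RUN;
PROC PRINT DATA=cars_summary;
RUN;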
3 Code Block
PROC SUMMARY Data
Explanation :
This `PROC SUMMARY` calculates the average retail price (`MSRP`) of cars. Statistics are grouped by `origin` and `make` (classification variables). The `NWAY` option ensures that the output only contains the most detailed combinations of the `CLASS` variables. The result, with the `average_msrp` variable, is stored in the `msrp` dataset.
PROC SUMMARY DATA = sashelp.cars nway;
  class origin make;
  var msrp;
  OUTPUT out = msrp mean(msrp) = average_msrp;
RUN;
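To see what `NWAY` filters out, the same step can be run without the option and the `_TYPE_` variable inspected. This is a sketch for illustration only; `msrp_all` is a hypothetical dataset name that does not appear in the original script.

```sas
/* Same summary without NWAY: every combination level is written out.
   msrp_all is a hypothetical name. */
PROC SUMMARY DATA=sashelp.cars;
  CLASS origin make;
  VAR msrp;
  OUTPUT OUT=msrp_all MEAN(msrp)=average_msrp;
RUN;

/* _TYPE_ = 0: overall, 1: by make, 2: by origin, 3: by origin and make.
   With NWAY, only the _TYPE_ = 3 rows are kept. */
PROC FREQ DATA=msrp_all;
  TABLES _TYPE_;
RUN;
```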
4 Code Block
PROC SUMMARY Data
Explanation :
This `PROC SUMMARY` calculates the means of the `wheelbase` and `weight` variables from the `sashelp.cars` dataset. The calculated means are stored in the `WW_means` dataset under the names `mean_wheelbase` and `mean_weight`.
PROC SUMMARY DATA = sashelp.cars;
  var wheelbase weight;
  OUTPUT out = WW_means mean(WheelBase Weight) = mean_wheelbase mean_weight;
RUN;
5 Code Block
DATA STEP Data
Explanation :
This `DATA` step creates a new `cars` dataset. It attaches the means (`mean_wheelbase`, `mean_weight`) from the one-row `WW_means` dataset to every row of `sashelp.cars`: the second `SET` statement runs only on the first iteration (`_n_ eq 1`), and because variables read by `SET` are automatically retained, the means stay available for all subsequent rows. The step then calculates the deviations of `wheelbase` (`x_dev`) and `weight` (`y_dev`) from their means. The line `xy_dev = x_dev = y_dev;` performs a logical comparison, assigning 1 to `xy_dev` if `x_dev` equals `y_dev`, and 0 otherwise. For a Pearson correlation calculation, this line should be `xy_dev = x_dev * y_dev;`.
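DATA cars;
  SET sashelp.cars;
  /* Read the single row of means once; SET variables are retained across iterations */
  IF (_n_ eq 1) THEN SET ww_means;
  x_dev = wheelbase - mean_wheelbase;
  y_dev = weight - mean_weight;
  /* Logic error: compares x_dev with y_dev (result is 1 or 0); the intended operator is * */
  xy_dev = x_dev = y_dev;
  OUTPUT;
RUN;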
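A corrected version of this step might look like the sketch below. It assumes the `WW_means` dataset from the previous block already exists in the WORK library. The name `cars_fixed` is hypothetical, chosen here so that the original `cars` dataset is not overwritten.

```sas
/* Sketch of a corrected step; cars_fixed is a hypothetical dataset name */
DATA cars_fixed;
  SET sashelp.cars;
  /* Read the one-row means dataset once; its variables are retained */
  IF _n_ = 1 THEN SET ww_means;
  x_dev  = wheelbase - mean_wheelbase;   /* deviation of WheelBase from its mean */
  y_dev  = weight - mean_weight;         /* deviation of Weight from its mean */
  xy_dev = x_dev * y_dev;                /* cross-product of the two deviations */
RUN;
```

To carry the fix through, the following `PROC SUMMARY` would read `cars_fixed` instead of `cars`; the resulting coefficient should then agree with the value reported by `PROC CORR`.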
6 Code Block
PROC SUMMARY Data
Explanation :
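This `PROC SUMMARY` takes the `cars` dataset as input. It calculates the uncorrected sum of squares (`USS`) for `x_dev` and `y_dev` (stored in `x_ss` and `y_ss`), and the sum (`SUM`) of `xy_dev` (stored in `xy_ss`). Because `x_dev` and `y_dev` are already deviations from the mean, their uncorrected sums of squares equal the corrected sums of squares of `wheelbase` and `weight`. These statistics are the intermediate quantities needed for the manual calculation of the Pearson correlation coefficient.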
PROC SUMMARY DATA = cars;
  var x_dev y_dev xy_dev;
  OUTPUT out = dev uss(x_dev y_dev) = x_ss y_ss sum(xy_dev) = xy_ss;
RUN;
7 Code Block
DATA STEP Data
Explanation :
This `DATA` step takes the `dev` dataset and adds a new `PearsonCorrelation` variable to it. It uses the standard formula to calculate the Pearson correlation coefficient from the sums of squares (`x_ss`, `y_ss`) and the sum of products of deviations (`xy_ss`). However, due to the error in the `xy_dev` calculation in a previous `DATA` step, the `PearsonCorrelation` calculated here will not represent the true correlation.
DATA dev;
  SET dev;
  PearsonCorrelation = xy_ss / (sqrt(x_ss) * sqrt(y_ss));
RUN;
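For reference, the formula this step implements can be written as follows, with $x$ standing for `WheelBase` and $y$ for `Weight` (the symbols are notation introduced here, not names from the script):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The two sums under the square roots correspond to `x_ss` and `y_ss`. The numerator corresponds to `xy_ss`, but only when `xy_dev` is computed as the product `x_dev * y_dev`.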
8 Code Block
PROC CORR
Explanation :
This `PROC CORR` directly calculates the Pearson correlation coefficient between the `WheelBase` and `Weight` variables from the `sashelp.cars` dataset. This is the standard and recommended method for obtaining correlations.
PROC CORR DATA = sashelp.cars;
  var WheelBase Weight;
RUN;
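To compare the `PROC CORR` result with the manually computed `PearsonCorrelation`, the correlation matrix can be saved to a dataset. The sketch below is an optional addition, not part of the original script; `corr_out` is a hypothetical dataset name.

```sas
/* NOSIMPLE suppresses the descriptive statistics table;
   OUTP= saves the Pearson statistics. corr_out is a hypothetical name. */
PROC CORR DATA=sashelp.cars NOSIMPLE OUTP=corr_out;
  VAR WheelBase Weight;
RUN;

/* The OUTP= dataset also holds MEAN, STD, and N rows; keep only the correlations */
PROC PRINT DATA=corr_out;
  WHERE _TYPE_ = 'CORR';
RUN;
```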
9 Code Block
PROC REG
Explanation :
This `PROC REG` performs a linear regression analysis. It models the `weight` (dependent) variable as a function of the `wheelbase` (independent) variable using the `sashelp.cars` dataset, providing statistics on model fit, regression coefficients, and ANOVA.
PROC REG DATA=sashelp.cars;
  model weight=wheelbase;
RUN;
QUIT;   /* PROC REG is interactive and stays active until QUIT */
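Because this model has a single predictor and an intercept, its output is tied directly to the correlation computed earlier. With $r$ denoting the Pearson coefficient between `WheelBase` ($x$) and `Weight` ($y$), and $s_x$, $s_y$ their sample standard deviations (notation introduced here, not taken from the script), the standard identities are:

```latex
R^2 = r^2, \qquad
\hat{\beta}_1 = r \,\frac{s_y}{s_x}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

The R-Square value in the `PROC REG` output should therefore match the square of the coefficient reported by `PROC CORR`, which gives a quick consistency check between the two procedures.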
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
