Linear Regression Analysis and Visualization

The script consists of two independent analyses. The first part creates a 'drinking' table to analyze the link between alcohol consumption and cirrhosis rates by country. It generates a scatter plot, executes several regression models with PROC REG, including a model excluding a specific country (France), and identifies influential observations. The second part creates a 'universe' table containing data on galaxy distance and velocity. It visualizes this data and fits a linear regression model without an intercept to illustrate Hubble's Law, also identifying high-leverage points.
Data Analysis

Type: internally created


Both datasets, 'drinking' and 'universe', are generated within the script using a DATA STEP and the 'cards' or 'datalines' statement, making them self-contained.

1 Code Block
DATA STEP Data
Explanation :
Creates the 'drinking' work table from manually entered data using the 'cards' statement. The table contains three variables: country name, alcohol consumption, and cirrhosis rate.
DATA drinking;
    /* country is read from fixed columns 1-12, so the numeric
       values on each data line must start in column 13 or later */
    INPUT country $ 1-12 alcohol cirrhosis;
CARDS;
France       24.7  46.1
Italy        15.2  23.6
W.Germany    12.3  23.7
Austria      10.9   7.0
Belgium      10.8  12.3
USA           9.9  14.2
Canada        8.3   7.4
E&W           7.2   3.0
Sweden        6.6   7.2
Japan         5.8  10.6
Netherlands   5.7   3.7
Ireland       5.6   3.4
Norway        4.2   4.3
Finland       3.9   3.6
Israel        3.1   5.4
;
RUN;
2 Code Block
PROC SGPLOT
Explanation :
Generates a scatter plot to visualize the relationship between alcohol consumption ('alcohol') and cirrhosis ('cirrhosis'). Each point is labeled with the country name. The commented-out code block shows an older method to obtain a similar result with PROC GPLOT.
PROC SGPLOT DATA=drinking;
    /* label each point with its country name */
    scatter x=alcohol y=cirrhosis / datalabel=country;
RUN;

/*
symbol1 pointlabel=('#country');
proc gplot data=drinking;
    plot cirrhosis*alcohol ;
run;
*/
3 Code Block
PROC REG
Explanation :
Performs a simple linear regression to model the cirrhosis rate as a function of alcohol consumption. `ODS GRAPHICS ON` enables ODS Graphics, so PROC REG produces its regression diagnostic plots. The commented-out code presents an alternative that superimposes a regression line, with confidence limits for the mean (`clm`), on a scatter plot using PROC SGPLOT.
ods graphics on;
PROC REG DATA=drinking;
    model cirrhosis=alcohol;
RUN;
ods graphics off;

/*
proc sgplot data=drinking;
    scatter x=alcohol y=cirrhosis ;
    reg x=alcohol y=cirrhosis / clm;
run;
*/
4 Code Block
PROC REG
Explanation :
Executes a new linear regression model excluding the observation for France, which was identified as a potentially influential point in the previous graph.
PROC REG DATA=drinking;
    model cirrhosis=alcohol;
    /* drop France before fitting */
    where country ne 'France';
RUN; QUIT;
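To see how much France moves the fit, the two sets of estimates can be placed side by side. The following is a minimal sketch, assuming the 'drinking' table created above. The table names `est_all`, `est_nofrance` and `est_compare` and the variable `fit` are illustrative and not part of the original script.

```sas
/* Capture the parameter estimates for the full data set */
ods output ParameterEstimates=est_all;
proc reg data=drinking;
    model cirrhosis = alcohol;
run; quit;

/* Capture the parameter estimates with France excluded */
ods output ParameterEstimates=est_nofrance;
proc reg data=drinking;
    where country ne 'France';
    model cirrhosis = alcohol;
run; quit;

/* Stack both tables and tag each row with the fit it came from */
data est_compare;
    length fit $ 16;
    set est_all(in=from_all) est_nofrance;
    if from_all then fit = 'All countries';
    else fit = 'Without France';
run;

proc print data=est_compare noobs;
    var fit Variable Estimate StdErr tValue Probt;
run;
```

A large change in the `alcohol` estimate between the two rows would suggest that the single French observation drives much of the fitted slope.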
5 Code Block
PROC REG Data
Explanation :
Re-executes the regression on the entire dataset and saves the diagnostic statistics into a new 'regout' table. PROC PRINT is then used to display observations that are considered outliers (absolute studentized residual > 2) or high-leverage points (leverage > 0.3).
PROC REG DATA=drinking;
    model cirrhosis=alcohol;
    /* save fitted values, studentized residuals and leverage (hat diagonal) */
    OUTPUT out=regout predicted=pred student=zres h=leverage;
RUN; QUIT;

PROC PRINT DATA=regout;
    where abs(zres)>2 or leverage>.3;
RUN;
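The fixed cutoffs above can also be expressed through common rules of thumb. With n = 15 observations and p = 2 parameters (intercept and slope), the frequently used leverage cutoff 2p/n is about 0.27, close to the 0.3 used above. The sketch below also requests Cook's distance through the `cookd=` keyword of the OUTPUT statement and flags values above 4/n. These cutoffs are conventions assumed here, not choices stated by the original script, and the names `regdiag`, `regflags`, `cooksd` and the `flag_` variables are illustrative.

```sas
proc reg data=drinking;
    model cirrhosis = alcohol;
    /* cookd= adds Cook's distance to the saved diagnostics */
    output out=regdiag student=zres h=leverage cookd=cooksd;
run; quit;

data regflags;
    set regdiag nobs=n;          /* n = number of observations in regdiag */
    p = 2;                       /* parameters: intercept + slope */
    flag_resid    = (abs(zres) > 2);      /* large studentized residual */
    flag_leverage = (leverage > 2*p/n);   /* rule of thumb: 2p/n */
    flag_cookd    = (cooksd > 4/n);       /* rule of thumb: 4/n */
    drop p;
run;

proc print data=regflags;
    where flag_resid or flag_leverage or flag_cookd;
run;
```

Deriving the cutoffs from n and p keeps the same screening logic usable if rows are added to or removed from the table.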
6 Code Block
DATA STEP Data
Explanation :
Creates the 'universe' work table from galaxy data (ID, name, velocity, distance) manually entered using the 'datalines' statement.
DATA universe;
    /* list input: values are separated by blanks */
    INPUT id Galaxy $ Velocity Distance;
DATALINES;
1  NGC0300   133   2.00
2  NGC0925   664   9.16
3  NGC1326A  1794  16.14
4  NGC1365   1594  17.95
5  NGC1425   1473  21.88
6  NGC2403   278   3.22
7  NGC2541   714   11.22
8  NGC2090   882   11.75
9  NGC3031   80    3.63
10 NGC3198   772   13.80
11 NGC3351   642   10.00
12 NGC3368   768   10.52
13 NGC3621   609   6.64
14 NGC4321   1433  15.21
15 NGC4414   619   17.70
16 NGC4496A  1424  14.86
17 NGC4548   1384  16.22
18 NGC4535   1444  15.78
19 NGC4536   1423  14.93
20 NGC4639   1403  21.98
21 NGC4725   1103  12.36
22 IC4182    318   4.49
23 NGC5253   232   3.15
24 NGC7331   999   14.72
;
RUN;
7 Code Block
PROC SGPLOT
Explanation :
Generates a scatter plot to visualize the relationship between a galaxy's distance and its recession velocity, adding labels to the axes. The commented-out code shows the equivalent with the older PROC GPLOT.
PROC SGPLOT DATA=universe;
    scatter y=velocity x=Distance;
    /* axis labels with units */
    yaxis label='Velocity (km/s)';
    xaxis label='Distance (megaparsecs)';
RUN;

/*
proc gplot data=universe;
    plot velocity*Distance;
    label velocity='Velocity (km/s)' distance='Distance (megaparsecs)';
run;
*/
8 Code Block
PROC REG
Explanation :
Fits a linear regression model for velocity as a function of distance. The `NOINT` option forces the regression line to pass through the origin, which is consistent with Hubble's Law (Velocity = H0 * Distance).
ods graphics on;
PROC REG DATA=universe;
    /* NOINT: fit the line through the origin */
    model velocity= distance / noint;
RUN;
ods graphics off;
QUIT;
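For a regression through the origin, the least-squares slope has a simple closed form: the sum of velocity times distance divided by the sum of squared distances. The sketch below computes that ratio with PROC SQL as a cross-check on the `distance` estimate reported by PROC REG, then draws the fitted line through the origin on the scatter plot. The macro variable name `h0` is illustrative, and the `trimmed` keyword assumes SAS 9.3 or later.

```sas
/* Closed-form slope for a no-intercept fit: sum(v*d) / sum(d**2) */
proc sql noprint;
    select sum(velocity*distance) / sum(distance**2)
        into :h0 trimmed
    from universe;
quit;

/* Write the slope to the log */
%put NOTE: no-intercept slope = &h0;

/* Overlay the line through the origin with that slope */
proc sgplot data=universe;
    scatter x=Distance y=velocity;
    lineparm x=0 y=0 slope=&h0;
    yaxis label='Velocity (km/s)';
    xaxis label='Distance (megaparsecs)';
run;
```

The value written to the log should agree with the PROC REG estimate up to the display precision of the macro variable.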
9 Code Block
PROC REG Data
Explanation :
Re-executes the regression without an intercept and saves the diagnostic statistics to the 'regout' table, overwriting the one created for the 'drinking' analysis. PROC PRINT then displays observations with a leverage greater than 0.08, which identifies the high-leverage points, i.e. the galaxies with the most potential to influence the estimated slope.
PROC REG DATA=universe;
    model velocity= distance / noint;
    /* save fitted values, studentized residuals and leverage (hat diagonal) */
    OUTPUT out=regout predicted=pred student=zres h=leverage;
RUN; QUIT;

PROC PRINT DATA=regout;
    where leverage>.08;
RUN;
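The 0.08 cutoff can be related to the leverage formula for a one-parameter model through the origin. Writing d_i for the distance of galaxy i and n = 24 for the number of galaxies, the leverages sum to one, so their average is 1/n. The script does not say how 0.08 was chosen; reading it as roughly twice the average leverage is an assumption.

```latex
h_i = \frac{d_i^{2}}{\sum_{j=1}^{n} d_j^{2}},
\qquad
\sum_{i=1}^{n} h_i = 1,
\qquad
\bar{h} = \frac{1}{n} = \frac{1}{24} \approx 0.042,
\qquad
2\bar{h} \approx 0.083
```

Because h_i grows with the square of the distance, the most distant galaxies are the ones most likely to exceed the cutoff.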
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.