Linear Regression Analysis and Visualization

The script consists of two independent analyses. The first part creates a 'drinking' table to analyze the link between alcohol consumption and cirrhosis rates by country. It generates a scatter plot, executes several regression models with PROC REG, including a model excluding a specific country (France), and identifies influential observations. The second part creates a 'universe' table containing data on galaxy distance and velocity. It visualizes this data and fits a linear regression model without an intercept to illustrate Hubble's Law, also identifying high-leverage points.

Data Analysis

Type : CREATION_INTERNE (data created internally)

Both datasets, 'drinking' and 'universe', are generated within the script using a DATA STEP and the 'cards' or 'datalines' statement, making them self-contained.

1 Code Block

DATA STEP Data

Explanation :
Creates the 'drinking' work table from data entered in-line with the 'cards' statement. The table contains three variables: country name (read from columns 1-12), alcohol consumption, and cirrhosis rate.

DATA drinking;
   /* country uses column input (columns 1-12), so the data lines must stay column-aligned */
   INPUT country $ 1-12 alcohol cirrhosis;
CARDS;
France       24.7  46.1
Italy        15.2  23.6
W.Germany    12.3  23.7
Austria      10.9   7.0
Belgium      10.8  12.3
USA           9.9  14.2
Canada        8.3   7.4
E&W           7.2   3.0
Sweden        6.6   7.2
Japan         5.8  10.6
Netherlands   5.7   3.7
Ireland       5.6   3.4
Norway        4.2   4.3
Finland       3.9   3.6
Israel        3.1   5.4
;
RUN;
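
Because the country name is read with column input, a misaligned data line can silently shift values into the wrong variable. A minimal sanity check, offered here as a sketch rather than as part of the original script, is to list the table and count the non-missing values; the source contains 15 data lines, so 15 observations are expected.

```sas
/* Sanity check (not in the original script): list the rows as read */
proc print data=drinking;
run;

/* Count non-missing and missing values and show the range of each measure */
proc means data=drinking n nmiss min max;
   var alcohol cirrhosis;
run;
```

If the log reports lost cards or the listing shows digits inside the country column, the alignment of the data lines is the first thing to check.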

2 Code Block

PROC SGPLOT

Explanation :
Generates a scatter plot to visualize the relationship between alcohol consumption ('alcohol') and cirrhosis ('cirrhosis'). Each point is labeled with the country name. The commented-out code block shows an older method to obtain a similar result with PROC GPLOT.

PROC SGPLOT DATA=drinking;
   scatter x=alcohol y=cirrhosis / datalabel=country;
RUN;

/*
symbol1 pointlabel=('#country');
proc gplot data=drinking;
   plot cirrhosis*alcohol;
run;
*/

3 Code Block

PROC REG

Explanation :
Performs a simple linear regression that models the cirrhosis rate as a function of alcohol consumption. `ODS GRAPHICS ON` enables ODS Graphics, so PROC REG also produces its regression diagnostic plots; `ODS GRAPHICS OFF` disables it again afterwards. The commented-out code shows an alternative that overlays a regression line with confidence limits for the mean (the `clm` option) on a scatter plot with PROC SGPLOT.

ods graphics on;
PROC REG DATA=drinking;
   model cirrhosis=alcohol;
RUN;
ods graphics off;

/*
proc sgplot data=drinking;
   scatter x=alcohol y=cirrhosis;
   reg x=alcohol y=cirrhosis / clm;
run;
*/
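
For reference, the model fitted here is the standard simple linear regression. In the notation below, which is introduced only for this note, x_i is the alcohol consumption and y_i the cirrhosis rate of country i:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
\hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x
```

In the PROC REG output, the "Parameter Estimates" table reports the estimate of \beta_0 on the Intercept row and the estimate of \beta_1 on the alcohol row.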

4 Code Block

PROC REG

Explanation :
Refits the linear regression with the observation for France excluded; France was identified as a potentially influential point in the previous graph.

PROC REG DATA=drinking;
   model cirrhosis=alcohol;
   where country ne 'France';
RUN; QUIT;
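
To see how much the fit changes when France is dropped, the two sets of estimates can be placed side by side. The following is one possible sketch, not part of the original script: it uses the OUTEST= option of PROC REG to save the coefficients of each fit, and the table names 'est_all', 'est_nofrance', and 'est_compare' and the variable 'fit' are names chosen here for illustration.

```sas
/* Fit on all countries; NOPRINT suppresses the listing, OUTEST= saves the estimates */
proc reg data=drinking outest=est_all noprint;
   model cirrhosis = alcohol;
run; quit;

/* Same model with France excluded */
proc reg data=drinking outest=est_nofrance noprint;
   where country ne 'France';
   model cirrhosis = alcohol;
run; quit;

/* Stack the two one-row tables and label each fit */
data est_compare;
   length fit $ 16;
   set est_all(in=from_all) est_nofrance;
   if from_all then fit = 'All countries';
   else fit = 'Without France';
   keep fit intercept alcohol _rmse_;
run;

proc print data=est_compare noobs;
run;
```

In the printed table, the 'alcohol' column holds the estimated slope of each fit and '_RMSE_' the root mean squared error, so a large shift between the two rows indicates that France has a strong effect on the fitted line.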

5 Code Block

PROC REG Data

Explanation :
Refits the regression on the entire dataset and saves diagnostic statistics to a new 'regout' table: predicted values ('pred'), studentized residuals ('zres'), and leverage ('leverage'). PROC PRINT then lists the observations that are potential outliers (absolute studentized residual > 2) or high-leverage points (leverage > 0.3).

PROC REG DATA=drinking;
   model cirrhosis=alcohol;
   OUTPUT out=regout predicted=pred student=zres h=leverage;
RUN; QUIT;

PROC PRINT DATA=regout;
   where abs(zres)>2 or leverage>.3;
RUN;
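
Leverage measures how unusual an observation's predictor value is, not how much the observation actually changes the fit. As a possible complement (a sketch, not part of the original script), Cook's distance and externally studentized residuals can be requested through the same OUTPUT statement with the COOKD= and RSTUDENT= keywords. The table name 'regout_infl' and the variable names 'rstud' and 'cooksd' are chosen here for illustration.

```sas
/* Save influence diagnostics alongside the residual and leverage statistics */
proc reg data=drinking noprint;
   model cirrhosis = alcohol;
   output out=regout_infl student=zres rstudent=rstud h=leverage cookd=cooksd;
run; quit;

/* Rank the countries by Cook's distance, largest first */
proc sort data=regout_infl;
   by descending cooksd;
run;

/* Show the five observations with the largest Cook's distance */
proc print data=regout_infl(obs=5);
   var country alcohol cirrhosis zres rstud leverage cooksd;
run;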

6 Code Block

DATA STEP Data

Explanation :
Creates the 'universe' work table from galaxy data (ID, name, velocity, distance) manually entered using the 'datalines' statement.

DATA universe;
   INPUT id Galaxy $ Velocity Distance;
DATALINES;
1 NGC0300 133 2.00
2 NGC0925 664 9.16
3 NGC1326A 1794 16.14
4 NGC1365 1594 17.95
5 NGC1425 1473 21.88
6 NGC2403 278 3.22
7 NGC2541 714 11.22
8 NGC2090 882 11.75
9 NGC3031 80 3.63
10 NGC3198 772 13.80
11 NGC3351 642 10.00
12 NGC3368 768 10.52
13 NGC3621 609 6.64
14 NGC4321 1433 15.21
15 NGC4414 619 17.70
16 NGC4496A 1424 14.86
17 NGC4548 1384 16.22
18 NGC4535 1444 15.78
19 NGC4536 1423 14.93
20 NGC4639 1403 21.98
21 NGC4725 1103 12.36
22 IC4182 318 4.49
23 NGC5253 232 3.15
24 NGC7331 999 14.72
;
RUN;

7 Code Block

PROC SGPLOT

Explanation :
Generates a scatter plot of each galaxy's recession velocity against its distance, with labeled axes. The commented-out code shows the equivalent with the older SAS/GRAPH procedure PROC GPLOT.

PROC SGPLOT DATA=universe;
   scatter y=velocity x=Distance;
   yaxis label='velocity (km/s)';
   xaxis label='Distance (mega-parsec)';
RUN;

/*
proc gplot data=universe;
   plot velocity*Distance;
   label velocity='velocity (km/s)' distance='Distance (mega-parsec)';
run;
*/

8 Code Block

PROC REG

Explanation :
Fits a linear regression model for velocity as a function of distance. The `NOINT` option forces the regression line to pass through the origin, which is consistent with Hubble's Law (Velocity = H0 * Distance).

ods graphics on;
PROC REG DATA=universe;
   model velocity= distance / noint;
RUN;
ods graphics off;
QUIT;
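
To view the through-origin fit on top of the scatter plot, one option is to save the estimated slope and draw it with the LINEPARM statement of PROC SGPLOT. This is a sketch that is not part of the original script; it assumes a SAS release whose PROC SGPLOT supports LINEPARM, and the table name 'hubble_est' and the macro variable 'h0' are names chosen here for illustration.

```sas
/* Save the no-intercept estimate; in the OUTEST= table the slope is
   stored in a variable named after the regressor (distance) */
proc reg data=universe outest=hubble_est noprint;
   model velocity = distance / noint;
run; quit;

/* Copy the slope into a macro variable */
data _null_;
   set hubble_est;
   call symputx('h0', distance);
run;

/* Scatter plot with the fitted line drawn through the origin */
proc sgplot data=universe;
   scatter x=distance y=velocity;
   lineparm x=0 y=0 slope=&h0 / legendlabel="Through-origin fit";
   xaxis label='Distance (mega-parsec)' min=0;
   yaxis label='Velocity (km/s)' min=0;
run;
```

Starting both axes at zero makes it visible that the line is constrained to pass through the origin, which is the assumption imposed by `NOINT`.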

9 Code Block

PROC REG Data

Explanation :
Refits the no-intercept regression and saves the diagnostic statistics to the 'regout' table (overwriting the previous one). PROC PRINT then displays the observations with leverage greater than 0.08, that is, the high-leverage points, which have the greatest potential to affect the estimated slope.

PROC REG DATA=universe;
   model velocity= distance / noint;
   OUTPUT out=regout predicted=pred student=zres h=leverage;
RUN; QUIT;

PROC PRINT DATA=regout;
   where leverage>.08;
RUN;
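
The 0.08 cutoff can be related to the structure of the no-intercept model. Writing d_i for the distance and v_i for the velocity of galaxy i (notation introduced here for illustration), the least-squares slope and the leverage of each observation are:

```latex
\hat H_0 = \frac{\sum_i d_i v_i}{\sum_i d_i^2}, \qquad
h_i = \frac{d_i^2}{\sum_j d_j^2}, \qquad
\sum_i h_i = 1
```

With 24 galaxies the average leverage is 1/24, about 0.042, so a cutoff of 0.08 is roughly twice the average. This reading of the cutoff is an interpretation and is not stated in the script. Because h_i grows with the square of d_i, the observations that PROC PRINT lists are the most distant galaxies.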

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
