The Ultimate Signal Smoothing Hack: Recovering the Bumps Function using PROC GLMSELECT

This code is also available in: Deutsch Français
Difficulty Level
Beginner
Published on :
Simon

Expert Advice

Simon
Expert SAS et fondateur.

When using the multiscale knot selection method in PROC GLMSELECT, SAS applies a localized approach to spline fitting. This is essential for the Bumps function because standard global smoothing methods (like LOESS or simple polynomials) tend to "oversmooth" the peaks, cutting off the height of the bumps. The multiscale method allows the model to be flexible where the data changes rapidly (the peaks) while staying smooth in the valleys.

The script begins by generating a dataset named 'DoJoBumps'. This dataset contains an 'x' variable, a 'bumps' variable calculated from a complex mathematical function (the 'Bumps' function), and a 'bumpsWithNoise' variable which is the 'bumps' variable with added random noise. Then, several plots are produced with PROC SGPLOT to visualize the original data, the noisy data, and attempts at smoothing with LOESS and PBSPLINE methods. The main step uses PROC GLMSELECT with a spline effect (SPLINE) on the 'x' variable to model the 'bumpsWithNoise' variable. The knot selection method is 'multiscale'. Finally, the result of the GLMSELECT model prediction is plotted and compared to the original 'bumps' curve to evaluate the quality of the fit.
Data Analysis

Type : CREATION_INTERNE


The 'DoJoBumps' dataset is entirely created algorithmically within the DATA step. It does not depend on any external data source or SASHELP. The data is generated to simulate the 'Bumps' function with added noise.

1 Code Block
DATA STEP Data
Explanation :
This DATA STEP block creates the 'DoJoBumps' table. It generates 2048 observations. For each observation, it calculates an 'x' value, then calls a 'compute' subroutine to calculate the 'bumps' value based on a complex formula involving 11 coefficients (Donoho and Johnstone's Bumps function). Gaussian noise, whose seed is fixed by the macro variable 'random', is added to create the 'bumpsWithNoise' variable.
Copied!
1DATA DoJoBumps;
2 keep x bumps bumpsWithNoise;
3 
4 pi = arcos(-1);
5 
6 DO n=1 to 2048;
7 x=(2*n-1)/4096;
8 link compute;
9 bumpsWithNoise=bumps+rannor(&random)*sqrt(5);
10 OUTPUT;
11 END;
12 stop;
13 
14compute:
15 array t(11) _temporary_ (.1 .13 .15 .23 .25 .4 .44 .65 .76 .78 .81);
16 array b(11) _temporary_ ( 4 5 3 4 5 4.2 2.1 4.3 3.1 5.1 4.2);
17 array w(11) _temporary_ (.005 .005 .006 .01 .01 .03 .01 .01 .005 .008 .005);
18 
19 bumps=0;
20 DO i=1 to 11;
21 bumps=bumps+b[i]*(1+abs((x-t[i])/w[i]))**-4;
22 END;
23 bumps=bumps*10.528514619;
24 return;
25RUN;
2 Code Block
PROC SGPLOT
Explanation :
This block uses PROC SGPLOT to overlay two series plots: the noisy data curve ('bumpsWithNoise') in black and the original un-noisy curve ('bumps') in red. This allows visualizing the effect of added noise on the original function.
Copied!
1PROC SGPLOT DATA=DoJoBumps;
2 yaxis display=(nolabel);
3 series x=x y=bumpsWithNoise/lineattrs=(color=black);
4 series x=x y=bumps/lineattrs=(color=red);
5RUN;
3 Code Block
PROC SGPLOT
Explanation :
This block uses PROC SGPLOT to compare the original 'bumps' curve with a LOESS smoothing curve applied to the noisy data 'bumpsWithNoise'. LOESS smoothing is a non-parametric method for estimating local trend and shows a first attempt at denoising.
Copied!
1PROC SGPLOT DATA=DoJoBumps;
2 yaxis display=(nolabel);
3 series x=x y=bumps;
4 loess x=x y=bumpsWithNoise / lineattrs=(color=red) nomarkers;
5RUN;
4 Code Block
PROC SGPLOT
Explanation :
This block uses PROC SGPLOT to compare the original 'bumps' curve with a penalized B-spline smoothing curve applied to the noisy data 'bumpsWithNoise'. This is another smoothing method, often more flexible than LOESS.
Copied!
1PROC SGPLOT DATA=DoJoBumps;
2 yaxis display=(nolabel);
3 series x=x y=bumps;
4 pbspline x=x y=bumpsWithNoise /
5 lineattrs=(color=red) nomarkers;
6RUN;
5 Code Block
PROC GLMSELECT
Explanation :
This block is the core of the analysis. It uses PROC GLMSELECT to model the dependent variable 'bumpsWithNoise'. The 'EFFECT spl = spline(x ...)' statement defines a spline effect on the 'x' variable. The 'multiscale' method is used for spline knot selection, which is effective for functions with multiple variations. The model is then fitted and predictions are saved in an 'out1' table under the 'pBumps' variable.
Copied!
1PROC GLMSELECT DATA=dojoBumps;
2 effect spl = spline(x / knotmethod=multiscale(endscale=8)
3 split details);
4 model bumpsWithNoise=spl;
5 OUTPUT out=out1 p=pBumps;
6RUN;
6 Code Block
PROC SGPLOT
Explanation :
This final block uses PROC SGPLOT to visualize the quality of the GLMSELECT model fit. It overlays the original 'bumps' curve with the curve of predicted values ('pBumps') by the spline model. This shows how the model successfully recovered the underlying data structure despite the noise.
Copied!
1PROC SGPLOT DATA=out1;
2 yaxis display=(nolabel);
3 series x=x y=bumps;
4 series x=x y=pBumps / lineattrs=(color=red);
5RUN;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : S A S S A M P L E L I B R A R Y


Related Documentation

Aucune documentation spécifique pour cette catégorie.