The functional analysis focuses on identifying and resolving challenges related to merging datasets. It covers the use of PROC CONTENTS and PROC PRINT for initial data examination, PROC SQL for common variable detection, and DATA steps with functions like CATX and INPUT, as well as LENGTH and ATTRIB statements, to standardize variable attributes. Examples illustrate how to handle differences in variable type, length, format, and label, and how to create unique identifiers for duplicate BY values to ensure correct merge operations. Relationships between datasets (one-to-many, many-to-one, one-to-one) are also discussed, emphasizing the importance of adequate preparation for each combination method.
Data Analysis
Type : CREATION_INTERNE
Examples use generated data (datalines) or SASHELP, as well as predefined datasets such as Inventory, Sales, Sales2019, CarsSmall, Class, and Classfit, created internally for demonstration purposes.
1 Code Block
PROC CONTENTS / PROC SORT / PROC PRINT Data
Explanation : Displays descriptive information of input datasets to identify common variables and compare their attributes. Sorts datasets by the PartNumber variable. Prints datasets to examine common variable values and determine the relationship between datasets.
Explanation : Creates a temporary table 'commonvars' from the columns of the 'INVENTORY', 'SALES', and 'SALES2019' datasets in the WORK library. It then selects variable names that are common to all these datasets by grouping by name and checking that the count matches the number of unique datasets.
Copied!
proc sql;
title "Variables Common to Inventory, Sales, and Sales2019";
create table commonvars as
select memname, upcase(name) as name
from dictionary.columns
where libname='WORK' and
memname in ('INVENTORY', 'SALES', 'SALES2019');
select name
from commonvars
group by name
having count(*)=(select count(distinct(memname)) from commonvars);
quit;
1
PROC SQL;
2
title "Variables Common to Inventory, Sales, and Sales2019";
3
create TABLE commonvars as
4
select memname, upcase(name) as name
5
from dictionary.columns
6
where LIBNAME='WORK' and
7
memname in ('INVENTORY', 'SALES', 'SALES2019');
8
select name
9
from commonvars
10
group BY name
11
having count(*)=(select count(distinct(memname)) from commonvars);
12
QUIT;
3 Code Block
PROC SORT / PROC FREQ / DATA STEP Data
Explanation : Sorts sales data by 'partNumber'. Uses PROC FREQ to identify duplicate 'partNumber' values and stores them in 'SalesDupes'. Then, in a DATA step, creates a new unique identifier 'uniqueID' by concatenating 'partNumber' with the automatic observation number (_N_) using the CATX function for each record in the 'Sales' dataset. Finally, prints the new 'SalesUnique' dataset with the 'uniqueID' variable.
Copied!
proc sort data=sales;
by partNumber;
run;
proc print data=sales; run;
proc freq data = Sales noprint;
tables partNumber / out = SalesDupes
(keep = partNumber Count
where = (Count > 1));
run;
proc print data=SalesDupes;run;
data SalesUnique;
set Sales;
uniqueID = catx('.',partNumber,_n_);
run;
proc print data=SalesUnique;
var uniqueID partNumber partName salesPerson;
run;
1
PROC SORTDATA=sales;
2
BY partNumber;
3
RUN;
4
PROC PRINTDATA=sales; RUN;
5
6
PROC FREQDATA = Sales noprint;
7
tables partNumber / out = SalesDupes
8
(keep = partNumber Count
9
where = (Count > 1));
10
RUN;
11
12
PROC PRINTDATA=SalesDupes;RUN;
13
14
DATA SalesUnique;
15
SET Sales;
16
uniqueID = catx('.',partNumber,_n_);
17
RUN;
18
19
PROC PRINTDATA=SalesUnique;
20
var uniqueID partNumber partName salesPerson;
21
RUN;
4 Code Block
DATA STEP / PROC CONTENTS / PROC MERGE Data
Explanation : Creates a 'cars' dataset from 'sashelp.cars'. Displays descriptive information for 'cars' and 'CarsSmall' datasets. Attempts to merge 'cars' and 'CarsSmallNum' by 'make' and 'model', but fails due to incompatible data types for the 'Weight' variable.
To correct the problem:
data CarsSmallNum;
set CarsSmall;
weightNum=input(weight, 8.);
drop weight;
rename WeightNum=weight;
run;
proc sort data=cars; by make model; run;
proc sort data=CarsSmallNum; by Make Model; run;
data combineCars;
merge cars CarsSmallNum;
by make model;
keep Make DriveTrain Model MakeModelDrive Weight;
run;
proc contents data=combineCars; run;
proc print data=combineCars; run;
Copied!
data cars;
set sashelp.cars;
run;
proc contents data=cars; run;
proc contents data=CarsSmall; run;
data combineCars;
merge cars CarsSmallNum;
by make model;
keep Make DriveTrain Model MakeModelDrive Weight;
run;
1
DATA cars;
2
SET sashelp.cars;
3
RUN;
4
5
PROC CONTENTSDATA=cars; RUN;
6
PROC CONTENTSDATA=CarsSmall; RUN;
7
8
DATA combineCars;
9
MERGE cars CarsSmallNum;
10
BY make model;
11
keep Make DriveTrain Model MakeModelDrive Weight;
12
RUN;
5 Code Block
PROC PRINT / PROC MERGE / DATA STEP Data
Explanation : Prints 'quarter1', 'quarter2', 'quarter3', 'quarter4' datasets. Attempts to merge these datasets by 'Account'. The first attempt may generate a warning due to different lengths for the 'Mileage' variable. The second DATA step uses the LENGTH statement to explicitly set the length of 'Mileage' to 6 before merging, ensuring consistent length in the 'yearly' dataset. PROC CONTENTS is then executed to verify the attributes of the merged dataset.
Explanation : Displays descriptive information of 'class' and 'classfit' datasets to identify common variables and compare their attributes, focusing on format and label differences for 'Height' and 'Weight'. Sorts datasets by 'Name'. Merges 'class' and 'classfit' by 'Name'. Uses the ATTRIB statement to explicitly set the label for 'Weight' and the 'comma8.2' format for 'Height', 'Weight', and 'Predict'. Prints and displays the attributes of the resulting 'merged' dataset.
Copied!
proc contents data=class; run;
proc contents data=classfit; run;
proc sort data=class; by name; run;
proc sort data=classfit; by name; run;
data merged;
merge class classfit; by Name;
attrib Weight
label = "Weight";
attrib Height Weight Predict format=comma8.2;
run;
proc print data=merged;
run;
proc contents data=merged; run;
1
PROC CONTENTSDATA=class; RUN;
2
PROC CONTENTSDATA=classfit; RUN;
3
4
PROC SORTDATA=class; BY name; RUN;
5
PROC SORTDATA=classfit; BY name; RUN;
6
7
DATA merged;
8
MERGE class classfit; BY Name;
9
attrib Weight
10
label = "Weight";
11
attrib Height Weight Predict FORMAT=comma8.2;
12
RUN;
13
PROC PRINTDATA=merged;
14
RUN;
15
PROC CONTENTSDATA=merged; RUN;
7 Code Block
DATA STEP Data
Explanation : Renames the 'Weight' variable to 'WeightLBS' in the 'vehicles' dataset during its creation or modification. This is done to differentiate variables that have the same name but represent different data, thus avoiding problems when merging with other datasets.
Copied!
data vehicles(rename=(weight=weightLBS));
set vehicles;
run;
1
2
DATA vehicles(rename=(weight=weightLBS));
3
SET vehicles;
4
RUN;
5
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
« In SAS development, the success of a MERGE or JOIN statement is determined long before the code is executed. Data preparation is the most critical phase of any analytical pipeline. Failing to standardize variable attributes—such as Type, Length, and BY-variable values—leads to "dirty merges," where data is overwritten incorrectly or observations are excluded without warning. »
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. WeAreCAS is an independent community site and is not affiliated with SAS Institute Inc.
This site uses technical and analytical cookies to improve your experience.
Read more.