Published on :
Data Manipulation CREATION_INTERNE

Examples: Prepare Data

This code is also available in: Deutsch Español
Detailed functional analysis of different data preparation methods. Includes techniques for examining data (PROC CONTENTS, PROC PRINT), finding common variables (PROC SQL), creating unique BY values in case of duplicates (PROC FREQ, CATX function), managing data type differences (INPUT function), adjusting variable lengths (LENGTH or ATTRIB statement), and standardizing formats and labels (ATTRIB statement). The examples aim to ensure compatibility of common variable attributes for successful data combination operations.
Data Analysis

Type : CREATION_INTERNE


Examples use generated data (datalines) or SASHELP, as well as predefined datasets in SAS documentation.

1 Code Block
PROC CONTENTS / PROC SORT / PROC PRINT
Explanation :
This example uses the PROC CONTENTS, PROC SORT, and PROC PRINT procedures to examine descriptive information, sort, and print the 'Inventory', 'Sales', and 'Sales2019' datasets before combining them. The objective is to verify the compatibility of common variables and identify relationships between datasets.
Copied!
1PROC CONTENTS DATA=Inventory; RUN; /* 1*/
2PROC CONTENTS DATA=Sales; RUN;
3PROC CONTENTS DATA=Sales2019; RUN;
4QUIT;
5PROC SORT DATA=inventory; BY PartNumber; RUN; /* 2*/
6PROC SORT DATA=Sales; BY PartNumber; RUN;
7PROC SORT DATA=Sales2019; BY PartNumber; RUN;
8 
9PROC PRINT DATA=inventory; RUN /* 3*/;
10PROC PRINT DATA=Sales; RUN;
11PROC PRINT DATA=Sales2019; RUN;
2 Code Block
PROC SQL
Explanation :
This example uses PROC SQL to find variables common to the 'Inventory', 'Sales', and 'Sales2019' datasets. It creates a temporary table 'commonvars' from 'dictionary.columns' and then selects the variable names that appear in all specified datasets.
Copied!
1PROC SQL;
2 title "Variables Common to Inventory, Sales, and Sales2019";
3 create TABLE commonvars as /* 1 */
4 select memname, upcase(name) as name
5 from dictionary.columns
6 where LIBNAME='WORK' and
7 memname in ('INVENTORY', 'SALES', 'SALES2019');
8 select name /* 2 */
9 from commonvars
10 group BY name
11 having count(*)=(select count(distinct(memname)) from commonvars);
12QUIT;
3 Code Block
PROC SORT / PROC FREQ / DATA STEP
Explanation :
This example shows how to handle duplicate values in a BY variable. It uses PROC FREQ to identify duplicates of 'partNumber' in the 'Sales' dataset, then a DATA step with the CATX function and the automatic variable _N_ to create a unique identifier ('uniqueID') for each observation, including duplicates.
Copied!
1PROC SORT DATA=sales; /* 1 */
2 BY partNumber;
3RUN;
4PROC PRINT DATA=sales; RUN;
5 
6PROC FREQ DATA = Sales noprint; /* 2 */
7 tables partNumber / out = SalesDupes
8 (keep = partNumber Count
9 where = (Count > 1));
10RUN;
11 
12PROC PRINT DATA=SalesDupes;RUN;
13 
14DATA SalesUnique; /* 3 */
15 SET Sales;
16 uniqueID = catx('.',partNumber,_n_);
17RUN;
18 
19PROC PRINT DATA=SalesUnique; /* 4 */
20 var uniqueID partNumber partName salesPerson;
21RUN;
4 Code Block
DATA STEP / PROC CONTENTS
Explanation :
This example illustrates an incompatible data type issue when merging datasets. Initially, a merge attempt fails because the 'Weight' variable is of type CHAR in 'CarsSmall' and NUMERIC in 'Cars'. To resolve this, the example shows how to convert the 'Weight' variable from 'CarsSmall' to numeric using the INPUT function and rename it, before proceeding with the merge.
Copied!
1DATA cars; /* 1 */
2 SET sashelp.cars;
3RUN;
4 
5PROC CONTENTS DATA=cars; RUN; /* 2 */
6PROC CONTENTS DATA=CarsSmall; RUN;
7 
8DATA combineCars;
9 MERGE cars CarsSmallNum; /* 3*/
10 BY make model;
11 keep Make DriveTrain Model MakeModelDrive Weight;
12RUN;
5 Code Block
DATA STEP / PROC SORT / PROC CONTENTS / PROC PRINT
Explanation :
Correction of the previous example: This code resolves the data type incompatibility problem by creating a new dataset 'CarsSmallNum' where the 'Weight' variable is converted to numeric using the INPUT function. The original variable is dropped and the new one is renamed. Then, the datasets are sorted and successfully merged. PROC CONTENTS and PROC PRINT are used to verify the result.
Copied!
1DATA CarsSmallNum; /* 1 */
2 SET CarsSmall;
3 weightNum=INPUT(weight, 8.);
4 drop weight;
5 rename WeightNum=weight;
6RUN;
7 
8PROC SORT DATA=cars; BY make model; RUN; /* 2 */
9PROC SORT DATA=CarsSmallNum; BY Make Model; RUN;
10 
11DATA combineCars;
12 MERGE cars CarsSmallNum; /* 3 */
13 BY make model;
14 keep Make DriveTrain Model MakeModelDrive Weight;
15RUN;
16PROC CONTENTS DATA=combineCars; RUN; /* 4*/
17PROC PRINT DATA=combineCars; RUN;
6 Code Block
PROC PRINT / DATA STEP / PROC CONTENTS
Explanation :
This example demonstrates how to handle different lengths for a common variable ('Mileage') when merging datasets. It shows that SAS by default uses the length of the first table listed in the MERGE statement. To explicitly control the length, the LENGTH statement must be used before the MERGE statement in the DATA step.
Copied!
1PROC PRINT DATA=quarter1; RUN; /* 1 */
2PROC PRINT DATA=quarter2; RUN;
3PROC PRINT DATA=quarter3; RUN;
4PROC PRINT DATA=quarter4; RUN;
5 
6DATA yearly;
7 MERGE quarter1 quarter2 quarter3 quarter4; /* 2 */
8 BY Account;
9RUN;
10 
11DATA yearly; /* 3 */
12 LENGTH Mileage 6;
13 MERGE quarter1 quarter2 quarter3 quarter4;
14 BY Account;
15RUN;
16PROC CONTENTS DATA=yearly; RUN; /* 4 */
7 Code Block
DATA STEP
Explanation :
Correction of the previous example: This version shows how to use the ATTRIB statement to explicitly specify the length of the 'Mileage' variable to 6 bytes before merging the datasets, thus ensuring that the desired length is applied in the output dataset.
Copied!
1DATA yearly;
2 MERGE quarter1 quarter2 quarter3 quarter4;
3 BY Account;
4 attrib Mileage
5 LENGTH = 6;
6RUN;
8 Code Block
PROC CONTENTS / PROC SORT / DATA STEP / PROC PRINT
Explanation :
This example handles differences in formats and labels for common variables ('Height' and 'Weight') when merging 'class' and 'classfit' datasets. It shows how the ATTRIB statement can be used in a DATA step to explicitly define the label and format of variables in the output dataset, overriding SAS's default behavior of taking attributes from the first dataset listed.
Copied!
1PROC CONTENTS DATA=class; RUN; /* 1 */
2PROC CONTENTS DATA=classfit; RUN;
3 
4PROC SORT DATA=class; BY name; RUN; /* 2 */
5PROC SORT DATA=classfit; BY name; RUN;
6 
7DATA merged;
8 MERGE class classfit; BY Name;
9 attrib Weight
10 label = "Weight";
11 attrib Height Weight Predict FORMAT=comma8.2; /* 3 */
12RUN;
13PROC PRINT DATA=merged; /* 4 */
14RUN;
15PROC CONTENTS DATA=merged; RUN;
9 Code Block
DATA STEP
Explanation :
This example addresses the case where variables have the same name but represent different data (and are therefore not truly common variables). The solution is to use the RENAME= dataset option in the DATA statement to rename the 'Weight' variable to 'weightLBS' in the 'vehicles' dataset (from CarsSmall), thus avoiding a conflict or incorrect merge when combining data.
Copied!
1 
2DATA vehicles(rename=(weight=weightLBS));
3SET vehicles;
4RUN;
5 
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All Rights Reserved