Examples: Prepare Data

Detailed functional analysis of different data preparation methods. Includes techniques for examining data (PROC CONTENTS, PROC PRINT), finding common variables (PROC SQL), creating unique BY values in case of duplicates (PROC FREQ, CATX function), managing data type differences (INPUT function), adjusting variable lengths (LENGTH or ATTRIB statement), and standardizing formats and labels (ATTRIB statement). The examples aim to ensure compatibility of common variable attributes for successful data combination operations.

Data Analysis

Type : CREATION_INTERNE

Examples use generated data (datalines) or SASHELP, as well as predefined datasets in SAS documentation.

1 Code Block

PROC CONTENTS / PROC SORT / PROC PRINT

Explanation :
This example uses the PROC CONTENTS, PROC SORT, and PROC PRINT procedures to examine descriptive information, sort, and print the 'Inventory', 'Sales', and 'Sales2019' datasets before combining them. The objective is to verify the compatibility of common variables and identify relationships between datasets.

Copied!

1	PROC CONTENTS DATA=Inventory; RUN; /* 1*/
2	PROC CONTENTS DATA=Sales; RUN;
3	PROC CONTENTS DATA=Sales2019; RUN;
4	QUIT;
5	PROC SORT DATA=inventory; BY PartNumber; RUN; /* 2*/
6	PROC SORT DATA=Sales; BY PartNumber; RUN;
7	PROC SORT DATA=Sales2019; BY PartNumber; RUN;
8
9	PROC PRINT DATA=inventory; RUN /* 3*/;
10	PROC PRINT DATA=Sales; RUN;
11	PROC PRINT DATA=Sales2019; RUN;

2 Code Block

PROC SQL

Explanation :
This example uses PROC SQL to find variables common to the 'Inventory', 'Sales', and 'Sales2019' datasets. It creates a temporary table 'commonvars' from 'dictionary.columns' and then selects the variable names that appear in all specified datasets.

Copied!

1	PROC SQL;
2	title "Variables Common to Inventory, Sales, and Sales2019";
3	create TABLE commonvars as /* 1 */
4	select memname, upcase(name) as name
5	from dictionary.columns
6	where LIBNAME='WORK' and
7	memname in ('INVENTORY', 'SALES', 'SALES2019');
8	select name /* 2 */
9	from commonvars
10	group BY name
11	having count(*)=(select count(distinct(memname)) from commonvars);
12	QUIT;

3 Code Block

PROC SORT / PROC FREQ / DATA STEP

Explanation :
This example shows how to handle duplicate values in a BY variable. It uses PROC FREQ to identify duplicates of 'partNumber' in the 'Sales' dataset, then a DATA step with the CATX function and the automatic variable _N_ to create a unique identifier ('uniqueID') for each observation, including duplicates.

Copied!

1	PROC SORT DATA=sales; /* 1 */
2	BY partNumber;
3	RUN;
4	PROC PRINT DATA=sales; RUN;
5
6	PROC FREQ DATA = Sales noprint; /* 2 */
7	tables partNumber / out = SalesDupes
8	(keep = partNumber Count
9	where = (Count > 1));
10	RUN;
11
12	PROC PRINT DATA=SalesDupes;RUN;
13
14	DATA SalesUnique; /* 3 */
15	SET Sales;
16	uniqueID = catx('.',partNumber,_n_);
17	RUN;
18
19	PROC PRINT DATA=SalesUnique; /* 4 */
20	var uniqueID partNumber partName salesPerson;
21	RUN;

4 Code Block

DATA STEP / PROC CONTENTS

Explanation :
This example illustrates an incompatible data type issue when merging datasets. Initially, a merge attempt fails because the 'Weight' variable is of type CHAR in 'CarsSmall' and NUMERIC in 'Cars'. To resolve this, the example shows how to convert the 'Weight' variable from 'CarsSmall' to numeric using the INPUT function and rename it, before proceeding with the merge.

Copied!

1	DATA cars; /* 1 */
2	SET sashelp.cars;
3	RUN;
4
5	PROC CONTENTS DATA=cars; RUN; /* 2 */
6	PROC CONTENTS DATA=CarsSmall; RUN;
7
8	DATA combineCars;
9	MERGE cars CarsSmallNum; /* 3*/
10	BY make model;
11	keep Make DriveTrain Model MakeModelDrive Weight;
12	RUN;

5 Code Block

DATA STEP / PROC SORT / PROC CONTENTS / PROC PRINT

Explanation :
Correction of the previous example: This code resolves the data type incompatibility problem by creating a new dataset 'CarsSmallNum' where the 'Weight' variable is converted to numeric using the INPUT function. The original variable is dropped and the new one is renamed. Then, the datasets are sorted and successfully merged. PROC CONTENTS and PROC PRINT are used to verify the result.

Copied!

1	DATA CarsSmallNum; /* 1 */
2	SET CarsSmall;
3	weightNum=INPUT(weight, 8.);
4	drop weight;
5	rename WeightNum=weight;
6	RUN;
7
8	PROC SORT DATA=cars; BY make model; RUN; /* 2 */
9	PROC SORT DATA=CarsSmallNum; BY Make Model; RUN;
10
11	DATA combineCars;
12	MERGE cars CarsSmallNum; /* 3 */
13	BY make model;
14	keep Make DriveTrain Model MakeModelDrive Weight;
15	RUN;
16	PROC CONTENTS DATA=combineCars; RUN; /* 4*/
17	PROC PRINT DATA=combineCars; RUN;

6 Code Block

PROC PRINT / DATA STEP / PROC CONTENTS

Explanation :
This example demonstrates how to handle different lengths for a common variable ('Mileage') when merging datasets. It shows that SAS by default uses the length of the first table listed in the MERGE statement. To explicitly control the length, the LENGTH statement must be used before the MERGE statement in the DATA step.

Copied!

1	PROC PRINT DATA=quarter1; RUN; /* 1 */
2	PROC PRINT DATA=quarter2; RUN;
3	PROC PRINT DATA=quarter3; RUN;
4	PROC PRINT DATA=quarter4; RUN;
5
6	DATA yearly;
7	MERGE quarter1 quarter2 quarter3 quarter4; /* 2 */
8	BY Account;
9	RUN;
10
11	DATA yearly; /* 3 */
12	LENGTH Mileage 6;
13	MERGE quarter1 quarter2 quarter3 quarter4;
14	BY Account;
15	RUN;
16	PROC CONTENTS DATA=yearly; RUN; /* 4 */

7 Code Block

DATA STEP

Explanation :
Correction of the previous example: This version shows how to use the ATTRIB statement to explicitly specify the length of the 'Mileage' variable to 6 bytes before merging the datasets, thus ensuring that the desired length is applied in the output dataset.

Copied!

1	DATA yearly;
2	MERGE quarter1 quarter2 quarter3 quarter4;
3	BY Account;
4	attrib Mileage
5	LENGTH = 6;
6	RUN;

8 Code Block

PROC CONTENTS / PROC SORT / DATA STEP / PROC PRINT

Explanation :
This example handles differences in formats and labels for common variables ('Height' and 'Weight') when merging 'class' and 'classfit' datasets. It shows how the ATTRIB statement can be used in a DATA step to explicitly define the label and format of variables in the output dataset, overriding SAS's default behavior of taking attributes from the first dataset listed.

Copied!

1	PROC CONTENTS DATA=class; RUN; /* 1 */
2	PROC CONTENTS DATA=classfit; RUN;
3
4	PROC SORT DATA=class; BY name; RUN; /* 2 */
5	PROC SORT DATA=classfit; BY name; RUN;
6
7	DATA merged;
8	MERGE class classfit; BY Name;
9	attrib Weight
10	label = "Weight";
11	attrib Height Weight Predict FORMAT=comma8.2; /* 3 */
12	RUN;
13	PROC PRINT DATA=merged; /* 4 */
14	RUN;
15	PROC CONTENTS DATA=merged; RUN;

9 Code Block

DATA STEP

Explanation :
This example addresses the case where variables have the same name but represent different data (and are therefore not truly common variables). The solution is to use the RENAME= dataset option in the DATA statement to rename the 'Weight' variable to 'weightLBS' in the 'vehicles' dataset (from CarsSmall), thus avoiding a conflict or incorrect merge when combining data.

Copied!

1
2	DATA vehicles(rename=(weight=weightLBS));
3	SET vehicles;
4	RUN;
5

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Retour à la liste