Published on :
ETL CREATION_INTERNE

Analysis and merging of corporate and financial data

This code is also available in: Deutsch Español Français
Awaiting validation
The script begins by defining two temporary datasets, 'company' and 'finance', using 'datalines' blocks for direct data input. The 'company' dataset contains demographic information about employees (Name, Age, Gender), while 'finance' contains financial details (Phone Number, Name, Salary). Each dataset is sorted by 'Name' using PROC SORT, which is an essential preparatory step for merging. The sorted datasets are then printed. The script then merges 'company' and 'finance' into a new dataset 'employeeinfo' based on the common variable 'Name'. Additional PROC PRINT procedures are used to display the merged dataset, as well as to demonstrate the use of the global options 'firstobs' and 'obs', the 'WHERE' clause for filtering, and an attempt to calculate a sum for a categorical variable (which is noted as a potential error). Finally, PROC CONTENTS and PROC DATASETS are used to examine the metadata of the datasets, including 'company'.
Data Analysis

Type : CREATION_INTERNE


Data for the 'company' and 'finance' datasets are created directly within the SAS script via 'datalines' blocks. No external data or SASHELP libraries are used as primary data sources for these datasets.

1 Code Block
DATA STEP Data
Explanation :
This DATA STEP block creates the 'company' dataset using in-line data provided in the 'datalines' block. It defines the 'Name' (character), 'Age' (numeric), and 'Gender' (character) variables with their respective column positions. This dataset contains basic employee information.
Copied!
1DATA company;
2INPUT
3Name $ 1-25
4Age 27-28
5Gender $ 30;
6 DATALINES;
7Vincent, Martina 34 F
8Phillipon, Marie-Odile 28 F
9Gunter, Thomas 27 M
10Harbinger, Nicholas 36 M
11Benito, Gisela 32 F
12Rudelich, Herbert 39 M
13Sirignano, Emily 12 F
14Morrison, Michael 32 M
15;
16 
2 Code Block
PROC SORT
Explanation :
This PROC SORT sorts the 'company' dataset by the 'Name' variable. Sorting is essential for subsequent operations like merging ('MERGE') which typically require input datasets to be sorted by the merge variables ('BY' variables).
Copied!
1PROC SORT DATA=company;
2 BY Name;
3RUN;
4 
3 Code Block
PROC PRINT
Explanation :
This PROC PRINT displays the contents of the 'company' dataset in the output window. The 'title' specifies the title of the printed report, making it more descriptive.
Copied!
1PROC PRINT DATA=company;
2 title 'Company';
3RUN;
4 
4 Code Block
DATA STEP Data
Explanation :
This second DATA STEP block creates the 'finance' dataset, similar to the 'company' block, using 'datalines'. It captures the 'Phone_number' (character), 'Name' (character), and 'Salary' (numeric) variables, containing financial information for certain employees.
Copied!
1DATA finance;
2 INPUT Phone_number $ 1-11
3 Name $ 13-40
4 Salary;
5 DATALINES;
6074-53-9892 Vincent, Martina 35000
7776-84-5391 Phillipon, Marie-Odile 29750
8929-75-0218 Gunter, Thomas 27500
9446-93-2122 Harbinger, Nicholas 33900
10228-88-9649 Benito, Gisela 28000
11029-46-9261 Rudelich, Herbert 35000
12442-21-8075 Sirignano, Emily 5000
13;
14 
5 Code Block
PROC SORT
Explanation :
Similar to sorting the 'company' dataset, this PROC SORT sorts the 'finance' dataset by the 'Name' variable in preparation for the merge operation.
Copied!
1PROC SORT DATA=finance;
2 BY Name;
3RUN;
4 
6 Code Block
PROC PRINT
Explanation :
This PROC PRINT displays the contents of the 'finance' dataset with the title 'Employee Information', providing a view of individual financial data.
Copied!
1PROC PRINT DATA=finance;
2 title 'Employee Information';
3RUN;
4 
7 Code Block
DATA STEP Data
Explanation :
This DATA STEP merges the 'company' and 'finance' datasets into a new dataset called 'employeeinfo'. The 'BY name;' clause indicates that the merge should be performed by matching observations based on the 'Name' variable. If a name appears in one dataset but not the other, the values of the non-matching variables will be missing. The input datasets must be sorted by the 'BY' variable.
Copied!
1DATA employeeinfo;
2 MERGE company finance;
3 BY name;
4RUN;
5 
8 Code Block
PROC PRINT
Explanation :
This PROC PRINT displays the 'employeeinfo' dataset which is the result of the merge. The use of 'title' and 'title2' allows for two lines of title for the report output.
Copied!
1PROC PRINT DATA=employeeinfo;
2 title 'Laxmi is Cute!';
3 title2 'very cute';
4RUN;
5 
9 Code Block
PROC CONTENTS
Explanation :
PROC CONTENTS provides detailed metadata about the 'company' dataset. This includes information such as the number of observations, the number of variables, variable names, their types (numeric or character), their lengths, and their formats.
Copied!
1PROC CONTENTS DATA = company;
2RUN;
3 
10 Code Block
PROC DATASETS
Explanation :
PROC DATASETS is a library management procedure that allows various operations on datasets and catalogs. Here, the 'contents data= company;' statement is used to obtain the same metadata as PROC CONTENTS for the 'company' dataset. It is often used in a context where multiple dataset management operations are performed within a single PROC DATASETS execution.
Copied!
1PROC DATASETS;
2contents DATA= company;
3RUN;
4 
11 Code Block
OPTIONS et PROC PRINT
Explanation :
This block demonstrates the use of the global options 'firstobs' and 'obs'. 'options firstobs=3;' instructs SAS to start reading data from the third observation, and 'obs=9;' instructs it to stop after the ninth observation. Therefore, this PROC PRINT will display observations 3 through 9 (inclusive) of the 'employeeinfo' dataset. These options affect all subsequent data steps and procedures until they are reset or changed.
Copied!
1options firstobs= 3 obs=9;
2PROC PRINT DATA = employeeinfo;
3RUN;
4 
12 Code Block
PROC PRINT
Explanation :
This PROC PRINT uses a 'WHERE' clause to filter observations from the 'employeeinfo' dataset. Only observations where the 'Name' variable contains the substring 'Vi' (case-insensitive by default in SAS for 'contains' unless the option is changed) will be displayed. This allows targeting specific subsets of data for display.
Copied!
1PROC PRINT DATA = employeeinfo;
2where Name contains 'Vi';
3RUN;
4 
13 Code Block
PROC PRINT
Explanation :
This PROC PRINT attempts to generate a report with subtotals. The 'by Name;' statement groups observations by name, and 'sum Gender;' is intended to calculate totals for the 'Gender' variable. However, 'Gender' is a character variable, and the 'SUM' statement in PROC PRINT is designed for numeric variables. This will result in a warning or error in the SAS log, and the sum for 'Gender' will not be displayed. The 'ID Gender;' statement means that the 'Gender' variable will be used as an identification variable in the report, but is not typically used in this context for meaningful grouping with 'BY' and 'SUM' on a categorical variable.
Copied!
1PROC PRINT DATA = employeeinfo;
2title 'Sub total';
3sum Gender;
4BY Name;
5ID Gender;
6RUN;
7 
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.