dataPreprocess catTrans

HR Analytics: Handling Missing Data with One-Hot Encoding

Test Scenario & Use Case

Business Context

The HR department is building a neural network to predict employee attrition. The dataset contains a 'Certification' field that is often empty (NULL). In this business context, a missing value is not a data-entry error; it means 'No Certification'. The goal is to one-hot encode the field (one binary flag column per category) while explicitly treating the missing value as a valid category of its own, for example by producing a flag such as 'Certification_Missing' = 1.

About the Action Set: dataPreprocess

Data cleaning, imputation, and preprocessing.

Data Preparation

Dataset with intentional missing values in a categorical column.

/* Assumes an active CAS session and a CASUSER libref bound to the casuser caslib */
DATA casuser.employee_data;
    LENGTH certification $10;
    DO i = 1 TO 100;
        r = rand('Uniform');
        IF r < 0.3 THEN certification = 'PMP';
        ELSE IF r < 0.6 THEN certification = 'MBA';
        ELSE CALL missing(certification); /* Explicit missing value (about 40% of rows) */
        OUTPUT;
    END;
RUN;
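
Before encoding, it can help to confirm that the generated table really contains the three expected groups (PMP, MBA, and missing). The following is a minimal sketch, assuming the `casuser` libref is assigned in the current session; the `MISSING` option on the `TABLES` statement makes PROC FREQ list missing values as a level of their own instead of excluding them from the table.

```sas
/* Frequency check: the MISSING option reports blank certifications as a category */
proc freq data=casuser.employee_data;
    tables certification / missing;
run;
```

Because the data is generated without a fixed random seed, the exact counts vary from run to run.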

Implementation Steps

1. One-hot encoding with explicit missing-value inclusion
/* One-hot encode 'certification', keeping missing values as their own group */
PROC CAS;
    dataPreprocess.catTrans /
        table={name='employee_data', caslib='casuser'},
        method='ONEHOT',
        inputs={{name='certification'}},
        includeMissingGroup=true,   /* treat missing as a valid category */
        casOut={name='employee_encoded', caslib='casuser', replace=true};
RUN;
QUIT;
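
To see which columns the action actually produced, the encoded table can be inspected directly. This is a small sketch using the `table.columnInfo` and `table.fetch` actions; it assumes the `employee_encoded` table from the step above exists in the `casuser` caslib, and the choice of five rows is arbitrary.

```sas
/* Inspect the encoded table produced by catTrans */
proc cas;
    /* List the column names and types of the output table */
    table.columnInfo /
        table={name='employee_encoded', caslib='casuser'};
    /* Show the first few rows to check the flag values */
    table.fetch /
        table={name='employee_encoded', caslib='casuser'},
        to=5;
run;
quit;
```

Checking the column list is a safer way to learn the generated names than relying on a naming pattern.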

Expected Result


The output table 'employee_encoded' contains one binary column per certification type (e.g., 'certification_PMP', 'certification_MBA'). Because 'includeMissingGroup=true' was set, missing values are also assigned to a dedicated column (bin) of their own, so the 'No Certification' status is captured as an explicit feature for the neural network.
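
For comparison, the same encoding can be written by hand in a DATA step, which makes the meaning of each flag explicit. This is only an illustrative sketch, not what catTrans runs internally: the output table `work.employee_flags` and the `flag_` column names are hypothetical and will not necessarily match the names the action generates.

```sas
/* Hand-coded one-hot flags, with missing treated as its own category */
data work.employee_flags;
    set casuser.employee_data;
    flag_PMP     = (certification = 'PMP');   /* 1 if PMP, else 0 */
    flag_MBA     = (certification = 'MBA');   /* 1 if MBA, else 0 */
    flag_Missing = missing(certification);    /* 1 if no certification, else 0 */
run;

/* Each row sets exactly one flag, so the three sums should add up to the row count */
proc means data=work.employee_flags sum;
    var flag_:;
run;
```

If the sums in the hand-coded version and the counts in the encoded table disagree, the missing group is the first place to look.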