ETL SAS VIYA CAS

Mastering Data Imputation in SAS Viya: Using SINGLE=YES for Accurate Time-Series

This code is also available in: Deutsch Español
Difficulty Level
Beginner
Published on :
Michael

Expert Advice

Michael
Responsable de l'infrastructure Viya.

If your dataset is too large for single-threaded processing, consider using a CAS Data Step with a partition by clause. This allows you to run multiple threads in parallel—one for each group—as long as your imputation logic stays within the boundaries of those groups.

The LAG function is crucial for sequential processing in a DATA Step, especially for data imputation. In a SAS© Viya environment, if data is processed in distributed memory (CAS), it is imperative to use the SESSREF='casauto' SINGLE=YES option in the DATA statement. This option ensures that processing occurs on a single thread on a single CAS node, thereby avoiding unpredictable results due to data distribution that would prevent the LAG function from reliably seeing the previous observation. The example shows a conditional imputation where, if the current value is missing, it is estimated from the previous year's value (increased by 10%). If the previous value is also missing, the imputation is based on the value from two years prior (increased by 20%).
Data Analysis

Type : CREATION_INTERNE


The examples use generated data (datalines) to ensure autonomy and reproducibility.

1 Code Block
DATA STEP Data
Explanation :
This example illustrates a simple imputation. If sales for a year are missing, they are imputed by increasing the previous year's sales by 5%. The `last_ventes` variable is used to retain the last known value, which is essential if the LAG-based imputation also encounters a missing value. The `single=yes` option is specified to ensure sequential processing on a single CAS node, which is necessary for the `LAG` function in a distributed environment.
Copied!
1LIBNAME mycas cas;
2 
3DATA mycas.ventes_annuelles;
4 INPUT Annee Ventes;
5 DATALINES;
6 2020 1000
7 2021 1100
8 2022 .
9 2023 1300
10 2024 .
11 ;
12RUN;
13 
14DATA mycas.ventes_imputees (sessref="casauto" single=yes);
15 SET mycas.ventes_annuelles;
16 retain last_ventes;
17 IF Ventes = . THEN DO;
18 lag_ventes = lag(Ventes);
19 IF lag_ventes ne . THEN Ventes_imputees = lag_ventes * 1.05; /* Imputation avec +5% de l'année précédente */
20 ELSE Ventes_imputees = last_ventes * 1.05; /* Utilise la dernière vente connue si LAG est aussi manquant */
21 END;
22 ELSE DO;
23 Ventes_imputees = Ventes;
24 last_ventes = Ventes;
25 END;
26RUN;
27 
28PROC PRINT DATA=mycas.ventes_imputees;
29 title "Exemple 1: Imputation Basique des Ventes Manquantes";
30RUN;
31 
32PROC CAS;
33 TABLE.dropTable / caslib="mycas" name="ventes_annuelles";
34 TABLE.dropTable / caslib="mycas" name="ventes_imputees";
35RUN;
36 
2 Code Block
DATA STEP Data
Explanation :
This advanced example uses both `LAG` and `LAG2` to handle consecutive missing values. For each product (defined by `BY Produit`), if the quantity is missing, the script first attempts to impute it with the quantity from the previous month (increased by 2%). If the quantity from the previous month is also missing, it uses the quantity from two months prior (increased by 4%). If no previous value is available, the quantity is imputed as 0. Initializing `lag_quantite` and `lag2_quantite` for each new `Produit` group is crucial for correct calculations.
Copied!
1LIBNAME mycas cas;
2 
3DATA mycas.stock_produit;
4 INPUT Produit $ Mois Quantite;
5 DATALINES;
6 A 1 100
7 A 2 110
8 A 3 .
9 A 4 .
10 A 5 130
11 B 1 200
12 B 2 .
13 B 3 220
14 B 4 230
15 ;
16RUN;
17 
18DATA mycas.stock_impute (sessref="casauto" single=yes);
19 SET mycas.stock_produit;
20 BY Produit;
21 IF first.Produit THEN DO;
22 lag_quantite = .;
23 lag2_quantite = .;
24 END;
25 ELSE DO;
26 lag_quantite = lag(Quantite);
27 lag2_quantite = lag2(Quantite);
28 END;
29 
30 IF Quantite = . THEN DO;
31 IF lag_quantite ne . THEN Quantite_imputee = lag_quantite * 1.02; /* +2% de la valeur précédente */
32 ELSE IF lag2_quantite ne . THEN Quantite_imputee = lag2_quantite * 1.04; /* +4% de la valeur d'avant l'année précédente */
33 ELSE Quantite_imputee = 0; /* Si aucune valeur précédente n'est disponible */
34 END;
35 ELSE Quantite_imputee = Quantite;
36RUN;
37 
38PROC PRINT DATA=mycas.stock_impute;
39 title "Exemple 2: Imputation de Quantités Manquantes avec LAG et LAG2";
40RUN;
41 
42PROC CAS;
43 TABLE.dropTable / caslib="mycas" name="stock_produit";
44 TABLE.dropTable / caslib="mycas" name="stock_impute";
45RUN;
46 
3 Code Block
DATA STEP Data
Explanation :
This example implements advanced imputation with BY-group processing (`BY Departement Annee`). For each department and year, missing 'Reclamations' values are imputed with the previous month's value. If the previous value is also missing, the last non-missing value within the same group is used. `RETAIN` and `FIRST.BYVAR` are used to manage state across groups and reset `LAG` at the beginning of each new group. This is particularly useful for panel data where sequences must be respected by group.
Copied!
1LIBNAME mycas cas;
2 
3DATA mycas.reclamations_mensuelles;
4 INPUT Departement $ Annee Mois Reclamations;
5 DATALINES;
6 IT 2022 1 10
7 IT 2022 2 12
8 IT 2022 3 .
9 IT 2022 4 15
10 IT 2023 1 .
11 IT 2023 2 18
12 HR 2022 1 20
13 HR 2022 2 .
14 HR 2022 3 25
15 HR 2023 1 22
16 HR 2023 2 .
17 HR 2023 3 .
18 HR 2023 4 30
19 ;
20RUN;
21 
22DATA mycas.reclamations_imputees (sessref="casauto" single=yes);
23 SET mycas.reclamations_mensuelles;
24 BY Departement Annee;
25 retain last_reclamations_in_group;
26 
27 IF first.Departement or first.Annee THEN DO;
28 lag_reclamations = .;
29 last_reclamations_in_group = .;
30 END;
31 ELSE lag_reclamations = lag(Reclamations);
32 
33 IF Reclamations = . THEN DO;
34 IF lag_reclamations ne . THEN Reclamations_imputees = lag_reclamations;
35 ELSE IF last_reclamations_in_group ne . THEN Reclamations_imputees = last_reclamations_in_group;
36 ELSE Reclamations_imputees = 0; /* Imputation par 0 si aucune valeur précédente */
37 END;
38 ELSE DO;
39 Reclamations_imputees = Reclamations;
40 last_reclamations_in_group = Reclamations;
41 END;
42RUN;
43 
44PROC PRINT DATA=mycas.reclamations_imputees;
45 title "Exemple 3: Imputation de Réclamations Manquantes par Département et Année";
46RUN;
47 
48PROC CAS;
49 TABLE.dropTable / caslib="mycas" name="reclamations_mensuelles";
50 TABLE.dropTable / caslib="mycas" name="reclamations_imputees";
51RUN;
52 
4 Code Block
DATA STEP Data
Explanation :
This example demonstrates integration in SAS Viya/CAS for response time imputation. It uses `SESSREF="casauto" SINGLE=YES` to ensure that `LAG` operations are performed correctly on a single thread in the distributed CAS environment. Missing values are imputed using the last known value within each `Service`. If no previous value is available (e.g., for the first observation of a service with a missing value), a default value of 30 is used. This case highlights the robustness of the imputation logic and the handling of edge cases.
Copied!
1LIBNAME mycas cas;
2 
3DATA mycas.temps_reponse;
4 INPUT Service $ Date : yymmdd. TempsReponse;
5 FORMAT Date yymmdd10.;
6 DATALINES;
7 Email 20230101 15
8 Email 20230102 20
9 Email 20230103 .
10 Email 20230104 .
11 Email 20230105 25
12 Chat 20230101 5
13 Chat 20230102 .
14 Chat 20230103 8
15 ;
16RUN;
17 
18DATA mycas.temps_reponse_impute (sessref="casauto" single=yes);
19 SET mycas.temps_reponse;
20 BY Service;
21 retain last_temps_reponse_service;
22 
23 IF first.Service THEN DO;
24 lag_temps_reponse = .;
25 last_temps_reponse_service = .;
26 END;
27 ELSE lag_temps_reponse = lag(TempsReponse);
28 
29 IF TempsReponse = . THEN DO;
30 IF lag_temps_reponse ne . THEN TempsReponse_impute = lag_temps_reponse;
31 ELSE IF last_temps_reponse_service ne . THEN TempsReponse_impute = last_temps_reponse_service;
32 ELSE TempsReponse_impute = 30; /* Imputation par défaut à 30 si aucune valeur précédente */
33 END;
34 ELSE DO;
35 TempsReponse_impute = TempsReponse;
36 last_temps_reponse_service = TempsReponse;
37 END;
38RUN;
39 
40PROC PRINT DATA=mycas.temps_reponse_impute;
41 title "Exemple 4: Imputation de Temps de Réponse Manquants dans CAS";
42RUN;
43 
44PROC CAS;
45 TABLE.dropTable / caslib="mycas" name="temps_reponse";
46 TABLE.dropTable / caslib="mycas" name="temps_reponse_impute";
47RUN;
48 
Pro Tip
If your dataset is too large for single-threaded processing, consider using a CAS Data Step with a partition by clause. This allows you to run multiple threads in parallel—one for each group—as long as your imputation logic stays within the boundaries of those groups.
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All Rights Reserved.


Related Documentation

Aucune documentation spécifique pour cette catégorie.