Mastering Data Imputation in SAS Viya: Using SINGLE=YES for Accurate Time-Series

Difficulty Level

Beginner

Published on : 07/05/2021

Expert Advice

Michael
Responsable de l'infrastructure Viya.

If your dataset is too large for single-threaded processing, consider using a CAS Data Step with a partition by clause. This allows you to run multiple threads in parallel—one for each group—as long as your imputation logic stays within the boundaries of those groups.

The LAG function is crucial for sequential processing in a DATA Step, especially for data imputation. In a SAS^© Viya^™ environment, if data is processed in distributed memory (CAS), it is imperative to use the SESSREF='casauto' SINGLE=YES option in the DATA statement. This option ensures that processing occurs on a single thread on a single CAS node, thereby avoiding unpredictable results due to data distribution that would prevent the LAG function from reliably seeing the previous observation. The example shows a conditional imputation where, if the current value is missing, it is estimated from the previous year's value (increased by 10%). If the previous value is also missing, the imputation is based on the value from two years prior (increased by 20%).

Data Analysis

Type : CREATION_INTERNE

The examples use generated data (datalines) to ensure autonomy and reproducibility.

1 Code Block

DATA STEP Data

Explanation :
This example illustrates a simple imputation. If sales for a year are missing, they are imputed by increasing the previous year's sales by 5%. The `last_ventes` variable is used to retain the last known value, which is essential if the LAG-based imputation also encounters a missing value. The `single=yes` option is specified to ensure sequential processing on a single CAS node, which is necessary for the `LAG` function in a distributed environment.

Copied!

1	LIBNAME mycas cas;
2
3	DATA mycas.ventes_annuelles;
4	INPUT Annee Ventes;
5	DATALINES;
6	2020 1000
7	2021 1100
8	2022 .
9	2023 1300
10	2024 .
11	;
12	RUN;
13
14	DATA mycas.ventes_imputees (sessref="casauto" single=yes);
15	SET mycas.ventes_annuelles;
16	retain last_ventes;
17	IF Ventes = . THEN DO;
18	lag_ventes = lag(Ventes);
19	IF lag_ventes ne . THEN Ventes_imputees = lag_ventes * 1.05; /* Imputation avec +5% de l'année précédente */
20	ELSE Ventes_imputees = last_ventes * 1.05; /* Utilise la dernière vente connue si LAG est aussi manquant */
21	END;
22	ELSE DO;
23	Ventes_imputees = Ventes;
24	last_ventes = Ventes;
25	END;
26	RUN;
27
28	PROC PRINT DATA=mycas.ventes_imputees;
29	title "Exemple 1: Imputation Basique des Ventes Manquantes";
30	RUN;
31
32	PROC CAS;
33	TABLE.dropTable / caslib="mycas" name="ventes_annuelles";
34	TABLE.dropTable / caslib="mycas" name="ventes_imputees";
35	RUN;
36

2 Code Block

DATA STEP Data

Explanation :
This advanced example uses both `LAG` and `LAG2` to handle consecutive missing values. For each product (defined by `BY Produit`), if the quantity is missing, the script first attempts to impute it with the quantity from the previous month (increased by 2%). If the quantity from the previous month is also missing, it uses the quantity from two months prior (increased by 4%). If no previous value is available, the quantity is imputed as 0. Initializing `lag_quantite` and `lag2_quantite` for each new `Produit` group is crucial for correct calculations.

Copied!

1	LIBNAME mycas cas;
2
3	DATA mycas.stock_produit;
4	INPUT Produit $ Mois Quantite;
5	DATALINES;
6	A 1 100
7	A 2 110
8	A 3 .
9	A 4 .
10	A 5 130
11	B 1 200
12	B 2 .
13	B 3 220
14	B 4 230
15	;
16	RUN;
17
18	DATA mycas.stock_impute (sessref="casauto" single=yes);
19	SET mycas.stock_produit;
20	BY Produit;
21	IF first.Produit THEN DO;
22	lag_quantite = .;
23	lag2_quantite = .;
24	END;
25	ELSE DO;
26	lag_quantite = lag(Quantite);
27	lag2_quantite = lag2(Quantite);
28	END;
29
30	IF Quantite = . THEN DO;
31	IF lag_quantite ne . THEN Quantite_imputee = lag_quantite * 1.02; /* +2% de la valeur précédente */
32	ELSE IF lag2_quantite ne . THEN Quantite_imputee = lag2_quantite * 1.04; /* +4% de la valeur d'avant l'année précédente */
33	ELSE Quantite_imputee = 0; /* Si aucune valeur précédente n'est disponible */
34	END;
35	ELSE Quantite_imputee = Quantite;
36	RUN;
37
38	PROC PRINT DATA=mycas.stock_impute;
39	title "Exemple 2: Imputation de Quantités Manquantes avec LAG et LAG2";
40	RUN;
41
42	PROC CAS;
43	TABLE.dropTable / caslib="mycas" name="stock_produit";
44	TABLE.dropTable / caslib="mycas" name="stock_impute";
45	RUN;
46

3 Code Block

DATA STEP Data

Explanation :
This example implements advanced imputation with BY-group processing (`BY Departement Annee`). For each department and year, missing 'Reclamations' values are imputed with the previous month's value. If the previous value is also missing, the last non-missing value within the same group is used. `RETAIN` and `FIRST.BYVAR` are used to manage state across groups and reset `LAG` at the beginning of each new group. This is particularly useful for panel data where sequences must be respected by group.

Copied!

1	LIBNAME mycas cas;
2
3	DATA mycas.reclamations_mensuelles;
4	INPUT Departement $ Annee Mois Reclamations;
5	DATALINES;
6	IT 2022 1 10
7	IT 2022 2 12
8	IT 2022 3 .
9	IT 2022 4 15
10	IT 2023 1 .
11	IT 2023 2 18
12	HR 2022 1 20
13	HR 2022 2 .
14	HR 2022 3 25
15	HR 2023 1 22
16	HR 2023 2 .
17	HR 2023 3 .
18	HR 2023 4 30
19	;
20	RUN;
21
22	DATA mycas.reclamations_imputees (sessref="casauto" single=yes);
23	SET mycas.reclamations_mensuelles;
24	BY Departement Annee;
25	retain last_reclamations_in_group;
26
27	IF first.Departement or first.Annee THEN DO;
28	lag_reclamations = .;
29	last_reclamations_in_group = .;
30	END;
31	ELSE lag_reclamations = lag(Reclamations);
32
33	IF Reclamations = . THEN DO;
34	IF lag_reclamations ne . THEN Reclamations_imputees = lag_reclamations;
35	ELSE IF last_reclamations_in_group ne . THEN Reclamations_imputees = last_reclamations_in_group;
36	ELSE Reclamations_imputees = 0; /* Imputation par 0 si aucune valeur précédente */
37	END;
38	ELSE DO;
39	Reclamations_imputees = Reclamations;
40	last_reclamations_in_group = Reclamations;
41	END;
42	RUN;
43
44	PROC PRINT DATA=mycas.reclamations_imputees;
45	title "Exemple 3: Imputation de Réclamations Manquantes par Département et Année";
46	RUN;
47
48	PROC CAS;
49	TABLE.dropTable / caslib="mycas" name="reclamations_mensuelles";
50	TABLE.dropTable / caslib="mycas" name="reclamations_imputees";
51	RUN;
52

4 Code Block

DATA STEP Data

Explanation :
This example demonstrates integration in SAS Viya/CAS for response time imputation. It uses `SESSREF="casauto" SINGLE=YES` to ensure that `LAG` operations are performed correctly on a single thread in the distributed CAS environment. Missing values are imputed using the last known value within each `Service`. If no previous value is available (e.g., for the first observation of a service with a missing value), a default value of 30 is used. This case highlights the robustness of the imputation logic and the handling of edge cases.

Copied!

1	LIBNAME mycas cas;
2
3	DATA mycas.temps_reponse;
4	INPUT Service $ Date : yymmdd. TempsReponse;
5	FORMAT Date yymmdd10.;
6	DATALINES;
7	Email 20230101 15
8	Email 20230102 20
9	Email 20230103 .
10	Email 20230104 .
11	Email 20230105 25
12	Chat 20230101 5
13	Chat 20230102 .
14	Chat 20230103 8
15	;
16	RUN;
17
18	DATA mycas.temps_reponse_impute (sessref="casauto" single=yes);
19	SET mycas.temps_reponse;
20	BY Service;
21	retain last_temps_reponse_service;
22
23	IF first.Service THEN DO;
24	lag_temps_reponse = .;
25	last_temps_reponse_service = .;
26	END;
27	ELSE lag_temps_reponse = lag(TempsReponse);
28
29	IF TempsReponse = . THEN DO;
30	IF lag_temps_reponse ne . THEN TempsReponse_impute = lag_temps_reponse;
31	ELSE IF last_temps_reponse_service ne . THEN TempsReponse_impute = last_temps_reponse_service;
32	ELSE TempsReponse_impute = 30; /* Imputation par défaut à 30 si aucune valeur précédente */
33	END;
34	ELSE DO;
35	TempsReponse_impute = TempsReponse;
36	last_temps_reponse_service = TempsReponse;
37	END;
38	RUN;
39
40	PROC PRINT DATA=mycas.temps_reponse_impute;
41	title "Exemple 4: Imputation de Temps de Réponse Manquants dans CAS";
42	RUN;
43
44	PROC CAS;
45	TABLE.dropTable / caslib="mycas" name="temps_reponse";
46	TABLE.dropTable / caslib="mycas" name="temps_reponse_impute";
47	RUN;
48

Pro Tip

If your dataset is too large for single-threaded processing, consider using a CAS Data Step with a partition by clause. This allows you to run multiple threads in parallel—one for each group—as long as your imputation logic stays within the boundaries of those groups.

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Difficulty Level

Published on : 07/05/2021

Expert Advice

Data Analysis

1 Code Block

2 Code Block

3 Code Block

4 Code Block

Pro Tip

Related Documentation