If your dataset is too large for single-threaded processing, consider using a CAS Data Step with a partition by clause. This allows you to run multiple threads in parallel—one for each group—as long as your imputation logic stays within the boundaries of those groups.
The examples use generated data (datalines) to ensure autonomy and reproducibility.
1 Code Block
DATA STEP Data
Explanation : This example illustrates a simple imputation. If sales for a year are missing, they are imputed by increasing the previous year's sales by 5%. The `last_ventes` variable is used to retain the last known value, which is essential if the LAG-based imputation also encounters a missing value. The `single=yes` option is specified to ensure sequential processing on a single CAS node, which is necessary for the `LAG` function in a distributed environment.
Copied!
libname mycas cas;
data mycas.ventes_annuelles;
input Annee Ventes;
datalines;
2020 1000
2021 1100
2022 .
2023 1300
2024 .
;
run;
data mycas.ventes_imputees (sessref="casauto" single=yes);
set mycas.ventes_annuelles;
retain last_ventes;
if Ventes = . then do;
lag_ventes = lag(Ventes);
if lag_ventes ne . then Ventes_imputees = lag_ventes * 1.05; /* Imputation avec +5% de l'année précédente */
else Ventes_imputees = last_ventes * 1.05; /* Utilise la dernière vente connue si LAG est aussi manquant */
end;
else do;
Ventes_imputees = Ventes;
last_ventes = Ventes;
end;
run;
proc print data=mycas.ventes_imputees;
title "Exemple 1: Imputation Basique des Ventes Manquantes";
run;
proc cas;
table.dropTable / caslib="mycas" name="ventes_annuelles";
table.dropTable / caslib="mycas" name="ventes_imputees";
run;
1
LIBNAME mycas cas;
2
3
DATA mycas.ventes_annuelles;
4
INPUT Annee Ventes;
5
DATALINES;
6
20201000
7
20211100
8
2022 .
9
20231300
10
2024 .
11
;
12
RUN;
13
14
DATA mycas.ventes_imputees (sessref="casauto" single=yes);
15
SET mycas.ventes_annuelles;
16
retain last_ventes;
17
IF Ventes = . THENDO;
18
lag_ventes = lag(Ventes);
19
IF lag_ventes ne . THEN Ventes_imputees = lag_ventes * 1.05; /* Imputation avec +5% de l'année précédente */
20
ELSE Ventes_imputees = last_ventes * 1.05; /* Utilise la dernière vente connue si LAG est aussi manquant */
21
END;
22
ELSEDO;
23
Ventes_imputees = Ventes;
24
last_ventes = Ventes;
25
END;
26
RUN;
27
28
PROC PRINTDATA=mycas.ventes_imputees;
29
title "Exemple 1: Imputation Basique des Ventes Manquantes";
Explanation : This advanced example uses both `LAG` and `LAG2` to handle consecutive missing values. For each product (defined by `BY Produit`), if the quantity is missing, the script first attempts to impute it with the quantity from the previous month (increased by 2%). If the quantity from the previous month is also missing, it uses the quantity from two months prior (increased by 4%). If no previous value is available, the quantity is imputed as 0. Initializing `lag_quantite` and `lag2_quantite` for each new `Produit` group is crucial for correct calculations.
Copied!
libname mycas cas;
data mycas.stock_produit;
input Produit $ Mois Quantite;
datalines;
A 1 100
A 2 110
A 3 .
A 4 .
A 5 130
B 1 200
B 2 .
B 3 220
B 4 230
;
run;
data mycas.stock_impute (sessref="casauto" single=yes);
set mycas.stock_produit;
by Produit;
if first.Produit then do;
lag_quantite = .;
lag2_quantite = .;
end;
else do;
lag_quantite = lag(Quantite);
lag2_quantite = lag2(Quantite);
end;
if Quantite = . then do;
if lag_quantite ne . then Quantite_imputee = lag_quantite * 1.02; /* +2% de la valeur précédente */
else if lag2_quantite ne . then Quantite_imputee = lag2_quantite * 1.04; /* +4% de la valeur d'avant l'année précédente */
else Quantite_imputee = 0; /* Si aucune valeur précédente n'est disponible */
end;
else Quantite_imputee = Quantite;
run;
proc print data=mycas.stock_impute;
title "Exemple 2: Imputation de Quantités Manquantes avec LAG et LAG2";
run;
proc cas;
table.dropTable / caslib="mycas" name="stock_produit";
table.dropTable / caslib="mycas" name="stock_impute";
run;
1
LIBNAME mycas cas;
2
3
DATA mycas.stock_produit;
4
INPUT Produit $ Mois Quantite;
5
DATALINES;
6
A 1100
7
A 2110
8
A 3 .
9
A 4 .
10
A 5130
11
B 1200
12
B 2 .
13
B 3220
14
B 4230
15
;
16
RUN;
17
18
DATA mycas.stock_impute (sessref="casauto" single=yes);
19
SET mycas.stock_produit;
20
BY Produit;
21
IF first.Produit THENDO;
22
lag_quantite = .;
23
lag2_quantite = .;
24
END;
25
ELSEDO;
26
lag_quantite = lag(Quantite);
27
lag2_quantite = lag2(Quantite);
28
END;
29
30
IF Quantite = . THENDO;
31
IF lag_quantite ne . THEN Quantite_imputee = lag_quantite * 1.02; /* +2% de la valeur précédente */
32
ELSEIF lag2_quantite ne . THEN Quantite_imputee = lag2_quantite * 1.04; /* +4% de la valeur d'avant l'année précédente */
33
ELSE Quantite_imputee = 0; /* Si aucune valeur précédente n'est disponible */
34
END;
35
ELSE Quantite_imputee = Quantite;
36
RUN;
37
38
PROC PRINTDATA=mycas.stock_impute;
39
title "Exemple 2: Imputation de Quantités Manquantes avec LAG et LAG2";
Explanation : This example implements advanced imputation with BY-group processing (`BY Departement Annee`). For each department and year, missing 'Reclamations' values are imputed with the previous month's value. If the previous value is also missing, the last non-missing value within the same group is used. `RETAIN` and `FIRST.BYVAR` are used to manage state across groups and reset `LAG` at the beginning of each new group. This is particularly useful for panel data where sequences must be respected by group.
Copied!
libname mycas cas;
data mycas.reclamations_mensuelles;
input Departement $ Annee Mois Reclamations;
datalines;
IT 2022 1 10
IT 2022 2 12
IT 2022 3 .
IT 2022 4 15
IT 2023 1 .
IT 2023 2 18
HR 2022 1 20
HR 2022 2 .
HR 2022 3 25
HR 2023 1 22
HR 2023 2 .
HR 2023 3 .
HR 2023 4 30
;
run;
data mycas.reclamations_imputees (sessref="casauto" single=yes);
set mycas.reclamations_mensuelles;
by Departement Annee;
retain last_reclamations_in_group;
if first.Departement or first.Annee then do;
lag_reclamations = .;
last_reclamations_in_group = .;
end;
else lag_reclamations = lag(Reclamations);
if Reclamations = . then do;
if lag_reclamations ne . then Reclamations_imputees = lag_reclamations;
else if last_reclamations_in_group ne . then Reclamations_imputees = last_reclamations_in_group;
else Reclamations_imputees = 0; /* Imputation par 0 si aucune valeur précédente */
end;
else do;
Reclamations_imputees = Reclamations;
last_reclamations_in_group = Reclamations;
end;
run;
proc print data=mycas.reclamations_imputees;
title "Exemple 3: Imputation de Réclamations Manquantes par Département et Année";
run;
proc cas;
table.dropTable / caslib="mycas" name="reclamations_mensuelles";
table.dropTable / caslib="mycas" name="reclamations_imputees";
run;
1
LIBNAME mycas cas;
2
3
DATA mycas.reclamations_mensuelles;
4
INPUT Departement $ Annee Mois Reclamations;
5
DATALINES;
6
IT 2022110
7
IT 2022212
8
IT 20223 .
9
IT 2022415
10
IT 20231 .
11
IT 2023218
12
HR 2022120
13
HR 20222 .
14
HR 2022325
15
HR 2023122
16
HR 20232 .
17
HR 20233 .
18
HR 2023430
19
;
20
RUN;
21
22
DATA mycas.reclamations_imputees (sessref="casauto" single=yes);
23
SET mycas.reclamations_mensuelles;
24
BY Departement Annee;
25
retain last_reclamations_in_group;
26
27
IF first.Departement or first.Annee THENDO;
28
lag_reclamations = .;
29
last_reclamations_in_group = .;
30
END;
31
ELSE lag_reclamations = lag(Reclamations);
32
33
IF Reclamations = . THENDO;
34
IF lag_reclamations ne . THEN Reclamations_imputees = lag_reclamations;
35
ELSEIF last_reclamations_in_group ne . THEN Reclamations_imputees = last_reclamations_in_group;
36
ELSE Reclamations_imputees = 0; /* Imputation par 0 si aucune valeur précédente */
37
END;
38
ELSEDO;
39
Reclamations_imputees = Reclamations;
40
last_reclamations_in_group = Reclamations;
41
END;
42
RUN;
43
44
PROC PRINTDATA=mycas.reclamations_imputees;
45
title "Exemple 3: Imputation de Réclamations Manquantes par Département et Année";
Explanation : This example demonstrates integration in SAS Viya/CAS for response time imputation. It uses `SESSREF="casauto" SINGLE=YES` to ensure that `LAG` operations are performed correctly on a single thread in the distributed CAS environment. Missing values are imputed using the last known value within each `Service`. If no previous value is available (e.g., for the first observation of a service with a missing value), a default value of 30 is used. This case highlights the robustness of the imputation logic and the handling of edge cases.
Copied!
libname mycas cas;
data mycas.temps_reponse;
input Service $ Date : yymmdd. TempsReponse;
format Date yymmdd10.;
datalines;
Email 20230101 15
Email 20230102 20
Email 20230103 .
Email 20230104 .
Email 20230105 25
Chat 20230101 5
Chat 20230102 .
Chat 20230103 8
;
run;
data mycas.temps_reponse_impute (sessref="casauto" single=yes);
set mycas.temps_reponse;
by Service;
retain last_temps_reponse_service;
if first.Service then do;
lag_temps_reponse = .;
last_temps_reponse_service = .;
end;
else lag_temps_reponse = lag(TempsReponse);
if TempsReponse = . then do;
if lag_temps_reponse ne . then TempsReponse_impute = lag_temps_reponse;
else if last_temps_reponse_service ne . then TempsReponse_impute = last_temps_reponse_service;
else TempsReponse_impute = 30; /* Imputation par défaut à 30 si aucune valeur précédente */
end;
else do;
TempsReponse_impute = TempsReponse;
last_temps_reponse_service = TempsReponse;
end;
run;
proc print data=mycas.temps_reponse_impute;
title "Exemple 4: Imputation de Temps de Réponse Manquants dans CAS";
run;
proc cas;
table.dropTable / caslib="mycas" name="temps_reponse";
table.dropTable / caslib="mycas" name="temps_reponse_impute";
run;
1
LIBNAME mycas cas;
2
3
DATA mycas.temps_reponse;
4
INPUT Service $ Date : yymmdd. TempsReponse;
5
FORMAT Date yymmdd10.;
6
DATALINES;
7
Email 2023010115
8
Email 2023010220
9
Email 20230103 .
10
Email 20230104 .
11
Email 2023010525
12
Chat 202301015
13
Chat 20230102 .
14
Chat 202301038
15
;
16
RUN;
17
18
DATA mycas.temps_reponse_impute (sessref="casauto" single=yes);
19
SET mycas.temps_reponse;
20
BY Service;
21
retain last_temps_reponse_service;
22
23
IF first.Service THENDO;
24
lag_temps_reponse = .;
25
last_temps_reponse_service = .;
26
END;
27
ELSE lag_temps_reponse = lag(TempsReponse);
28
29
IF TempsReponse = . THENDO;
30
IF lag_temps_reponse ne . THEN TempsReponse_impute = lag_temps_reponse;
31
ELSEIF last_temps_reponse_service ne . THEN TempsReponse_impute = last_temps_reponse_service;
32
ELSE TempsReponse_impute = 30; /* Imputation par défaut à 30 si aucune valeur précédente */
33
END;
34
ELSEDO;
35
TempsReponse_impute = TempsReponse;
36
last_temps_reponse_service = TempsReponse;
37
END;
38
RUN;
39
40
PROC PRINTDATA=mycas.temps_reponse_impute;
41
title "Exemple 4: Imputation de Temps de Réponse Manquants dans CAS";
If your dataset is too large for single-threaded processing, consider using a CAS Data Step with a partition by clause. This allows you to run multiple threads in parallel—one for each group—as long as your imputation logic stays within the boundaries of those groups.
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. WeAreCAS is an independent community site and is not affiliated with SAS Institute Inc.
This site uses technical and analytical cookies to improve your experience.
Read more.