Statistics CREATION_INTERNE

Causal Effect Estimation by Doubly Robust Methods with PROC CAEFFECT

This document describes how to apply two doubly robust methods, AIPW (augmented inverse probability weighting) and TMLE (targeted maximum likelihood estimation), with the CAEFFECT procedure. Doubly robust estimators remain consistent if either the treatment model or the outcome model is correctly specified; both need not be. The examples use `PROC LOGSELECT` to model the probability of treatment (here, smoking cessation) and `PROC BART` to model the outcome (weight change). The treatment probabilities (`pTrt`, `pCnt`) and the stored outcome model, whose predictions are named `P_Change`, are then passed to `PROC CAEFFECT`. The procedure estimates the potential outcome mean (POM) for each treatment level and takes the difference between the POMs as the estimate of the average treatment effect (ATE).
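As a rough guide to what the doubly robust property rests on, the textbook AIPW estimator of a potential outcome mean can be written as below. The notation is an assumption introduced here rather than taken from the procedure's documentation: $Y_i$ is the outcome (`Change`), $T_i$ the treatment (`Quit`), $X_i$ the covariates, $\hat{e}_t(X_i)$ the estimated probability of receiving level $t$ (`pTrt` for $t=1$, `pCnt` for $t=0$), and $\hat{m}_t(X_i)$ the outcome model's prediction under level $t$. `PROC CAEFFECT` may differ in details such as weight normalization.

```latex
\hat{\mu}_t^{\mathrm{AIPW}}
  = \frac{1}{n}\sum_{i=1}^{n}
    \left[ \hat{m}_t(X_i)
         + \frac{\mathbf{1}\{T_i = t\}}{\hat{e}_t(X_i)}
           \bigl( Y_i - \hat{m}_t(X_i) \bigr) \right],
\qquad
\widehat{\mathrm{ATE}} = \hat{\mu}_1^{\mathrm{AIPW}} - \hat{\mu}_0^{\mathrm{AIPW}}
```

If the outcome model is correct, the weighted residual term has mean zero; if the treatment model is correct, that term removes the bias of a misspecified $\hat{m}_t$. Either way the estimator stays consistent.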
Data Analysis

Type : CREATION_INTERNE


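Each example generates its own synthetic data in a `DATA` step, so it can be run on its own. Examples 1 to 3 read ten observations from `datalines`, and Example 4 simulates 10,000 observations. In every case the data are written to the `SmokingWeight` table in a CAS session.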

1 Code Block
PROC CAEFFECT Data
Explanation :
This example illustrates basic causal effect estimation with the AIPW (augmented inverse probability weighting) method.
1. **Data Preparation:** A `mylib.SmokingWeight` table is created in a CAS session from synthetic data. With only ten observations the fitted models are purely illustrative; the point is the workflow.
2. **Treatment Model:** `PROC LOGSELECT` models the probability of quitting smoking (`Quit=1`) from the covariates and writes it to `pTrt`. A short `DATA` step then computes `pCnt = 1 - pTrt`, the probability of not quitting.
3. **Outcome Model:** `PROC BART` (Bayesian additive regression trees) models the weight change (`Change`) from the covariates and the smoking status (`Quit`), and stores the fitted model in `mylib.bartOutMod`.
4. **PROC CAEFFECT:** The `CAEFFECT` procedure is called with `METHOD=AIPW`, the treatment variable (`Quit`), the outcome variable (`Change`), and the stored outcome model (`mylib.bartOutMod`). Each `POM` statement requests the potential outcome mean for one treatment level (quitting or not) and names the variable that holds the probability of that level. The `DIFFERENCE` statement computes the difference between the potential outcome means.
/* Start a CAS session if one does not already exist */
cas;

/* Make sure the mylib caslib is available and assigned */
caslib _all_ assign;

/* Create the SmokingWeight data table */
DATA mylib.SmokingWeight;
   /* Without LENGTH, list input truncates character values to 8 bytes */
   LENGTH Sex Race Education Exercise Activity $ 10;
   INFILE DATALINES;
   INPUT Sex $ Race $ Education $ Exercise $ Activity $ Age YearsSmoke PerDay QUIT Change;
   DATALINES;
Female White HighSchool Light Active 30 10 15 0 2.5
Male Black College Moderate Sedentary 45 20 20 1 5.0
Female Asian GradSchool Heavy Active 55 30 25 1 7.2
Male White HighSchool Light Sedentary 35 12 10 0 1.8
Female Black College Moderate Active 40 18 18 1 4.5
Male Asian GradSchool Heavy Sedentary 60 25 22 0 3.1
Female White HighSchool Light Active 28 8 12 1 6.0
Male Black College Moderate Sedentary 50 22 19 0 2.0
Female Asian GradSchool Heavy Active 48 28 24 1 8.0
Male White HighSchool Light Sedentary 32 10 11 0 1.5
;
RUN;

/* 1. Model the treatment variable 'Quit' with PROC LOGSELECT */
PROC LOGSELECT DATA=mylib.SmokingWeight;
   class Sex Race Education Exercise Activity;
   model QUIT(Event='1') = Sex Race Education
         Exercise Activity Age YearsSmoke PerDay;
   OUTPUT out=mylib.swDREstData pred=pTrt copyvars=(_ALL_);
RUN;

/* Compute the probability of the control condition (Quit=0) */
DATA mylib.swDREstData;
   SET mylib.swDREstData;
   pCnt = 1 - pTrt;
RUN;

/* 2. Model the outcome variable 'Change' with PROC BART */
PROC BART DATA=mylib.swDREstData nTree=100 nMC=200 seed=2156;
   class Sex Race Education Exercise Activity QUIT;
   model Change = Sex Race Education Exercise QUIT Activity
         Age YearsSmoke PerDay;
   store out=mylib.bartOutMod;
RUN;

/* 3. Estimate the causal effects with PROC CAEFFECT (AIPW) */
PROC CAEFFECT DATA=mylib.swDREstData method=aipw;
   treatvar QUIT;
   outcomevar Change;
   outcomemodel restore=mylib.bartOutMod predName=P_Change;
   pom treatLev=1 treatProb=pTrt;   /* POM if everyone quits */
   pom treatLev=0 treatProb=pCnt;   /* POM if no one quits */
   difference evtLev=1;             /* difference between the POMs */
RUN;

2 Code Block
PROC CAEFFECT Data
Explanation :
This example uses the TMLE (targeted maximum likelihood estimation) method, an alternative to AIPW, for causal effect estimation.
1. **Data preparation and intermediate models:** Data creation, treatment modeling with `PROC LOGSELECT`, and outcome modeling with `PROC BART` are identical to Example 1.
2. **PROC CAEFFECT with TMLE:** The only change is `method=tmle` in the `PROC CAEFFECT` statement. TMLE adjusts the outcome model's predictions in a targeting step that uses the treatment probabilities, and then averages the adjusted predictions to estimate each POM. Like AIPW it is doubly robust, and in some settings it can be more stable or more efficient. The two methods generally give very similar results, as the documentation example shows.
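The targeting step can be sketched in one common textbook form, a linear fluctuation for a continuous outcome. This is an illustration under assumed notation ($Y_i$ outcome, $T_i$ treatment, $X_i$ covariates, $\hat{e}_t$ treatment probability, $\hat{m}_t$ initial outcome prediction), not a description of what `PROC CAEFFECT` does internally; implementations often use a logistic fluctuation on a rescaled outcome instead.

```latex
% Clever covariate for treatment level t
H_t(T_i, X_i) = \frac{\mathbf{1}\{T_i = t\}}{\hat{e}_t(X_i)}

% Fluctuation parameter: least-squares fit of the residual on H_t
\hat{\varepsilon}_t
  = \frac{\sum_{i=1}^{n} H_t(T_i, X_i)\,\bigl( Y_i - \hat{m}_t(X_i) \bigr)}
         {\sum_{i=1}^{n} H_t(T_i, X_i)^2}

% Updated prediction under level t, and the resulting POM estimate
\hat{m}_t^{*}(X_i) = \hat{m}_t(X_i) + \frac{\hat{\varepsilon}_t}{\hat{e}_t(X_i)},
\qquad
\hat{\mu}_t^{\mathrm{TMLE}} = \frac{1}{n}\sum_{i=1}^{n} \hat{m}_t^{*}(X_i)
```

After the update, the weighted residuals $\sum_i H_t(T_i, X_i)\,(Y_i - \hat{m}_t^{*}(X_i))$ sum to zero, which is the condition that gives the plug-in estimate its doubly robust behavior.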
/* Start a CAS session if one does not already exist */
cas;

/* Make sure the mylib caslib is available and assigned */
caslib _all_ assign;

/* Create the SmokingWeight data table */
DATA mylib.SmokingWeight;
   /* Without LENGTH, list input truncates character values to 8 bytes */
   LENGTH Sex Race Education Exercise Activity $ 10;
   INFILE DATALINES;
   INPUT Sex $ Race $ Education $ Exercise $ Activity $ Age YearsSmoke PerDay QUIT Change;
   DATALINES;
Female White HighSchool Light Active 30 10 15 0 2.5
Male Black College Moderate Sedentary 45 20 20 1 5.0
Female Asian GradSchool Heavy Active 55 30 25 1 7.2
Male White HighSchool Light Sedentary 35 12 10 0 1.8
Female Black College Moderate Active 40 18 18 1 4.5
Male Asian GradSchool Heavy Sedentary 60 25 22 0 3.1
Female White HighSchool Light Active 28 8 12 1 6.0
Male Black College Moderate Sedentary 50 22 19 0 2.0
Female Asian GradSchool Heavy Active 48 28 24 1 8.0
Male White HighSchool Light Sedentary 32 10 11 0 1.5
;
RUN;

/* 1. Model the treatment variable 'Quit' with PROC LOGSELECT */
PROC LOGSELECT DATA=mylib.SmokingWeight;
   class Sex Race Education Exercise Activity;
   model QUIT(Event='1') = Sex Race Education
         Exercise Activity Age YearsSmoke PerDay;
   OUTPUT out=mylib.swDREstData pred=pTrt copyvars=(_ALL_);
RUN;

/* Compute the probability of the control condition (Quit=0) */
DATA mylib.swDREstData;
   SET mylib.swDREstData;
   pCnt = 1 - pTrt;
RUN;

/* 2. Model the outcome variable 'Change' with PROC BART */
PROC BART DATA=mylib.swDREstData nTree=100 nMC=200 seed=2156;
   class Sex Race Education Exercise Activity QUIT;
   model Change = Sex Race Education Exercise QUIT Activity
         Age YearsSmoke PerDay;
   store out=mylib.bartOutMod;
RUN;

/* 3. Estimate the causal effects with PROC CAEFFECT (TMLE) */
PROC CAEFFECT DATA=mylib.swDREstData method=tmle;
   treatvar QUIT;
   outcomevar Change;
   outcomemodel restore=mylib.bartOutMod predName=P_Change;
   pom treatLev=1 treatProb=pTrt;   /* POM if everyone quits */
   pom treatLev=0 treatProb=pCnt;   /* POM if no one quits */
   difference evtLev=1;             /* difference between the POMs */
RUN;

3 Code Block
PROC CAEFFECT Data
Explanation :
This advanced example illustrates a subgroup analysis with the AIPW method and saves the results to output tables. The goal is to see whether the effect of the treatment (smoking cessation) on weight change varies by sex.
1. **Data preparation and intermediate models:** The steps are the same as in the previous examples.
2. **Subgroup Analysis:** The `CLASS Sex;` statement declares `Sex` as a classification variable, and the `BY Sex;` statement asks `PROC CAEFFECT` to repeat the analysis for each level of `Sex` (Male and Female). Check the `PROC CAEFFECT` documentation for your release to confirm that both statements are supported.
3. **Output Options:** The `ODS OUTPUT` statement saves the estimates table (`Estimates`) and the differences table (`OutDiff`) produced by the procedure to new CAS tables (`mylib.aipwEstimates` and `mylib.aipwDifferences`), so the results can be analyzed further or used in other reports. The ODS table names are those given in this example; `ODS TRACE ON` writes the actual names to the log if they differ in your release.
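One way to confirm the ODS table names before relying on `ODS OUTPUT` is to run the procedure once with ODS tracing enabled. The sketch below assumes that `mylib.swDREstData` and `mylib.bartOutMod` already exist from the preparation steps; the name of each table the procedure produces is written to the SAS log.

```sas
/* Write the name of every ODS table produced below to the SAS log */
ods trace on;

proc caeffect data=mylib.swDREstData method=aipw;
   treatvar Quit;
   outcomevar Change;
   outcomemodel restore=mylib.bartOutMod predName=P_Change;
   pom treatLev=1 treatProb=pTrt;
   pom treatLev=0 treatProb=pCnt;
   difference evtLev=1;
run;

/* Stop tracing so later steps do not clutter the log */
ods trace off;
```

The names reported in the log are the ones to use on the left-hand side of each `ODS OUTPUT` assignment.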
/* Start a CAS session if one does not already exist */
cas;

/* Make sure the mylib caslib is available and assigned */
caslib _all_ assign;

/* Create the SmokingWeight data table */
DATA mylib.SmokingWeight;
   /* Without LENGTH, list input truncates character values to 8 bytes */
   LENGTH Sex Race Education Exercise Activity $ 10;
   INFILE DATALINES;
   INPUT Sex $ Race $ Education $ Exercise $ Activity $ Age YearsSmoke PerDay QUIT Change;
   DATALINES;
Female White HighSchool Light Active 30 10 15 0 2.5
Male Black College Moderate Sedentary 45 20 20 1 5.0
Female Asian GradSchool Heavy Active 55 30 25 1 7.2
Male White HighSchool Light Sedentary 35 12 10 0 1.8
Female Black College Moderate Active 40 18 18 1 4.5
Male Asian GradSchool Heavy Sedentary 60 25 22 0 3.1
Female White HighSchool Light Active 28 8 12 1 6.0
Male Black College Moderate Sedentary 50 22 19 0 2.0
Female Asian GradSchool Heavy Active 48 28 24 1 8.0
Male White HighSchool Light Sedentary 32 10 11 0 1.5
;
RUN;

/* 1. Model the treatment variable 'Quit' with PROC LOGSELECT */
PROC LOGSELECT DATA=mylib.SmokingWeight;
   class Sex Race Education Exercise Activity;
   model QUIT(Event='1') = Sex Race Education
         Exercise Activity Age YearsSmoke PerDay;
   OUTPUT out=mylib.swDREstData pred=pTrt copyvars=(_ALL_);
RUN;

/* Compute the probability of the control condition (Quit=0) */
DATA mylib.swDREstData;
   SET mylib.swDREstData;
   pCnt = 1 - pTrt;
RUN;

/* 2. Model the outcome variable 'Change' with PROC BART */
PROC BART DATA=mylib.swDREstData nTree=100 nMC=200 seed=2156;
   class Sex Race Education Exercise Activity QUIT;
   model Change = Sex Race Education Exercise QUIT Activity
         Age YearsSmoke PerDay;
   store out=mylib.bartOutMod;
RUN;

/* 3. Estimate the causal effects with PROC CAEFFECT (AIPW) by subgroup */
PROC CAEFFECT DATA=mylib.swDREstData method=aipw;
   class Sex;   /* classification variable for the by-group analysis */
   treatvar QUIT;
   outcomevar Change;
   outcomemodel restore=mylib.bartOutMod predName=P_Change;
   pom treatLev=1 treatProb=pTrt;
   pom treatLev=0 treatProb=pCnt;
   difference evtLev=1;
   BY Sex;      /* separate analysis for each sex */
   /* Save the results; confirm the ODS table names with ODS TRACE ON */
   ods OUTPUT Estimates=mylib.aipwEstimates OutDiff=mylib.aipwDifferences;
RUN;

4 Code Block
PROC CAEFFECT Data
Explanation :
This example focuses on running the workflow in a SAS Viya/CAS environment on a larger dataset, using the TMLE method.
1. **CAS Session Management and Cleanup:** The `cas;` and `caslib _all_ assign;` statements ensure that a CAS session is active and that the caslib is assigned. `PROC CASUTIL` drops any leftover CAS tables before new data are created.
2. **Large-Scale Data Generation:** A `mylib.SmokingWeight` table with 10,000 observations is simulated. It is a CAS table because `mylib` is a CAS libref; the `promote=yes` data set option additionally gives it global scope, so that other CAS sessions can see it.
3. **Procedure Options:** `PROC LOGSELECT` is called with `nthreads=4` to request parallel threads, and `PROC BART` uses more trees (`nTree=200`) and more Monte Carlo iterations (`nMC=500`) than the earlier examples, to give the outcome model more flexibility on the larger dataset.
4. **PROC CAEFFECT:** The `CAEFFECT` procedure is then run with `METHOD=TMLE`, showing that the same statements apply unchanged to larger data volumes in CAS.
/* Start a CAS session if one does not already exist */
cas;

/* Make sure the mylib caslib is available and assigned */
caslib _all_ assign;

/* Drop existing tables to ensure a clean restart */
PROC CASUTIL;
   droptable casdata='SmokingWeight' incaslib='mylib' quiet;
   droptable casdata='swDREstData' incaslib='mylib' quiet;
   droptable casdata='bartOutMod' incaslib='mylib' quiet;
RUN;

/* Create a larger simulated dataset */
DATA mylib.SmokingWeight(promote=yes);
   /* Declare lengths so that values such as 'Moderate' and 'Sedentary' are not truncated */
   length Sex $ 6 Race $ 5 Education $ 10 Exercise $ 8 Activity $ 9;
   /* Seed the random-number stream once, before the loop */
   call streaminit(12345);
   DO i = 1 to 10000; /* Simulate 10,000 observations */
      IF rand('UNIFORM') < 0.5 THEN Sex = 'Female'; ELSE Sex = 'Male';
      IF rand('UNIFORM') < 0.7 THEN Race = 'White'; ELSE Race = 'Black';
      IF rand('UNIFORM') < 0.6 THEN Education = 'HighSchool'; ELSE Education = 'College';
      IF rand('UNIFORM') < 0.5 THEN Exercise = 'Light'; ELSE Exercise = 'Moderate';
      IF rand('UNIFORM') < 0.5 THEN Activity = 'Active'; ELSE Activity = 'Sedentary';
      Age = round(rand('NORMAL', 40, 10));
      YearsSmoke = round(rand('NORMAL', 15, 5));
      PerDay = round(rand('NORMAL', 15, 5));
      /* Treatment is assigned by a fair coin flip, independently of the covariates */
      IF rand('UNIFORM') < 0.5 THEN QUIT = 0; ELSE QUIT = 1;
      Change = rand('NORMAL', 3, 1.5) + (QUIT * rand('NORMAL', 2, 0.5)) - (PerDay * 0.1);
      OUTPUT;
   END;
   drop i;
   label QUIT = "Smoking cessation (0=No, 1=Yes)";
   label Change = "Weight change (kg)";
RUN;

/* 1. Model the treatment variable 'Quit' with PROC LOGSELECT */
/* NTHREADS is used here to request parallel threads, as in the original example */
PROC LOGSELECT DATA=mylib.SmokingWeight nthreads=4;
   class Sex Race Education Exercise Activity;
   model QUIT(Event='1') = Sex Race Education
         Exercise Activity Age YearsSmoke PerDay;
   OUTPUT out=mylib.swDREstData pred=pTrt copyvars=(_ALL_);
RUN;

/* Compute the probability of the control condition (Quit=0) */
DATA mylib.swDREstData;
   SET mylib.swDREstData;
   pCnt = 1 - pTrt;
RUN;

/* 2. Model the outcome variable 'Change' with PROC BART */
/* Larger nTree and nMC values for the larger dataset */
PROC BART DATA=mylib.swDREstData nTree=200 nMC=500 seed=2156;
   class Sex Race Education Exercise Activity QUIT;
   model Change = Sex Race Education Exercise QUIT Activity
         Age YearsSmoke PerDay;
   store out=mylib.bartOutMod;
RUN;

/* 3. Estimate the causal effects with PROC CAEFFECT (TMLE) on the large dataset */
PROC CAEFFECT DATA=mylib.swDREstData method=tmle;
   treatvar QUIT;
   outcomevar Change;
   outcomemodel restore=mylib.bartOutMod predName=P_Change;
   pom treatLev=1 treatProb=pTrt;
   pom treatLev=0 treatProb=pCnt;
   difference evtLev=1;
RUN;

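Because the simulation assigns `QUIT` by a fair coin flip and adds a Normal(2, 0.5) term to `Change` only when `QUIT = 1`, the true average treatment effect in this table is 2 by construction. A simple sanity check, sketched here with `PROC MEANS`, is to compare the unadjusted group means: under random assignment their difference should also be close to 2 and should roughly agree with the `PROC CAEFFECT` estimate.

```sas
/* Unadjusted mean weight change by treatment group in the simulated table */
proc means data=mylib.SmokingWeight n mean std;
   class Quit;
   var Change;
run;
```

With real observational data the unadjusted difference is generally biased by confounding, which is the reason for the treatment and outcome models in the first place.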
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All Rights Reserved


Expert Advice
Stéphanie
Machine Learning and AI Specialist.
« Always check the distribution of your propensity scores (`pTrt`). If the scores are heavily clustered near 0 or 1, the inverse probability weights used in doubly robust methods can become unstable. In such cases, TMLE is often the more stable choice because it is a plug-in estimator that averages updated model predictions, although extreme weights can still affect it. »
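A minimal way to follow this advice, assuming the `mylib.swDREstData` table from the examples above, is to summarize `pTrt` within each treatment group. The choice of statistics is only a suggestion.

```sas
/* Summarize the estimated propensity scores within each treatment group */
proc means data=mylib.swDREstData n min p5 median p95 max;
   class Quit;
   var pTrt;
run;
```

Minimum or maximum values very close to 0 or 1, or score ranges that barely overlap between the two groups, are signs that the weights may be unstable.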