Probit Model for Binary Data with PROC BART

This example shows how to use the BART procedure to model a binary response. It uses an email data set to predict whether a message is spam (the 'Class' variable). The data are partitioned into training and test sets by the 'Test' variable, and a binary distribution is specified for the response. The results include fit statistics and a classification matrix for evaluating model performance. BART (Bayesian additive regression trees) is a nonparametric Bayesian method that models the response with a sum of many small decision trees, which makes it well suited to capturing nonlinear relationships and interactions.
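As a rough illustration of what "probit" means here: the model passes a latent score through the standard normal CDF to obtain a probability. The sketch below applies that link to a few made-up scores. The variable `f` is a hypothetical stand-in for the sum-of-trees value and is not produced by PROC BART.

```sas
/* Illustrative only: map made-up latent scores to probabilities with the probit link */
data _null_;
   do f = -2 to 2 by 1;          /* f is a hypothetical sum-of-trees score */
      p = cdf('NORMAL', f);      /* P(Class=1 | x) = Phi(f) */
      put f= p= 6.4;             /* write each pair to the SAS log */
   end;
run;
```

Scores near 0 map to probabilities near 0.5, while large positive or negative scores map to probabilities close to 1 or 0.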

Data Analysis

The examples use the SASHELP.JunkMail data set. Each code block reloads the data so that it can be run on its own.

1 Code Block

PROC BART Data

Explanation:
This example shows a minimal use of PROC BART for a binary probit model. It loads the SASHELP.JunkMail data set into a CAS library ('mylib'), specifies the response variable 'Class' with a binary distribution ('dist=binary'), and uses a reduced set of predictors. The PARTITION statement splits the data into training and test sets according to the 'Test' variable.

/* Prepare the data in CAS */
cas;
LIBNAME mylib cas;

DATA mylib.JunkMail;
   SET Sashelp.JunkMail;
RUN;

/* Basic use of PROC BART for a binary probit model */
PROC BART DATA=mylib.JunkMail seed=12345;
   model Class = Make Address All _3d Our Internet Order / dist=binary;
   partition rolevar=Test(train='0' test='1');
RUN;
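Before fitting, it can help to check how many observations fall in each partition and how the classes are balanced within them. A minimal sketch, assuming 'Test' is coded 0 for training and 1 for test as in the PARTITION statement above:

```sas
/* Count observations by partition role (Test) and response (Class) */
proc freq data=Sashelp.JunkMail;
   tables Test*Class / nopercent nocol;
run;
```

The row percentages show the share of spam in each partition, which is a useful baseline when reading the classification matrix later.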

2 Code Block

PROC BART Data

Explanation:
This example extends the use of PROC BART with common MCMC settings, given as options in the PROC BART statement: the number of burn-in iterations (NBI=1000) and the number of posterior samples to keep (NMC=5000). The OUTPUT statement saves the predictions (PRED=ProbSpam, taken here to be the predicted probability for a binary response) to a new CAS table together with the copied 'Class' and 'Test' variables, allowing for detailed model evaluation. A sample of the predictions is then displayed. Option names are given as understood from the PROC BART documentation and should be checked against your SAS Viya release.

/* Prepare the data in CAS */
cas;
LIBNAME mylib cas;

DATA mylib.JunkMail;
   SET Sashelp.JunkMail;
RUN;

/* Common options and prediction output */
/* MCMC settings: 1000 burn-in iterations, 5000 retained samples */
PROC BART DATA=mylib.JunkMail seed=67890 nbi=1000 nmc=5000;
   model Class = Make Address All _3d Our Internet Order Mail Receive Will / dist=binary;
   partition rolevar=Test(train='0' test='1');
   /* Save the predictions to a CAS table */
   OUTPUT out=mylib.BartPredictions copyvars=(Class Test) pred=ProbSpam;
RUN;

/* Display the first rows of the predictions */
PROC PRINT DATA=mylib.BartPredictions(obs=10);
RUN;
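A saved probability can be turned into a classification matrix by applying a cutoff and cross-tabulating the result against the observed class. The sketch below uses a tiny made-up table so that it runs on its own. The column names Class and ProbSpam mirror the output table above, and the 0.5 cutoff is an assumption, not something the procedure prescribes.

```sas
/* Made-up scores for illustration; in practice, read the OUTPUT table instead */
data work.Scored;
   input Class ProbSpam;
   PredictedSpam = (ProbSpam >= 0.5);   /* 1 = predicted spam, 0 = predicted not spam */
   datalines;
1 0.91
1 0.42
0 0.08
0 0.63
1 0.77
0 0.15
;
run;

/* Rows: observed class; columns: predicted class */
proc freq data=work.Scored;
   tables Class*PredictedSpam / norow nocol nopercent;
run;
```

Lowering the cutoff catches more spam at the cost of more false alarms, so the threshold is worth choosing with the application in mind.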

3 Code Block

PROC BART Data

Explanation:
This advanced example customizes the tree ensemble through options in the PROC BART statement: the number of trees (NTREE=100) and the minimum number of observations in a leaf (MINLEAFSIZE=3). Model performance is evaluated on the held-out test partition defined by the 'Test' variable. A new interaction variable, 'Length_Exclamation' (the product of CapLong and Exclamation), is created to demonstrate data manipulation before modeling. Predictions are saved for further analysis. Option names are given as understood from the PROC BART documentation and should be checked against your SAS Viya release.

/* Prepare the data in CAS */
cas;
LIBNAME mylib cas;

DATA mylib.JunkMail;
   SET Sashelp.JunkMail;
   /* Example variable transformation: create a simple interaction variable */
   Length_Exclamation = CapLong * Exclamation;
RUN;

/* Advanced case: tree hyperparameters and a held-out test partition */
PROC BART DATA=mylib.JunkMail seed=112233 ntree=100 minleafsize=3;
   model Class = Make Address All _3d Our Internet Order Mail Receive Will People Report Length_Exclamation / dist=binary;
   /* Evaluate the model on the test partition */
   partition rolevar=Test(train='0' test='1');
   OUTPUT out=mylib.BartAdvancedPred pred=ProbSpam;
RUN;

4 Code Block

PROC BART Data

Explanation:
This example highlights the integration of PROC BART with the SAS Viya/CAS environment. First, a BART model is trained and saved to a CAS table (BartModel_rstore) with the STORE statement. Then, a second invocation of PROC BART uses the RESTORE= option to load the saved model and score a subset of the data (here, the test set). The predicted probabilities are saved to a new CAS table, a DATA step converts them to a predicted class with a 0.5 cutoff, and a frequency table gives the classification matrix on the scored data. Saving and restoring a model in this way supports deployment and reuse in production. The STORE statement and RESTORE= option are given as understood from the PROC BART documentation and should be checked against your SAS Viya release.

/* Prepare the data in CAS */
cas;
LIBNAME mylib cas;

DATA mylib.JunkMail;
   SET Sashelp.JunkMail;
RUN;

/* Train and save the BART model */
PROC BART DATA=mylib.JunkMail seed=445566;
   model Class = Make Address All _3d Our Internet Order Mail Receive Will People Report Addresses Free Business / dist=binary;
   partition rolevar=Test(train='0' test='1');
   /* Save the fitted model in CAS for reuse */
   store out=mylib.BartModel_rstore;
RUN;

/* Load the saved model and score new data (here, the test set) */
/* Test is numeric in Sashelp.JunkMail, so the WHERE value is not quoted */
PROC BART DATA=mylib.JunkMail(where=(Test=1)) restore=mylib.BartModel_rstore;
   OUTPUT out=mylib.BartScoredData copyvars=(Class Test) pred=ScoreProb;
RUN;

/* Derive a predicted class from the probability (0.5 cutoff) */
DATA mylib.BartScoredClass;
   SET mylib.BartScoredData;
   ScorePred = (ScoreProb >= 0.5);
RUN;

/* Check the scoring results */
PROC FREQ DATA=mylib.BartScoredClass;
   tables Class*ScorePred;
RUN;

PROC PRINT DATA=mylib.BartScoredClass(obs=10);
RUN;

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Expert Advice

Michael

Viya Infrastructure Manager.

"When dealing with binary outcomes, pay close attention to the MCMC settings. Increasing the burn-in (nbi) and the number of retained samples gives the sampler more iterations to reach and explore the posterior distribution. If your classification results vary significantly between runs with different seeds, that is a sign you should run more MCMC iterations."
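
One way to act on this advice is to refit the same model with two seeds and compare the test-set error. The sketch below is a rough illustration: the macro `bart_seed` is a hypothetical helper, the NBI= and NMC= options and the PRED= keyword are written as understood from the PROC BART documentation, and the 0.5 cutoff is an assumption.

```sas
cas;
libname mylib cas;

data mylib.JunkMail;
   set Sashelp.JunkMail;
run;

/* Hypothetical helper: refit with a given seed and report the test-set error rate */
%macro bart_seed(seed);
   proc bart data=mylib.JunkMail seed=&seed nbi=1000 nmc=5000;
      model Class = Make Address All _3d Our Internet Order / dist=binary;
      partition rolevar=Test(train='0' test='1');
      output out=mylib.Pred_&seed copyvars=(Class Test) pred=ProbSpam;
   run;

   data work.Err_&seed;
      set mylib.Pred_&seed(where=(Test=1));
      /* Assumes ProbSpam is the predicted probability that Class=1 */
      Misclassified = ((ProbSpam >= 0.5) ne Class);
   run;

   title "Test misclassification rate, seed=&seed";
   proc means data=work.Err_&seed n mean;
      var Misclassified;
   run;
   title;
%mend bart_seed;

%bart_seed(12345)
%bart_seed(67890)
```

If the two error rates differ by more than seems reasonable for the application, raising the burn-in and the number of retained samples and rerunning is a sensible next step.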