
FOREST Procedure (Random Forests)

The FOREST procedure is a supervised learning tool that handles complex data and produces accurate predictions. It is valued in particular for its resistance to overfitting and its robustness to outliers and noise. The procedure runs in distributed memory on the CAS server, which allows it to process very large datasets. It supports continuous and categorical input variables, and either a continuous (regression) or categorical (classification) target variable. Options are available for data partitioning, variable selection, the number of trees, tree depth, and assessment of variable importance.


Examples use generated data (datalines) or data from the SASHELP library, adapted to be loaded into CAS.
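As a quick orientation before the full examples, here is a minimal sketch of the general call pattern; the table and variable names (mycas.MyTable, x1, x2, Category, y) are placeholders, not objects from the examples below:

/* General call pattern (placeholder table and variable names) */
PROC FOREST DATA=mycas.MyTable NTREES=100 SEED=1;
   INPUT x1 x2 / LEVEL=interval;     /* continuous inputs */
   INPUT Category / LEVEL=nominal;   /* categorical input */
   TARGET y / LEVEL=interval;        /* interval target = regression; use LEVEL=nominal for classification */
RUN;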

Code Block 1: Basic Classification with PROC FOREST
Explanation:
This example illustrates a simple binary classification with PROC FOREST. It creates a CAS table named 'CreditData' containing age, credit score, income, and client status (the target). The procedure is then called with 'Age', 'ScoreCredit', and 'Revenu' as input variables and 'StatutClient' as the nominal target variable. This is the most basic usage for training a random forest model.
/* Create a CAS session and a caslib */
CAS;
LIBNAME mycas CAS;

/* Sample data for binary classification */
DATA mycas.CreditData;
   INPUT Age ScoreCredit Revenu StatutClient $;
   DATALINES;
25 700 50000 Bon
30 650 40000 Mauvais
35 720 60000 Bon
40 600 30000 Mauvais
45 750 70000 Bon
50 680 45000 Mauvais
60 710 55000 Bon
28 670 38000 Mauvais
33 730 62000 Bon
55 690 48000 Bon
;
RUN;

/* Run the FOREST procedure for classification */
PROC FOREST DATA=mycas.CreditData;
   INPUT Age ScoreCredit Revenu;
   TARGET StatutClient / LEVEL=nominal;
RUN;

/* Terminate the CAS session when finished */
CAS TERMINATE;
Code Block 2: Regression with Partitioning and Variable Importance
Explanation:
This intermediate example uses PROC FOREST for a regression task that predicts sales. It introduces the PARTITION statement to hold out 30% of the observations for testing (the remaining 70% are used for training). The 'Region' variable is declared nominal in its own INPUT statement. The variable importance table produced by the procedure is captured with ODS OUTPUT, and the trained model is saved as an analytic store with the SAVESTATE statement.
/* Create a CAS session and a caslib */
CAS;
LIBNAME mycas CAS;

/* Extended sample data for regression with a categorical variable */
DATA mycas.SalesData;
   INPUT Publicite Internet Vendeurs Region $ Ventes;
   DATALINES;
10 5 2 Est 100
12 6 3 Ouest 120
8 4 2 Nord 90
15 7 4 Sud 150
11 5 3 Est 110
13 6 4 Ouest 130
9 4 2 Nord 95
16 8 5 Sud 160
10 5 3 Est 105
14 7 4 Ouest 140
;
RUN;

/* Run the FOREST procedure with partitioning and variable importance output */
PROC FOREST DATA=mycas.SalesData SEED=12345;
   INPUT Publicite Internet Vendeurs / LEVEL=interval;
   INPUT Region / LEVEL=nominal;
   TARGET Ventes / LEVEL=interval;
   PARTITION FRACTION(TEST=0.3 SEED=12345);   /* 70% training, 30% test */
   ODS OUTPUT VariableImportance=mycas.VarImp;
   SAVESTATE RSTORE=mycas.ForestModel;
RUN;

/* Display the variable importance table */
PROC PRINT DATA=mycas.VarImp;
RUN;

/* Terminate the CAS session when finished */
CAS TERMINATE;
Code Block 3: Hyperparameter Tuning and Scoring with PROC ASTORE
Explanation:
This advanced example shows a regression for house price prediction. It uses the options NTREES= (number of trees), MAXDEPTH= (maximum tree depth), VARS_TO_TRY= (number of variables sampled as split candidates at each node), and INBAGFRACTION= (fraction of observations drawn for each tree's bootstrap sample) to control how the random forest is trained. The trained model is saved as an analytic store and used to score new data with PROC ASTORE.
/* Create a CAS session and a caslib */
CAS;
LIBNAME mycas CAS;

/* Sample data for a more complex regression */
DATA mycas.HousingPrices;
   INPUT Surface Chambres NbSallesBains AgeMaison Garage NbEtages PrixMaison;
   DATALINES;
1500 3 2 10 1 2 250000
1200 2 1 20 0 1 180000
2000 4 3 5 2 3 350000
1000 2 1 30 0 1 150000
1800 3 2 15 1 2 290000
1300 3 1 25 1 1 200000
2200 4 3 8 2 3 380000
900 2 1 40 0 1 130000
1700 3 2 12 1 2 270000
1600 3 2 18 1 2 260000
;
RUN;

/* Run the FOREST procedure with tuned hyperparameters and detailed output */
PROC FOREST DATA=mycas.HousingPrices NTREES=100 MAXDEPTH=10 VARS_TO_TRY=5
            INBAGFRACTION=0.7 SEED=54321;
   INPUT Surface Chambres NbSallesBains AgeMaison Garage NbEtages;
   TARGET PrixMaison / LEVEL=interval;
   ODS OUTPUT FitStatistics=mycas.FitStats;
   SAVESTATE RSTORE=mycas.ForestModel_Adv;
RUN;

/* New data for prediction */
DATA mycas.NewHouses;
   INPUT Surface Chambres NbSallesBains AgeMaison Garage NbEtages;
   DATALINES;
1400 3 2 12 1 2
1900 4 2 7 2 3
;
RUN;

/* Apply the model to make predictions */
PROC ASTORE;
   SCORE DATA=mycas.NewHouses
         OUT=mycas.NewHouses_Scored
         RSTORE=mycas.ForestModel_Adv;
RUN;

PROC PRINT DATA=mycas.NewHouses_Scored;
RUN;

/* Terminate the CAS session when finished */
CAS TERMINATE;
Code Block 4: Classification Workflow on a SASHELP Table in CAS
Explanation:
This example highlights the use of PROC FOREST in a CAS environment for classification. It loads an existing dataset (SASHELP.CLASS) into CAS memory, creates a new binary target variable ('TooOld'), and then trains a random forest model with adjusted NTREES= and MAXDEPTH= values. The model is saved as an analytic store and used to predict on a new dataset, demonstrating the typical workflow in a distributed environment. Note that the target is derived directly from Age, so Age alone predicts it perfectly; the example demonstrates the workflow rather than a realistic model.
/* Establish a CAS session */
CAS;
LIBNAME mycas CAS;

/* Load a SASHELP dataset into CAS to simulate a large dataset */
/* (Make sure the SASHELP.CLASS dataset is available) */
PROC CASUTIL;
   LOAD DATA=SASHELP.CLASS OUTCASLIB="casuser" CASOUT="ClassData" REPLACE;
RUN;

/* Data preparation: add a binary target variable for classification */
/* Example: 'TooOld' if Age > 14 */
DATA mycas.ClassDataPrepared;
   SET mycas.ClassData;
   IF Age > 14 THEN TooOld = 1;
   ELSE TooOld = 0;
RUN;

/* Run the FOREST procedure on the CAS table */
/* (Age trivially predicts the target here; this is a workflow demo) */
PROC FOREST DATA=mycas.ClassDataPrepared NTREES=200 MAXDEPTH=15 SEED=67890;
   INPUT Age Height Weight;
   TARGET TooOld / LEVEL=nominal;
   ODS OUTPUT FitStatistics=mycas.ForestFitStats;
   SAVESTATE RSTORE=mycas.BinaryForestModel;
RUN;

/* Check the fit statistics */
PROC PRINT DATA=mycas.ForestFitStats;
RUN;

/* Load new data for scoring */
DATA mycas.NewStudents;
   INPUT Name $ Age Height Weight;
   DATALINES;
John 15 65 120
Jane 12 58 90
Mike 17 70 150
Sarah 13 60 100
;
RUN;

/* Score the new data with the trained model */
PROC ASTORE;
   SCORE DATA=mycas.NewStudents
         OUT=mycas.NewStudents_Scored
         RSTORE=mycas.BinaryForestModel;
RUN;

PROC PRINT DATA=mycas.NewStudents_Scored;
RUN;

/* Terminate the CAS session */
CAS TERMINATE;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright © SAS Institute Inc. All rights reserved.


Expert Advice
Stéphanie, Machine Learning and AI Specialist:
« In SAS Viya, PROC FOREST implements the Random Forest algorithm, an ensemble method that builds hundreds of decision trees and combines them into a robust consensus prediction. By training each tree on a different bootstrap sample and a random subset of variables, it greatly reduces the "memorization" (overfitting) problems common in single decision trees.

Pay close attention to the ntrees= and maxdepth= parameters. While more trees generally improve accuracy, they also increase memory consumption and scoring time. Start with 50–100 trees and monitor the FitStatistics table; if the error rate plateaus, adding more trees will only yield diminishing returns at the cost of performance. »
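To make this advice concrete, here is a minimal sketch of that monitoring step, reusing the mycas.CreditData table from Code Block 1; the work.FitByTrees table name is an arbitrary choice:

/* Train with a moderate number of trees and capture the fit statistics,
   which report error as trees are added to the forest */
PROC FOREST DATA=mycas.CreditData NTREES=100 SEED=2024;
   INPUT Age ScoreCredit Revenu;
   TARGET StatutClient / LEVEL=nominal;
   ODS OUTPUT FitStatistics=work.FitByTrees;
RUN;

/* Inspect how the out-of-bag error evolves; if it plateaus well before
   100 trees, a smaller NTREES= value will score faster with similar accuracy */
PROC PRINT DATA=work.FitByTrees;
RUN;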