Multithreading in Data Mining and Machine Learning Procedures

Multithreading, or the concurrent execution of threads, allows for substantial performance gains compared to sequential execution. For procedures running in CAS, the number of threads is determined by the installation. The multithreading model is primarily based on dividing the data processed on a single machine among the available threads. For example, if a table of 1,000 observations is processed by four threads, each thread handles 250 observations. All operations accessing data (such as variable normalization, matrix formation, calculation of objectives, gradients, Hessians, and observation scoring) are then multithreaded. Matrix operations also benefit from multithreading if the matrix size is sufficient to justify managing multiple threads.
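
As a rough way to observe this division of work, the sketch below tags each observation with the thread that processed it. It assumes an active CAS session and a `casuser` libref that is already assigned; the table names `obs_demo` and `obs_by_thread` are hypothetical. `_THREADID_` and `_NTHREADS_` are automatic variables available when the DATA step runs in CAS. The actual split depends on how the server has distributed the rows, so it need not be an even 250 per thread, and for a table this small all rows may land on a single thread.

```sas
/* Hypothetical demo table of 1,000 observations.              */
/* SINGLE=YES keeps this generating step in one thread so that */
/* exactly 1,000 rows are written.                             */
data casuser.obs_demo / single=yes;
   do id = 1 to 1000;
      output;
   end;
run;

/* Reading a CAS table lets the DATA step run in CAS threads. */
/* Each thread records its own ID on the rows it handles.     */
data casuser.obs_by_thread;
   set casuser.obs_demo;
   thread_id = _threadid_;
   n_threads = _nthreads_;
run;

/* Count the observations handled by each thread */
proc freq data=casuser.obs_by_thread;
   tables thread_id / nocum nopercent;
run;

/* Clean up the demo tables */
proc casutil incaslib="casuser";
   droptable casdata="obs_demo" quiet;
   droptable casdata="obs_by_thread" quiet;
quit;
```

The frequency table lists one row per thread that received data, which makes it easy to check how evenly the observations were spread on a given installation.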

The examples use generated data (DATALINES or simulated values) or SASHELP tables so that they are self-contained.

1 Code Block

PROC FACTMAC Data

Explanation :
This example illustrates a basic use of the FACTMAC procedure. It starts by creating a small CAS in-memory data table (`casuser.produits`). Then, `PROC FACTMAC` is executed on this table, specifying a simple model where 'prix' (price) is the dependent variable and 'produit' (product) and 'stock' are the predictors. The result is saved in a new CAS table, 'casuser.factmac_output'. Multithreading is automatically managed by the CAS server to optimize calculations, by distributing observations among available threads.

/* Create a small CAS table for the demonstration */
data casuser.produits;
   input produit $ prix stock;
   datalines;
Pomme 1.0 100
Banane 0.5 200
Orange 1.2 150
Poire 0.8 120
Raisin 2.5 50
;
run;

/* Basic PROC FACTMAC run. The procedure has no MODEL statement:     */
/* predictors go in INPUT (nominal level) and the response in TARGET. */
/* stock is therefore treated as a categorical level here.            */
proc factmac data=casuser.produits;
   input produit stock / level=nominal;
   target prix / level=interval;
   output out=casuser.factmac_output;
run;

/* Print the first rows of the output table */
proc print data=casuser.factmac_output (obs=5);
run;

/* Clean up the CAS tables */
proc casutil incaslib="casuser";
   droptable casdata="produits" quiet;
   droptable casdata="factmac_output" quiet;
quit;

2 Code Block

PROC FOREST Data

Explanation :
This example uses `PROC FOREST` to train a random forest model that predicts customer loyalty. It sets common options: `NTREES=` for the number of trees to build, `MAXDEPTH=` for the maximum tree depth, and `SEED=` for reproducibility. CAS parallelizes the tree building and data processing across multiple threads, which speeds up model training. The fit statistics and the trained model are saved in CAS tables.

/* Create a CAS table for the classification demonstration */
data casuser.clients;
   call streaminit(123);
   do i = 1 to 1000;
      age = rand('uniform', 18, 65);
      revenu = rand('normal', 50000, 15000);
      /* Integer rating from 1 to 5; ceil(rand('uniform', 1, 5)) would almost never return 1 */
      satisfaction_produit = rand('integer', 1, 5);
      if revenu > 60000 and age > 30 then loyal = 1;
      else loyal = 0;
      output;
   end;
   drop i;
run;

/* Train the forest. NTREES=, MAXDEPTH=, and SEED= are PROC statement options; */
/* PROC FOREST has no TRAIN statement.                                         */
proc forest data=casuser.clients ntrees=100 maxdepth=10 seed=1234;
   input age revenu satisfaction_produit / level=interval;
   target loyal / level=nominal;            /* binary target: classification */
   ods output FitStatistics=casuser.forest_stats;
   savestate rstore=casuser.forest_model;   /* save the model as an analytic store */
run;

/* Print the fit statistics */
proc print data=casuser.forest_stats;
run;

/* Clean up the CAS tables */
proc casutil incaslib="casuser";
   droptable casdata="clients" quiet;
   droptable casdata="forest_stats" quiet;
   droptable casdata="forest_model" quiet;
quit;

3 Code Block

PROC NNET Data

Explanation :
This example demonstrates scoring data with a machine learning model, following the common pattern in which a model (such as one built with NNET) is trained first and then applied to new data. `PROC NNET` is not called here; to keep the demonstration simple, the trained model is replaced by a fixed linear formula. Scoring a large volume of data is still a good fit for multithreading: the DATA step runs directly on the table in CAS memory, and CAS distributes the scoring work across multiple threads, so large tables are processed efficiently.

/* Create a CAS table of test data */
data casuser.test_data;
   call streaminit(456);
   do i = 1 to 200;
      age = rand('uniform', 20, 70);
      revenu = rand('normal', 60000, 20000);
      /* Integer rating from 1 to 5; ceil(rand('uniform', 1, 5)) would almost never return 1 */
      satisfaction_produit = rand('integer', 1, 5);
      output;
   end;
   drop i;
run;

/* Stand-in for a pre-trained NNET model (simulated with simple coefficients). */
/* In a real scenario, an NNET model would be trained and saved first.         */
/* Here, only the scoring step is simulated.                                   */

data casuser.scored_data;
   set casuser.test_data;
   /* Simple simulated score based on the inputs */
   score_loyal = 0.5 + 0.01 * age + 0.00001 * revenu - 0.1 * satisfaction_produit;
   if score_loyal > 0.7 then predicted_loyal = 1;
   else predicted_loyal = 0;
run;

/* Print the first scored rows */
proc print data=casuser.scored_data (obs=10);
run;

/* Clean up the CAS tables */
proc casutil incaslib="casuser";
   droptable casdata="test_data" quiet;
   droptable casdata="scored_data" quiet;
quit;

4 Code Block

PROC GRADBOOST Data

Explanation :
This example uses `PROC GRADBOOST` to train a gradient boosting model on simulated sales data in CAS. The `PARTITION` statement reserves a fraction of the observations for validation; it does not control how data are distributed across CAS nodes or threads, so `region_id` is supplied as a nominal input instead. No thread option is set: the CAS server allocates threads according to the environment configuration, and the training work is divided among them as described at the top of this page.

/* Create a CAS table of simulated sales with a region identifier */
data casuser.ventes;
   call streaminit(789);
   do i = 1 to 5000;
      /* 4 regions; ceil(rand('uniform', 1, 4)) would yield only 2, 3, and 4 */
      region_id = rand('integer', 1, 4);
      publicite = rand('uniform', 0, 1000);
      prix = rand('uniform', 10, 100);
      ventes_reelles = 500 + 2 * publicite - 3 * prix + (region_id * 100) + rand('normal', 0, 50);
      output;
   end;
   drop i;
run;

/* PROC GRADBOOST with a validation partition and implicit CAS threading */
proc gradboost data=casuser.ventes;
   input publicite prix / level=interval;
   input region_id / level=nominal;
   target ventes_reelles / level=interval;
   partition fraction(validate=0.3 seed=789);
   /* Multithreading is managed by the CAS server.                         */
   /* Some procedures accept an NTHREADS option to set an explicit limit.  */
   /* Here, CAS allocates threads automatically.                           */
   /* FitStatistics is assumed to be the ODS table holding the error by number of trees. */
   ods output FitStatistics=casuser.gradboost_hist;
run;

/* Print the first rows of the fit history */
proc print data=casuser.gradboost_hist (obs=10);
run;

/* Clean up the CAS tables */
proc casutil incaslib="casuser";
   droptable casdata="ventes" quiet;
   droptable casdata="gradboost_hist" quiet;
quit;

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
