
Multithreading in Data Mining and Machine Learning Procedures

Multithreading, or the concurrent execution of threads, allows for substantial performance gains compared to sequential execution. For procedures running in CAS, the number of threads is determined by the installation. The multithreading model is primarily based on dividing the data processed on a single machine among the available threads. For example, if a table of 1,000 observations is processed by four threads, each thread handles 250 observations. All operations accessing data (such as variable normalization, matrix formation, calculation of objectives, gradients, Hessians, and observation scoring) are then multithreaded. Matrix operations also benefit from multithreading if the matrix size is sufficient to justify managing multiple threads.
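One way to observe this division of observations is a DATA step that runs in CAS and reports how many rows each thread read. The following is a minimal sketch, not a definitive recipe: it assumes an active CAS session with the CASUSER libref assigned, and the table names `obs1000` and `rows_per_thread` are hypothetical. It relies on the `_THREADID_` and `_NTHREADS_` automatic variables of the DATA step in CAS. The actual split depends on the installation and on how CAS stores the table, so a small table may well be read by a single thread rather than in four equal shares of 250.

```sas
/* Assumption: a CAS session is active and the CASUSER libref is assigned,
   for example with:  cas mysess;  caslib _all_ assign;                   */

/* Build a 1,000-row table. With no CAS input table, this step is expected
   to run in the SAS client session and load its result into CAS.         */
data casuser.obs1000;
   do i = 1 to 1000;
      x = i;
      output;
   end;
   drop i;
run;

/* The input table is in CAS, so this step runs in CAS, once per thread.
   Each thread counts the rows it reads and writes one summary row when
   it reaches the end of its own share of the table.                      */
data casuser.rows_per_thread;
   set casuser.obs1000 end=last;
   rows_read + 1;                 /* per-thread running count */
   if last then do;
      thread_id = _threadid_;     /* identifier of the current thread */
      n_threads = _nthreads_;     /* number of threads running the step */
      output;
   end;
   keep thread_id n_threads rows_read;
run;

proc print data=casuser.rows_per_thread;
run;

/* Clean up the hypothetical demonstration tables */
proc casutil incaslib="casuser";
   droptable casdata="obs1000" quiet;
   droptable casdata="rows_per_thread" quiet;
quit;
```

The `rows_read` values across all output rows should sum to 1,000; threads that receive no rows write nothing.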


Examples use generated data (datalines) or SASHELP tables so that they are self-contained.

1 Code Block
PROC FACTMAC Data
Explanation :
This example illustrates a basic use of the FACTMAC procedure. It starts by creating a small CAS in-memory table (`casuser.produits`). Then `PROC FACTMAC` runs on this table with 'prix' (price) as the target and 'produit' (product) and 'stock' as the inputs. The scored output is saved in a new CAS table, `casuser.factmac_output`. The CAS server manages multithreading automatically, distributing observations among the available threads.
/* Create a small CAS table for the demonstration. */
/* Assumes an active CAS session and an assigned CASUSER libref. */
DATA casuser.produits;
   INPUT produit $ prix stock;
   DATALINES;
Pomme 1.0 100
Banane 0.5 200
Orange 1.2 150
Poire 0.8 120
Raisin 2.5 50
;
RUN;

/* Basic PROC FACTMAC run. */
/* FACTMAC has no MODEL statement: inputs are declared with INPUT */
/* (as nominal variables) and the response with TARGET. */
PROC FACTMAC DATA=casuser.produits;
   INPUT produit stock / LEVEL=NOMINAL;
   TARGET prix / LEVEL=INTERVAL;
   OUTPUT OUT=casuser.factmac_output COPYVARS=(produit stock prix);
RUN;

/* Print the first rows of the output */
PROC PRINT DATA=casuser.factmac_output (obs=5);
RUN;

/* Clean up the demonstration tables */
PROC CASUTIL incaslib="casuser";
   droptable casdata="produits" quiet;
   droptable casdata="factmac_output" quiet;
QUIT;
2 Code Block
PROC FOREST Data
Explanation :
This example uses `PROC FOREST` to train a random forest model that predicts customer loyalty. It sets common options: `NTREES=` for the number of trees to build, `MAXDEPTH=` for the maximum depth of the trees, and `SEED=` for reproducibility. CAS automatically parallelizes tree building and data processing across multiple threads, which speeds up model training. The fit statistics and the trained model are saved in CAS tables.
/* Create a CAS table for the classification demonstration */
DATA casuser.clients;
   call streaminit(123);
   DO i=1 to 1000;
      age = rand('uniform', 18, 65);
      revenu = rand('normal', 50000, 15000);
      /* Integer from 1 to 5; CEIL of UNIFORM(1,5) would never return 1 */
      satisfaction_produit = rand('integer', 1, 5);
      IF revenu > 60000 and age > 30 THEN loyal = 1;
      ELSE loyal = 0;
      OUTPUT;
   END;
   drop i;
RUN;

/* Run PROC FOREST with common options. */
/* NTREES=, MAXDEPTH=, and SEED= are PROC statement options; */
/* PROC FOREST has no TRAIN statement, and the model is saved with SAVESTATE. */
PROC FOREST DATA=casuser.clients NTREES=100 MAXDEPTH=10 SEED=1234;
   INPUT age revenu satisfaction_produit / LEVEL=INTERVAL;
   TARGET loyal / LEVEL=NOMINAL;
   ods OUTPUT FitStatistics=casuser.forest_stats;
   SAVESTATE RSTORE=casuser.forest_model;
RUN;

/* Print the fit statistics */
PROC PRINT DATA=casuser.forest_stats;
RUN;

/* Clean up the demonstration tables */
PROC CASUTIL incaslib="casuser";
   droptable casdata="clients" quiet;
   droptable casdata="forest_stats" quiet;
   droptable casdata="forest_model" quiet;
QUIT;
3 Code Block
PROC NNET Data
Explanation :
This example demonstrates scoring, the step in which a trained model (such as one from `PROC NNET`) is applied to new data. To keep the demonstration simple, `PROC NNET` is not called for training here: a DATA step with fixed coefficients simulates the scoring. Because the input table is in CAS, the DATA step runs directly on the in-memory data, and CAS distributes the rows across multiple threads, which is what makes scoring large tables efficient.
/* Create a CAS table with test data */
DATA casuser.test_data;
   call streaminit(456);
   DO i=1 to 200;
      age = rand('uniform', 20, 70);
      revenu = rand('normal', 60000, 20000);
      /* Integer from 1 to 5; CEIL of UNIFORM(1,5) would never return 1 */
      satisfaction_produit = rand('integer', 1, 5);
      OUTPUT;
   END;
   drop i;
RUN;

/* Stand-in for a pre-trained NNET model (simulated here with simple coefficients). */
/* In a real scenario, an NNET model would be trained and saved first. */
/* Here, only the scoring step is simulated. */

DATA casuser.scored_data;
   SET casuser.test_data;
   /* Simulated score computed from the inputs */
   score_loyal = 0.5 + 0.01 * age + 0.00001 * revenu - 0.1 * satisfaction_produit;
   IF score_loyal > 0.7 THEN predicted_loyal = 1;
   ELSE predicted_loyal = 0;
RUN;

/* Print the scored data */
PROC PRINT DATA=casuser.scored_data (obs=10);
RUN;

/* Clean up the demonstration tables */
PROC CASUTIL incaslib="casuser";
   droptable casdata="test_data" quiet;
   droptable casdata="scored_data" quiet;
QUIT;
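For comparison with the simulated scoring above, the following sketch shows what training and scoring with `PROC NNET` itself might look like. It is an illustration under stated assumptions, not a validated workflow: it assumes an active CAS session with the CASUSER libref assigned, the table names (`nnet_train`, `nnet_new`, `nnet_model`, `nnet_scored`) and the network settings (one hidden layer of 5 neurons, 50 iterations) are hypothetical, and the statement options are written from memory of the documented syntax, so they should be checked against your SAS Viya release.

```sas
/* Assumption: a CAS session is active and the CASUSER libref is assigned. */

/* Hypothetical training data with a binary target */
data casuser.nnet_train;
   call streaminit(123);
   do i = 1 to 1000;
      age = rand('uniform', 18, 65);
      revenu = rand('normal', 50000, 15000);
      loyal = (revenu > 60000 and age > 30);
      output;
   end;
   drop i;
run;

/* Hypothetical new data to score (no target variable) */
data casuser.nnet_new;
   call streaminit(456);
   do i = 1 to 200;
      age = rand('uniform', 20, 70);
      revenu = rand('normal', 60000, 20000);
      output;
   end;
   drop i;
run;

/* Train a small network and save the model to a CAS table */
proc nnet data=casuser.nnet_train;
   input age revenu / level=interval;
   target loyal / level=nominal;
   hidden 5;                                /* one hidden layer, 5 neurons */
   optimization algorithm=lbfgs maxiter=50;
   train outmodel=casuser.nnet_model;
run;

/* Score the new data with the saved model */
proc nnet data=casuser.nnet_new inmodel=casuser.nnet_model;
   score out=casuser.nnet_scored;
run;

proc print data=casuser.nnet_scored (obs=10);
run;

/* Clean up the hypothetical demonstration tables */
proc casutil incaslib="casuser";
   droptable casdata="nnet_train" quiet;
   droptable casdata="nnet_new" quiet;
   droptable casdata="nnet_model" quiet;
   droptable casdata="nnet_scored" quiet;
quit;
```

As with the other procedures, no thread option is set: both the training and the scoring run are left to the thread allocation of the CAS server.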
4 Code Block
PROC GRADBOOST Data
Explanation :
This example uses `PROC GRADBOOST` to train a gradient boosting model on sales data in CAS. In `PROC GRADBOOST`, the `PARTITION` statement assigns observations to training and validation roles; it does not control how data is distributed across CAS nodes or threads. The region identifier `region_id` is therefore used as a nominal input, and `PARTITION FRACTION` holds out part of the data for validation. The code sets no thread option: the CAS server allocates threads automatically based on the environment configuration and the data size.
/* Create a CAS table with a region identifier */
DATA casuser.ventes;
   call streaminit(789);
   DO i=1 to 5000;
      /* 4 regions; CEIL of UNIFORM(1,4) would only return 2 to 4 */
      region_id = rand('integer', 1, 4);
      publicite = rand('uniform', 0, 1000);
      prix = rand('uniform', 10, 100);
      ventes_reelles = 500 + 2 * publicite - 3 * prix + (region_id * 100) + rand('normal', 0, 50);
      OUTPUT;
   END;
   drop i;
RUN;

/* PROC GRADBOOST with a validation partition; threading is left to CAS */
PROC GRADBOOST DATA=casuser.ventes;
   INPUT publicite prix / LEVEL=INTERVAL;
   INPUT region_id / LEVEL=NOMINAL;
   TARGET ventes_reelles / LEVEL=INTERVAL;
   /* Hold out 30% of the observations for validation */
   PARTITION FRACTION(VALIDATE=0.3 SEED=789);
   /* Multithreading is managed by the CAS server. */
   /* Some procedures accept an NTHREADS option to set an explicit limit. */
   /* Here, CAS handles the default automatic thread allocation. */
   ods OUTPUT FitStatistics=casuser.gradboost_hist;
RUN;

/* Print the fit statistics recorded as trees are added */
PROC PRINT DATA=casuser.gradboost_hist (obs=10);
RUN;

/* Clean up the demonstration tables */
PROC CASUTIL incaslib="casuser";
   droptable casdata="ventes" quiet;
   droptable casdata="gradboost_hist" quiet;
QUIT;
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Copyright Info : Copyright © SAS Institute Inc. All rights reserved.