Published on :
Statistics CREATION_INTERNE

Quantile Binning (BINNING Procedure)

This code is also available in: Deutsch Español Français
Awaiting validation
The BINNING procedure is a powerful tool for discretizing continuous variables. When the 'QUANTILE' method is specified, the procedure analyzes the distribution of the variable and creates a defined number of bins, such that each bin contains an approximately equal proportion of observations. This is particularly useful for managing skewed distributions, reducing noise in data, or preparing variables for models that require categorical inputs. Key options include 'DATA' for the CAS input table, 'NUMBIN' to specify the desired number of bins, 'METHOD=QUANTILE' to enable the quantile binning method, 'INPUT' to select variables to discretize, and 'OUTPUT' or 'OUTBIN' to generate output tables containing binned data or bin details.
Data Analysis

Type : CREATION_INTERNE


All examples provide complete code for autonomous generation of demonstration data directly within the CAS session, thus ensuring their executability without external dependencies. Data is generated using DATA steps and the RAND() function.

1 Code Block
PROC BINNING Data
Explanation :
This example demonstrates the simplest use of the BINNING procedure for quantile binning. It creates a 'data_basic' table in the CAS session with two continuous variables. Then, PROC BINNING is called with the 'NUMBIN=10' option to divide 'x1' and 'x2' into 10 quantile bins. The resulting 'binned_basic' table contains the new binned variables.
Copied!
1/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2caslib _all_ assign;
3 
4/* 1. Création des données en mémoire CAS */
5DATA mylib.data_basic;
6 call streaminit(123); /* Pour la reproductibilité */
7 DO id=1 to 10000;
8 x1 = rand("Uniform");
9 x2 = rand("Uniform") * 10;
10 OUTPUT;
11 END;
12RUN;
13 
14/* 2. Binning par quantiles */
15PROC BINNING DATA=mylib.data_basic numbin=10 method=quantile;
16 INPUT x1 x2;
17 OUTPUT out=mylib.binned_basic;
18RUN;
19 
20/* 3. Affichage des détails des bins pour vérification (échantillon) */
21PROC PRINT DATA=mylib.binned_basic (obs=20);
22RUN;
23 
2 Code Block
PROC BINNING Data
Explanation :
This example illustrates how to customize binning by specifying a different number of bins (NUMBIN=5) and using the 'OUTBIN' option. The 'OUTBIN' option creates a separate output table ('bin_details_inter') that contains detailed information about each bin created, such as limits, number of observations, and bin identifiers. This is useful for inspecting the discretization structure.
Copied!
1/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2caslib _all_ assign;
3 
4/* 1. Création des données en mémoire CAS */
5DATA mylib.data_inter;
6 call streaminit(456); /* Pour la reproductibilité */
7 DO id=1 to 15000;
8 x1 = rand("Uniform");
9 x2 = rand("Normal"); /* Une autre distribution */
10 x3 = rand("Uniform") * 50;
11 OUTPUT;
12 END;
13RUN;
14 
15/* 2. Binning par quantiles avec options courantes */
16PROC BINNING DATA=mylib.data_inter numbin=5 method=quantile;
17 INPUT x1 x2 x3; /* Binning sur plusieurs variables */
18 OUTPUT out=mylib.binned_inter;
19 outbin out=mylib.bin_details_inter; /* Sauvegarde les détails des bins */
20RUN;
21 
22/* 3. Affichage des détails des bins */
23PROC PRINT DATA=mylib.bin_details_inter;
24 var _VARNAME_ _BINID_ _LOWER_ _UPPER_ _COUNT_;
25RUN;
26 
3 Code Block
PROC BINNING Data
Explanation :
This advanced example shows how to apply quantile binning independently for data subgroups using the 'BY' statement. A 'group' variable is created in the input data, and the BINNING procedure is executed separately for each 'group' value. The 'bin_stats_adv' table is generated to include comprehensive descriptive statistics for each bin and group, such as mean, standard deviation, minimum, and maximum.
Copied!
1/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2caslib _all_ assign;
3 
4/* 1. Création des données en mémoire CAS avec un groupe */
5DATA mylib.data_adv;
6 call streaminit(789); /* Pour la reproductibilité */
7 DO id=1 to 20000;
8 IF mod(id, 2)=0 THEN group = "A";
9 ELSE group = "B";
10 x1 = rand("Uniform") + (group="A") * 0.5; /* x1 dépend du groupe */
11 x2 = rand("Normal") + (group="B") * 2; /* x2 dépend du groupe */
12 OUTPUT;
13 END;
14RUN;
15 
16/* 2. Binning par quantiles avec BY group et statistiques de sortie */
17PROC BINNING DATA=mylib.data_adv numbin=4 method=quantile;
18 BY group; /* Binning effectué séparément pour chaque groupe */
19 INPUT x1 x2;
20 OUTPUT out=mylib.binned_adv;
21 outbin out=mylib.bin_stats_adv;
22RUN;
23 
24/* 3. Affichage des détails des bins par groupe */
25PROC PRINT DATA=mylib.bin_stats_adv;
26 BY group;
27 var group _VARNAME_ _BINID_ _LOWER_ _UPPER_ _COUNT_ MEAN STD MIN MAX;
28RUN;
29 
4 Code Block
PROC BINNING Data
Explanation :
This example focuses on SAS Viya and the CAS engine's capabilities for handling very large datasets. It generates 5 million observations directly in CAS memory, then applies the BINNING procedure. Afterwards, it uses 'PROC CASUTIL' to directly interact with the CAS session: list tables, describe the structure of the binned table, and retrieve a sample of the data to verify binning application. This demonstrates a complete approach for CAS-based workflows.
Copied!
1/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2caslib _all_ assign;
3 
4/* 1. Création d'une table CAS très volumineuse */
5DATA mylib.data_cas_large;
6 call streaminit(101112); /* Pour la reproductibilité */
7 DO id=1 to 5000000; /* 5 millions d'observations */
8 x1 = rand("Uniform");
9 x2 = rand("Exponential");
10 OUTPUT;
11 END;
12RUN;
13 
14/* 2. Binning par quantiles sur la grande table CAS */
15PROC BINNING DATA=mylib.data_cas_large numbin=10 method=quantile;
16 INPUT x1 x2;
17 OUTPUT out=mylib.binned_cas_large;
18 outbin out=mylib.bin_stats_cas_large;
19RUN;
20 
21/* 3. Inspection des propriétés de la table CAS résultante */
22PROC CASUTIL incaslib="mylib";
23 list files; /* Lister les fichiers dans la caslib pour voir les tables créées */
24 describe TABLE="binned_cas_large"; /* Décrire la structure de la table binnée */
25 fetch casdata="binned_cas_large" to=out; /* Récupérer un échantillon pour vérification */
26RUN;
27QUIT;
28 
29/* 4. Affichage d'un échantillon de la table binnée */
30PROC PRINT DATA=out (obs=10);
31RUN;
32 
This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.
Banner
Expert Advice
Expert
Michael
Responsable de l'infrastructure Viya.
« When using the OUTBIN option, pay close attention to the _LOWER_ and _UPPER_ values. If you see multiple bins with very similar or identical boundaries, it usually indicates a high frequency of "ties" (identical values) in your data. In such cases, you might consider reducing NUMBIN or pre-processing the duplicates to ensure the bins remain statistically meaningful. »