Quantile Binning (BINNING Procedure)

The BINNING procedure is a powerful tool for discretizing continuous variables. When the 'QUANTILE' method is specified, the procedure analyzes the distribution of the variable and creates a defined number of bins, such that each bin contains an approximately equal proportion of observations. This is particularly useful for managing skewed distributions, reducing noise in data, or preparing variables for models that require categorical inputs. Key options include 'DATA' for the CAS input table, 'NUMBIN' to specify the desired number of bins, 'METHOD=QUANTILE' to enable the quantile binning method, 'INPUT' to select variables to discretize, and 'OUTPUT' or 'OUTBIN' to generate output tables containing binned data or bin details.

Data Analysis

Type : CREATION_INTERNE

All examples provide complete code for autonomous generation of demonstration data directly within the CAS session, thus ensuring their executability without external dependencies. Data is generated using DATA steps and the RAND() function.

1 Code Block

PROC BINNING Data

Explanation :
This example demonstrates the simplest use of the BINNING procedure for quantile binning. It creates a 'data_basic' table in the CAS session with two continuous variables. Then, PROC BINNING is called with the 'NUMBIN=10' option to divide 'x1' and 'x2' into 10 quantile bins. The resulting 'binned_basic' table contains the new binned variables.

Copied!

1	/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2	caslib _all_ assign;
3
4	/* 1. Création des données en mémoire CAS */
5	DATA mylib.data_basic;
6	call streaminit(123); /* Pour la reproductibilité */
7	DO id=1 to 10000;
8	x1 = rand("Uniform");
9	x2 = rand("Uniform") * 10;
10	OUTPUT;
11	END;
12	RUN;
13
14	/* 2. Binning par quantiles */
15	PROC BINNING DATA=mylib.data_basic numbin=10 method=quantile;
16	INPUT x1 x2;
17	OUTPUT out=mylib.binned_basic;
18	RUN;
19
20	/* 3. Affichage des détails des bins pour vérification (échantillon) */
21	PROC PRINT DATA=mylib.binned_basic (obs=20);
22	RUN;
23

2 Code Block

PROC BINNING Data

Explanation :
This example illustrates how to customize binning by specifying a different number of bins (NUMBIN=5) and using the 'OUTBIN' option. The 'OUTBIN' option creates a separate output table ('bin_details_inter') that contains detailed information about each bin created, such as limits, number of observations, and bin identifiers. This is useful for inspecting the discretization structure.

Copied!

1	/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2	caslib _all_ assign;
3
4	/* 1. Création des données en mémoire CAS */
5	DATA mylib.data_inter;
6	call streaminit(456); /* Pour la reproductibilité */
7	DO id=1 to 15000;
8	x1 = rand("Uniform");
9	x2 = rand("Normal"); /* Une autre distribution */
10	x3 = rand("Uniform") * 50;
11	OUTPUT;
12	END;
13	RUN;
14
15	/* 2. Binning par quantiles avec options courantes */
16	PROC BINNING DATA=mylib.data_inter numbin=5 method=quantile;
17	INPUT x1 x2 x3; /* Binning sur plusieurs variables */
18	OUTPUT out=mylib.binned_inter;
19	outbin out=mylib.bin_details_inter; /* Sauvegarde les détails des bins */
20	RUN;
21
22	/* 3. Affichage des détails des bins */
23	PROC PRINT DATA=mylib.bin_details_inter;
24	var _VARNAME_ _BINID_ _LOWER_ _UPPER_ _COUNT_;
25	RUN;
26

3 Code Block

PROC BINNING Data

Explanation :
This advanced example shows how to apply quantile binning independently for data subgroups using the 'BY' statement. A 'group' variable is created in the input data, and the BINNING procedure is executed separately for each 'group' value. The 'bin_stats_adv' table is generated to include comprehensive descriptive statistics for each bin and group, such as mean, standard deviation, minimum, and maximum.

Copied!

1	/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2	caslib _all_ assign;
3
4	/* 1. Création des données en mémoire CAS avec un groupe */
5	DATA mylib.data_adv;
6	call streaminit(789); /* Pour la reproductibilité */
7	DO id=1 to 20000;
8	IF mod(id, 2)=0 THEN group = "A";
9	ELSE group = "B";
10	x1 = rand("Uniform") + (group="A") * 0.5; /* x1 dépend du groupe */
11	x2 = rand("Normal") + (group="B") * 2; /* x2 dépend du groupe */
12	OUTPUT;
13	END;
14	RUN;
15
16	/* 2. Binning par quantiles avec BY group et statistiques de sortie */
17	PROC BINNING DATA=mylib.data_adv numbin=4 method=quantile;
18	BY group; /* Binning effectué séparément pour chaque groupe */
19	INPUT x1 x2;
20	OUTPUT out=mylib.binned_adv;
21	outbin out=mylib.bin_stats_adv;
22	RUN;
23
24	/* 3. Affichage des détails des bins par groupe */
25	PROC PRINT DATA=mylib.bin_stats_adv;
26	BY group;
27	var group _VARNAME_ _BINID_ _LOWER_ _UPPER_ _COUNT_ MEAN STD MIN MAX;
28	RUN;
29

4 Code Block

PROC BINNING Data

Explanation :
This example focuses on SAS Viya and the CAS engine's capabilities for handling very large datasets. It generates 5 million observations directly in CAS memory, then applies the BINNING procedure. Afterwards, it uses 'PROC CASUTIL' to directly interact with the CAS session: list tables, describe the structure of the binned table, and retrieve a sample of the data to verify binning application. This demonstrates a complete approach for CAS-based workflows.

Copied!

1	/* Création d'une bibliothèque CAS pour les exemples si elle n'existe pas */
2	caslib _all_ assign;
3
4	/* 1. Création d'une table CAS très volumineuse */
5	DATA mylib.data_cas_large;
6	call streaminit(101112); /* Pour la reproductibilité */
7	DO id=1 to 5000000; /* 5 millions d'observations */
8	x1 = rand("Uniform");
9	x2 = rand("Exponential");
10	OUTPUT;
11	END;
12	RUN;
13
14	/* 2. Binning par quantiles sur la grande table CAS */
15	PROC BINNING DATA=mylib.data_cas_large numbin=10 method=quantile;
16	INPUT x1 x2;
17	OUTPUT out=mylib.binned_cas_large;
18	outbin out=mylib.bin_stats_cas_large;
19	RUN;
20
21	/* 3. Inspection des propriétés de la table CAS résultante */
22	PROC CASUTIL incaslib="mylib";
23	list files; /* Lister les fichiers dans la caslib pour voir les tables créées */
24	describe TABLE="binned_cas_large"; /* Décrire la structure de la table binnée */
25	fetch casdata="binned_cas_large" to=out; /* Récupérer un échantillon pour vérification */
26	RUN;
27	QUIT;
28
29	/* 4. Affichage d'un échantillon de la table binnée */
30	PROC PRINT DATA=out (obs=10);
31	RUN;
32

This material is provided "as is" by We Are Cas. There are no warranties, expressed or implied, as to merchantability or fitness for a particular purpose regarding the materials or code contained herein. We Are Cas is not responsible for errors in this material as it now exists or will exist, nor does We Are Cas provide technical support for it.

Retour à la liste

Expert Advice

Michael

Responsable de l'infrastructure Viya.

« When using the OUTBIN option, pay close attention to the _LOWER_ and _UPPER_ values. If you see multiple bins with very similar or identical boundaries, it usually indicates a high frequency of "ties" (identical values) in your data. In such cases, you might consider reducing NUMBIN or pre-processing the duplicates to ensure the bins remain statistically meaningful. »