When conversion functions or OR clauses in a SQL join prevent index usage, performance often collapses. The critical best practice is to switch to a Hash Object within a Data Step: this in-memory lookup technique bypasses the need for sorting and Cartesian products, turning hours of processing into mere seconds for complex keys.
One of these critical scenarios occurs during joins involving complex conditions, particularly the use of OR combined with on-the-fly variable transformations in the ON clause.
This article explores how to transform an inefficient SQL query into an ultra-fast Data step using Hash Objects.
Let's imagine the following scenario, which is common in banking or administrative data processing:
You need to join two medium-sized tables (for example, 500,000 rows and 300,000 rows). The challenge is that the join keys are not direct. The identifiers are stored in numeric format, and matching requires converting these numbers into character strings and then extracting specific substrings. Even worse, the join condition contains an OR, meaning a match can occur in two different ways.
Here is what the bottleneck looks like in SQL:
PROC SQL noprint;
CREATE TABLE Resultat AS
SELECT DISTINCT
T1.ID_Parent_L,
T2.CodeCedant,
T2.ID_Client,
T1.Nom
FROM TABLE_PRINCIPALE T1
INNER JOIN TABLE_SECONDAIRE T2
/* La source du problème : conversions et condition OR */
ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
;
QUIT;
1
PROC SQL noprint;
2
CREATE TABLE Resultat AS
3
SELECT DISTINCT
4
T1.ID_Parent_L,
5
T2.CodeCedant,
6
T2.ID_Client,
7
T1.Nom
8
FROM TABLE_PRINCIPALE T1
9
INNER JOIN TABLE_SECONDAIRE T2
10
/* La source du problème : conversions et condition OR */
11
ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
12
OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
The OR clause: Using OR in a join often prevents the SQL optimizer from using indexes efficiently. This frequently forces the system to perform a partial or full "Cartesian product," comparing each row of table T1 with a large portion of the rows in T2.
Absence of unique keys: If the join keys are not unique in both tables, the complexity increases exponentially.
In this type of situation, a classic MERGE join in a Data step is also difficult to implement because it requires a preliminary sort on transformed keys. This is where hash objects come in.
The Solution: The Data Step and Hash Objects
Hash Objects allow you to load an entire table (usually the smaller of the two) into RAM. This creates an ultra-fast lookup structure where a match can be found instantly without needing to sort the data.
The strategy is to:
Prepare the lookup table (the smaller one) by creating a clean search key.
Load this table into a hash object in memory.
Read the large table row by row, calculate the search key on the fly, and query the hash object.
Step 1: Preparing the Lookup Data
Instead of performing complex calculations in the join, we prepare the secondary table (TABLE_SECONDAIRE) to contain a ready-to-use search key (KEY) corresponding to the part of the identifier we are looking for.
PROC SQL;
CREATE TABLE HashData AS
SELECT UNIQUE
CodeCedant,
ID_Client,
/* Création de la clé de jointure standardisée */
put(put(ID_Client,14.),$8.) AS KEY
FROM TABLE_SECONDAIRE;
QUIT;
Here is the code that replaces the complex SQL join.
DATA Resultat_Hash;
/* On lit la table principale ligne par ligne */
SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
/* Initialisation de l'objet de hachage (une seule fois au début) */
if _N_=1 then do;
/* Déclaration : on charge la table préparée en mémoire */
dcl hash H(dataset:'HashData');
/* Définition de la clé de recherche */
H.definekey('KEY');
/* Définition des données à récupérer en cas de succès */
H.definedata('CodeCedant','ID_Client');
H.definedone();
/* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
if 0 then set HashData;
end;
/* --- Cœur du traitement --- */
/* Tentative 1 : On calcule la clé potentielle selon la première logique */
RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
/* Si RC (Code Retour) = 0, la recherche a réussi.
Les variables CodeCedant et ID_Client sont remplies automatiquement. */
if RC=0 then output;
else do;
/* Optionnel : Si votre logique SQL initiale avait un "OR",
vous pouvez tenter une deuxième recherche ici si la première échoue.
Exemple :
RC = H.find(key: substr(put(ID_Parent_L,14.),7,8));
if RC=0 then output;
*/
end;
drop RC;
run;
1
DATA Resultat_Hash;
2
/* On lit la table principale ligne par ligne */
3
SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
4
5
/* Initialisation de l'objet de hachage (une seule fois au début) */
6
IF _N_=1THENDO;
7
/* Déclaration : on charge la table préparée en mémoire */
8
dcl hash H(dataset:'HashData');
9
/* Définition de la clé de recherche */
10
H.definekey('KEY');
11
/* Définition des données à récupérer en cas de succès */
12
H.definedata('CodeCedant','ID_Client');
13
H.definedone();
14
/* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
15
IF 0 THENSET HashData;
16
END;
17
18
/* --- Cœur du traitement --- */
19
20
/* Tentative 1 : On calcule la clé potentielle selon la première logique */
21
RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
22
23
/* Si RC (Code Retour) = 0, la recherche a réussi.
24
Les variables CodeCedant et ID_Client sont remplies automatiquement. */
25
IF RC=0 THENOUTPUT;
26
ELSEDO;
27
/* Optionnel : Si votre logique SQL initiale avait un "OR",
28
vous pouvez tenter une deuxième recherche ici si la première échoue.
The performance gain when switching from a complex SQL join with OR conditions and conversions to a solution based on Hash Objects is often spectacular. Processes that took hours or crashed the server can run in just a few seconds.
Los códigos y ejemplos proporcionados en WeAreCAS.eu son con fines educativos. Es imperativo no copiarlos y pegarlos ciegamente en sus entornos de producción. El mejor enfoque es comprender la lógica antes de aplicarla. Recomendamos encarecidamente probar estos scripts en un entorno de prueba (Sandbox/Dev). WeAreCAS no acepta ninguna responsabilidad por cualquier impacto o pérdida de datos en sus sistemas.
SAS y todos los demás nombres de productos o servicios de SAS Institute Inc. son marcas registradas o marcas comerciales de SAS Institute Inc. en los EE. UU. y otros países. ® indica registro en los EE. UU. WeAreCAS es un sitio comunitario independiente y no está afiliado a SAS Institute Inc.
Este sitio utiliza cookies técnicas y analíticas para mejorar su experiencia.
Saber más.