One of these critical scenarios occurs during joins involving complex conditions, particularly the use of OR combined with on-the-fly variable transformations in the ON clause.
This article explores how to transform an inefficient SQL query into an ultra-fast Data step using Hash Objects.
Let's imagine the following scenario, which is common in banking or administrative data processing:
You need to join two medium-sized tables (for example, 500,000 rows and 300,000 rows). The challenge is that the join keys are not direct. The identifiers are stored in numeric format, and matching requires converting these numbers into character strings and then extracting specific substrings. Even worse, the join condition contains an OR, meaning a match can occur in two different ways.
Here is what the bottleneck looks like in SQL:
PROC SQL noprint;
CREATE TABLE Resultat AS
SELECT DISTINCT
T1.ID_Parent_L,
T2.CodeCedant,
T2.ID_Client,
T1.Nom
FROM TABLE_PRINCIPALE T1
INNER JOIN TABLE_SECONDAIRE T2
/* La source du problème : conversions et condition OR */
ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
;
QUIT;
1
PROC SQL noprint;
2
CREATE TABLE Resultat AS
3
SELECT DISTINCT
4
T1.ID_Parent_L,
5
T2.CodeCedant,
6
T2.ID_Client,
7
T1.Nom
8
FROM TABLE_PRINCIPALE T1
9
INNER JOIN TABLE_SECONDAIRE T2
10
/* La source du problème : conversions et condition OR */
11
ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
12
OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
The OR clause: Using OR in a join often prevents the SQL optimizer from using indexes efficiently. This frequently forces the system to perform a partial or full "Cartesian product," comparing each row of table T1 with a large portion of the rows in T2.
Absence of unique keys: If the join keys are not unique in both tables, the complexity increases exponentially.
In this type of situation, a classic MERGE join in a Data step is also difficult to implement because it requires a preliminary sort on transformed keys. This is where hash objects come in.
The Solution: The Data Step and Hash Objects
Hash Objects allow you to load an entire table (usually the smaller of the two) into RAM. This creates an ultra-fast lookup structure where a match can be found instantly without needing to sort the data.
The strategy is to:
Prepare the lookup table (the smaller one) by creating a clean search key.
Load this table into a hash object in memory.
Read the large table row by row, calculate the search key on the fly, and query the hash object.
Step 1: Preparing the Lookup Data
Instead of performing complex calculations in the join, we prepare the secondary table (TABLE_SECONDAIRE) to contain a ready-to-use search key (KEY) corresponding to the part of the identifier we are looking for.
PROC SQL;
CREATE TABLE HashData AS
SELECT UNIQUE
CodeCedant,
ID_Client,
/* Création de la clé de jointure standardisée */
put(put(ID_Client,14.),$8.) AS KEY
FROM TABLE_SECONDAIRE;
QUIT;
Here is the code that replaces the complex SQL join.
DATA Resultat_Hash;
/* On lit la table principale ligne par ligne */
SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
/* Initialisation de l'objet de hachage (une seule fois au début) */
if _N_=1 then do;
/* Déclaration : on charge la table préparée en mémoire */
dcl hash H(dataset:'HashData');
/* Définition de la clé de recherche */
H.definekey('KEY');
/* Définition des données à récupérer en cas de succès */
H.definedata('CodeCedant','ID_Client');
H.definedone();
/* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
if 0 then set HashData;
end;
/* --- Cœur du traitement --- */
/* Tentative 1 : On calcule la clé potentielle selon la première logique */
RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
/* Si RC (Code Retour) = 0, la recherche a réussi.
Les variables CodeCedant et ID_Client sont remplies automatiquement. */
if RC=0 then output;
else do;
/* Optionnel : Si votre logique SQL initiale avait un "OR",
vous pouvez tenter une deuxième recherche ici si la première échoue.
Exemple :
RC = H.find(key: substr(put(ID_Parent_L,14.),7,8));
if RC=0 then output;
*/
end;
drop RC;
run;
1
DATA Resultat_Hash;
2
/* On lit la table principale ligne par ligne */
3
SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
4
5
/* Initialisation de l'objet de hachage (une seule fois au début) */
6
IF _N_=1THENDO;
7
/* Déclaration : on charge la table préparée en mémoire */
8
dcl hash H(dataset:'HashData');
9
/* Définition de la clé de recherche */
10
H.definekey('KEY');
11
/* Définition des données à récupérer en cas de succès */
12
H.definedata('CodeCedant','ID_Client');
13
H.definedone();
14
/* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
15
IF 0 THENSET HashData;
16
END;
17
18
/* --- Cœur du traitement --- */
19
20
/* Tentative 1 : On calcule la clé potentielle selon la première logique */
21
RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
22
23
/* Si RC (Code Retour) = 0, la recherche a réussi.
24
Les variables CodeCedant et ID_Client sont remplies automatiquement. */
25
IF RC=0 THENOUTPUT;
26
ELSEDO;
27
/* Optionnel : Si votre logique SQL initiale avait un "OR",
28
vous pouvez tenter une deuxième recherche ici si la première échoue.
The performance gain when switching from a complex SQL join with OR conditions and conversions to a solution based on Hash Objects is often spectacular. Processes that took hours or crashed the server can run in just a few seconds.
The codes and examples provided on WeAreCAS.eu are for educational purposes. It is imperative not to blindly copy-paste them into your production environments. The best approach is to understand the logic before applying it. We strongly recommend testing these scripts in a test environment (Sandbox/Dev). WeAreCAS accepts no responsibility for any impact or data loss on your systems.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. WeAreCAS is an independent community site and is not affiliated with SAS Institute Inc.
This site uses technical and analytical cookies to improve your experience.
Read more.