One of these critical scenarios occurs during joins involving complex conditions, particularly the use of OR combined with on-the-fly variable transformations in the ON clause.
This article explores how to transform an inefficient SQL query into an ultra-fast Data step using Hash Objects.
Let's imagine the following scenario, which is common in banking or administrative data processing:
You need to join two medium-sized tables (for example, 500,000 rows and 300,000 rows). The challenge is that the join keys are not direct. The identifiers are stored in numeric format, and matching requires converting these numbers into character strings and then extracting specific substrings. Even worse, the join condition contains an OR, meaning a match can occur in two different ways.
Here is what the bottleneck looks like in SQL:
PROC SQL noprint;
CREATE TABLE Resultat AS
SELECT DISTINCT
T1.ID_Parent_L,
T2.CodeCedant,
T2.ID_Client,
T1.Nom
FROM TABLE_PRINCIPALE T1
INNER JOIN TABLE_SECONDAIRE T2
/* La source du problème : conversions et condition OR */
ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
;
QUIT;
1
PROC SQL noprint;
2
CREATE TABLE Resultat AS
3
SELECT DISTINCT
4
T1.ID_Parent_L,
5
T2.CodeCedant,
6
T2.ID_Client,
7
T1.Nom
8
FROM TABLE_PRINCIPALE T1
9
INNER JOIN TABLE_SECONDAIRE T2
10
/* La source du problème : conversions et condition OR */
11
ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
12
OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
The OR clause: Using OR in a join often prevents the SQL optimizer from using indexes efficiently. This frequently forces the system to perform a partial or full "Cartesian product," comparing each row of table T1 with a large portion of the rows in T2.
Absence of unique keys: If the join keys are not unique in both tables, the complexity increases exponentially.
In this type of situation, a classic MERGE join in a Data step is also difficult to implement because it requires a preliminary sort on transformed keys. This is where hash objects come in.
The Solution: The Data Step and Hash Objects
Hash Objects allow you to load an entire table (usually the smaller of the two) into RAM. This creates an ultra-fast lookup structure where a match can be found instantly without needing to sort the data.
The strategy is to:
Prepare the lookup table (the smaller one) by creating a clean search key.
Load this table into a hash object in memory.
Read the large table row by row, calculate the search key on the fly, and query the hash object.
Step 1: Preparing the Lookup Data
Instead of performing complex calculations in the join, we prepare the secondary table (TABLE_SECONDAIRE) to contain a ready-to-use search key (KEY) corresponding to the part of the identifier we are looking for.
PROC SQL;
CREATE TABLE HashData AS
SELECT UNIQUE
CodeCedant,
ID_Client,
/* Création de la clé de jointure standardisée */
put(put(ID_Client,14.),$8.) AS KEY
FROM TABLE_SECONDAIRE;
QUIT;
Here is the code that replaces the complex SQL join.
DATA Resultat_Hash;
/* On lit la table principale ligne par ligne */
SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
/* Initialisation de l'objet de hachage (une seule fois au début) */
if _N_=1 then do;
/* Déclaration : on charge la table préparée en mémoire */
dcl hash H(dataset:'HashData');
/* Définition de la clé de recherche */
H.definekey('KEY');
/* Définition des données à récupérer en cas de succès */
H.definedata('CodeCedant','ID_Client');
H.definedone();
/* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
if 0 then set HashData;
end;
/* --- Cœur du traitement --- */
/* Tentative 1 : On calcule la clé potentielle selon la première logique */
RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
/* Si RC (Code Retour) = 0, la recherche a réussi.
Les variables CodeCedant et ID_Client sont remplies automatiquement. */
if RC=0 then output;
else do;
/* Optionnel : Si votre logique SQL initiale avait un "OR",
vous pouvez tenter une deuxième recherche ici si la première échoue.
Exemple :
RC = H.find(key: substr(put(ID_Parent_L,14.),7,8));
if RC=0 then output;
*/
end;
drop RC;
run;
1
DATA Resultat_Hash;
2
/* On lit la table principale ligne par ligne */
3
SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
4
5
/* Initialisation de l'objet de hachage (une seule fois au début) */
6
IF _N_=1THENDO;
7
/* Déclaration : on charge la table préparée en mémoire */
8
dcl hash H(dataset:'HashData');
9
/* Définition de la clé de recherche */
10
H.definekey('KEY');
11
/* Définition des données à récupérer en cas de succès */
12
H.definedata('CodeCedant','ID_Client');
13
H.definedone();
14
/* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
15
IF 0 THENSET HashData;
16
END;
17
18
/* --- Cœur du traitement --- */
19
20
/* Tentative 1 : On calcule la clé potentielle selon la première logique */
21
RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
22
23
/* Si RC (Code Retour) = 0, la recherche a réussi.
24
Les variables CodeCedant et ID_Client sont remplies automatiquement. */
25
IF RC=0 THENOUTPUT;
26
ELSEDO;
27
/* Optionnel : Si votre logique SQL initiale avait un "OR",
28
vous pouvez tenter une deuxième recherche ici si la première échoue.
The performance gain when switching from a complex SQL join with OR conditions and conversions to a solution based on Hash Objects is often spectacular. Processes that took hours or crashed the server can run in just a few seconds.
Die auf WeAreCAS.eu bereitgestellten Codes und Beispiele dienen Lehrzwecken. Es ist zwingend erforderlich, sie nicht blind in Ihre Produktionsumgebungen zu kopieren. Der beste Ansatz besteht darin, die Logik zu verstehen, bevor sie angewendet wird. Wir empfehlen dringend, diese Skripte in einer Testumgebung (Sandbox/Dev) zu testen. WeAreCAS übernimmt keine Verantwortung für mögliche Auswirkungen oder Datenverluste auf Ihren Systemen.
SAS und alle anderen Produkt- oder Dienstleistungsnamen von SAS Institute Inc. sind eingetragene Marken oder Marken von SAS Institute Inc. in den USA und anderen Ländern. ® zeigt die Registrierung in den USA an. WeAreCAS ist eine unabhängige Community-Site und nicht mit SAS Institute Inc. verbunden.
Diese Website verwendet technische und analytische Cookies, um Ihre Erfahrung zu verbessern.
Mehr erfahren.