Proc SQL

When "Hash Objects" Save a Complex SQL Join

Simon 29/08/2023 4 Aufrufe

In the world of SAS© programming, a debate persists: which is more performant, PROC SQL or the DATA step? The classic answer is "it depends." However, there are specific scenarios where SQL's performance literally collapses, going from a few minutes to several hours, or even crashing the session.

One of these critical scenarios occurs during joins involving complex conditions, particularly the use of OR combined with on-the-fly variable transformations in the ON clause.

This article explores how to transform an inefficient SQL query into an ultra-fast Data step using Hash Objects.

Illustration

The Problem: "Greedy" SQL Joins

Let's imagine the following scenario, which is common in banking or administrative data processing:

You need to join two medium-sized tables (for example, 500,000 rows and 300,000 rows). The challenge is that the join keys are not direct. The identifiers are stored in numeric format, and matching requires converting these numbers into character strings and then extracting specific substrings. Even worse, the join condition contains an OR, meaning a match can occur in two different ways.

Here is what the bottleneck looks like in SQL:

1PROC SQL noprint;
2 CREATE TABLE Resultat AS
3 SELECT DISTINCT
4 T1.ID_Parent_L,
5 T2.CodeCedant,
6 T2.ID_Client,
7 T1.Nom
8 FROM TABLE_PRINCIPALE T1
9 INNER JOIN TABLE_SECONDAIRE T2
10 /* La source du problème : conversions et condition OR */
11 ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
12 OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
13 ;
14QUIT;

Why is it so slow?

  1. On-the-fly transformations (PUT and SUBSTR): SAS© must perform these type conversions (numeric to character) and extractions for every potential combination of rows before it can compare them.

  2. The OR clause: Using OR in a join often prevents the SQL optimizer from using indexes efficiently. This frequently forces the system to perform a partial or full "Cartesian product," comparing each row of table T1 with a large portion of the rows in T2.

  3. Absence of unique keys: If the join keys are not unique in both tables, the complexity increases exponentially.

In this type of situation, a classic MERGE join in a Data step is also difficult to implement because it requires a preliminary sort on transformed keys. This is where hash objects come in.

The Solution: The Data Step and Hash Objects

Hash Objects allow you to load an entire table (usually the smaller of the two) into RAM. This creates an ultra-fast lookup structure where a match can be found instantly without needing to sort the data.

The strategy is to:

  1. Prepare the lookup table (the smaller one) by creating a clean search key.

  2. Load this table into a hash object in memory.

  3. Read the large table row by row, calculate the search key on the fly, and query the hash object.

Step 1: Preparing the Lookup Data

Instead of performing complex calculations in the join, we prepare the secondary table (TABLE_SECONDAIRE) to contain a ready-to-use search key (KEY) corresponding to the part of the identifier we are looking for.

1PROC SQL;
2 CREATE TABLE HashData AS
3 SELECT UNIQUE
4 CodeCedant,
5 ID_Client,
6 /* Création de la clé de jointure standardisée */
7 put(put(ID_Client,14.),$8.) AS KEY
8 FROM TABLE_SECONDAIRE;
9QUIT;

Step 2: The Data Step with Hash Object

Here is the code that replaces the complex SQL join.

1DATA Resultat_Hash;
2 /* On lit la table principale ligne par ligne */
3 SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
4 
5 /* Initialisation de l'objet de hachage (une seule fois au début) */
6 IF _N_=1 THEN DO;
7 /* Déclaration : on charge la table préparée en mémoire */
8 dcl hash H(dataset:'HashData');
9 /* Définition de la clé de recherche */
10 H.definekey('KEY');
11 /* Définition des données à récupérer en cas de succès */
12 H.definedata('CodeCedant','ID_Client');
13 H.definedone();
14 /* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
15 IF 0 THEN SET HashData;
16 END;
17 
18 /* --- Cœur du traitement --- */
19 
20 /* Tentative 1 : On calcule la clé potentielle selon la première logique */
21 RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
22 
23 /* Si RC (Code Retour) = 0, la recherche a réussi.
24 Les variables CodeCedant et ID_Client sont remplies automatiquement. */
25 IF RC=0 THEN OUTPUT;
26 ELSE DO;
27 /* Optionnel : Si votre logique SQL initiale avait un "OR",
28 vous pouvez tenter une deuxième recherche ici si la première échoue.
29 Exemple :
30 RC = H.find(key: substr(put(ID_Parent_L,14.),7,8));
31 if RC=0 then output;
32 */
33 END;
34 
35 drop RC;
36RUN;

The performance gain when switching from a complex SQL join with OR conditions and conversions to a solution based on Hash Objects is often spectacular. Processes that took hours or crashed the server can run in just a few seconds.

Although the syntax for hash objects is more technical, it is an essential tool in the arsenal of a SAS© developer dealing with large data volumes and non-standard matching logic.