Proc SQL

When "Hash Objects" Save a Complex SQL Join

Simon 29/08/2023 4 Aufrufe

In the world of SAS^© programming, a debate persists: which is more performant, PROC SQL or the DATA step? The classic answer is "it depends." However, there are specific scenarios where SQL's performance literally collapses, going from a few minutes to several hours, or even crashing the session.

One of these critical scenarios occurs during joins involving complex conditions, particularly the use of OR combined with on-the-fly variable transformations in the ON clause.

This article explores how to transform an inefficient SQL query into an ultra-fast Data step using Hash Objects.

Inhaltsverzeichnis

The Problem: "Greedy" SQL Joins

Let's imagine the following scenario, which is common in banking or administrative data processing:

You need to join two medium-sized tables (for example, 500,000 rows and 300,000 rows). The challenge is that the join keys are not direct. The identifiers are stored in numeric format, and matching requires converting these numbers into character strings and then extracting specific substrings. Even worse, the join condition contains an OR, meaning a match can occur in two different ways.

Here is what the bottleneck looks like in SQL:

1	PROC SQL noprint;
2	CREATE TABLE Resultat AS
3	SELECT DISTINCT
4	T1.ID_Parent_L,
5	T2.CodeCedant,
6	T2.ID_Client,
7	T1.Nom
8	FROM TABLE_PRINCIPALE T1
9	INNER JOIN TABLE_SECONDAIRE T2
10	/* La source du problème : conversions et condition OR */
11	ON (substr(put(T1.ID_Parent_L,14.),7,8) = substr(put(T2.ID_Client,14.),1,8)
12	OR substr(put(T1.ID_Parent_L,14.),1,8) = substr(put(T2.ID_Client,14.),1,8))
13	;
14	QUIT;

Why is it so slow?

On-the-fly transformations (PUT and SUBSTR): SAS^© must perform these type conversions (numeric to character) and extractions for every potential combination of rows before it can compare them.
The OR clause: Using OR in a join often prevents the SQL optimizer from using indexes efficiently. This frequently forces the system to perform a partial or full "Cartesian product," comparing each row of table T1 with a large portion of the rows in T2.
Absence of unique keys: If the join keys are not unique in both tables, the complexity increases exponentially.

In this type of situation, a classic MERGE join in a Data step is also difficult to implement because it requires a preliminary sort on transformed keys. This is where hash objects come in.

The Solution: The Data Step and Hash Objects

Hash Objects allow you to load an entire table (usually the smaller of the two) into RAM. This creates an ultra-fast lookup structure where a match can be found instantly without needing to sort the data.

The strategy is to:

Prepare the lookup table (the smaller one) by creating a clean search key.
Load this table into a hash object in memory.
Read the large table row by row, calculate the search key on the fly, and query the hash object.

Step 1: Preparing the Lookup Data

Instead of performing complex calculations in the join, we prepare the secondary table (TABLE_SECONDAIRE) to contain a ready-to-use search key (KEY) corresponding to the part of the identifier we are looking for.

1	PROC SQL;
2	CREATE TABLE HashData AS
3	SELECT UNIQUE
4	CodeCedant,
5	ID_Client,
6	/* Création de la clé de jointure standardisée */
7	put(put(ID_Client,14.),$8.) AS KEY
8	FROM TABLE_SECONDAIRE;
9	QUIT;

Step 2: The Data Step with Hash Object

Here is the code that replaces the complex SQL join.

1	DATA Resultat_Hash;
2	/* On lit la table principale ligne par ligne */
3	SET TABLE_PRINCIPALE (keep=ID_Parent_L Nom);
4
5	/* Initialisation de l'objet de hachage (une seule fois au début) */
6	IF _N_=1 THEN DO;
7	/* Déclaration : on charge la table préparée en mémoire */
8	dcl hash H(dataset:'HashData');
9	/* Définition de la clé de recherche */
10	H.definekey('KEY');
11	/* Définition des données à récupérer en cas de succès */
12	H.definedata('CodeCedant','ID_Client');
13	H.definedone();
14	/* Astuce pour éviter de lire HashData ici, mais initialiser les variables */
15	IF 0 THEN SET HashData;
16	END;
17
18	/* --- Cœur du traitement --- */
19
20	/* Tentative 1 : On calcule la clé potentielle selon la première logique */
21	RC = H.find(key: put(put(ID_Parent_L,14.),$8.));
22
23	/* Si RC (Code Retour) = 0, la recherche a réussi.
24	Les variables CodeCedant et ID_Client sont remplies automatiquement. */
25	IF RC=0 THEN OUTPUT;
26	ELSE DO;
27	/* Optionnel : Si votre logique SQL initiale avait un "OR",
28	vous pouvez tenter une deuxième recherche ici si la première échoue.
29	Exemple :
30	RC = H.find(key: substr(put(ID_Parent_L,14.),7,8));
31	if RC=0 then output;
32	*/
33	END;
34
35	drop RC;
36	RUN;

The performance gain when switching from a complex SQL join with OR conditions and conversions to a solution based on Hash Objects is often spectacular. Processes that took hours or crashed the server can run in just a few seconds.

Although the syntax for hash objects is more technical, it is an essential tool in the arsenal of a SAS^© developer dealing with large data volumes and non-standard matching logic.

Wichtiger Haftungsausschluss

Die auf WeAreCAS.eu bereitgestellten Codes und Beispiele dienen Lehrzwecken. Es ist zwingend erforderlich, sie nicht blind in Ihre Produktionsumgebungen zu kopieren. Der beste Ansatz besteht darin, die Logik zu verstehen, bevor sie angewendet wird. Wir empfehlen dringend, diese Skripte in einer Testumgebung (Sandbox/Dev) zu testen. WeAreCAS übernimmt keine Verantwortung für mögliche Auswirkungen oder Datenverluste auf Ihren Systemen.

Zurück zur Artikelliste

Dans : Proc SQL

How to extract a "substring" from a numeric variable?

29/04/2021 • 4

How to filter a table from a list of identifiers with PROC SQL

26/02/2020 • 8

The "Many-to-Many" Trap (MERGE vs. SQL)

24/08/2019 • 4

Ganze Kategorie ansehen

Referenzen & Doku

DATA Step PROC SQL session PUT

SAS und alle anderen Produkt- oder Dienstleistungsnamen von SAS Institute Inc. sind eingetragene Marken oder Marken von SAS Institute Inc. in den USA und anderen Ländern. ® zeigt die Registrierung in den USA an. WeAreCAS ist eine unabhängige Community-Site und nicht mit SAS Institute Inc. verbunden.