SAS9

How to join two tables without a common key

Simon 11 vistas

It often happens in SAS© programming that you need to join two datasets. Usually, we use a MERGE statement in a DATA step or a LEFT/INNER JOIN in SQL via a common key (for example, a customer ID).

But how do you do it when the two tables have no variables in common and the join depends on complex conditions, like values falling within ranges? This is called a Non-Equi Join (Non-Equi Join).

Let's look at a practical case from a user discussion to understand how to solve this problem elegantly with PROC SQL.

The Misleading Good Idea: The Loop in the DATA Step

Faced with this problem, the first intuition is often to try an iterative approach using a DATA step. The idea would be to read a row from the accounts table, then loop through the entire intervals table to find a match.

Although technically possible, this approach is complex to code in SAS©:

  • It requires manipulating two read pointers.

  • You have to manage resetting the read of the intervals table for each account (which might require the POINT= option or multiple SET statements).

  • The code quickly becomes cumbersome and difficult to maintain.

As one contributor in the discussion points out: "I wouldn't waste time on a loop methodology."

The Elegant Solution: PROC SQL and the Cartesian Product


The most effective and readable solution lies in using PROC SQL. Unlike the DATA step, SQL is designed to naturally handle complex relationships between sets without worrying about sort order or explicit loops.

The trick is to perform a cartesian join (all rows from A combined with all rows from B) and immediately filter the result.


Note :
Here is how to solve the problem in a few lines:
1PROC SQL;
2 CREATE TABLE DATA3 AS
3 SELECT
4 t1.grpnm,
5 t2.acct,
6 t2.bal1,
7 t2.bal2
8 FROM
9 intervals t1, /* Table 1 */
10 accounts t2 /* Table 2 */
11 WHERE
12 /* Condition pour le premier solde */
13 t2.bal1 BETWEEN t1.min1 AND t1.max1
14 AND
15 /* Condition pour le second solde */
16 t2.bal2 BETWEEN t1.min2 AND t1.max2;
17QUIT;
Why does this work?
FROM data1, data2: By listing the two tables separated by a comma without an explicit JOIN clause (like LEFT JOIN), SAS© implicitly prepares a Cartesian product. It virtually compares every row from the intervals table with every row from the accounts table.

WHERE ... BETWEEN: This is where the magic happens. Instead of using an equality (ON t1.id = t2.id), we use logical operators (BETWEEN, >, <). SQL only keeps the combinations where the numbers "fall" into the correct ranges.

If you find yourself stuck trying to create a complex loop in a DATA step because you don't have a common key: think SQL.

Don't get locked into the idea that a join requires an identical column. As long as you can define a logical rule (like "being between X and Y"), PROC SQL can do the job much more concisely.